On Sat, Oct 3, 2009 at 1:07 AM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Related (but not helping the immediate question). China Telecom developed
> something they call HyperDFS. They modified Hadoop and made it possible to
> run a cluster of NNs, thus eliminating the SPOF.
>
> I don't have the details - the presenter at Hadoop World (last round of
> sessions, 2nd floor) mentioned that. Didn't give a clear answer when asked
> about contributing it back.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
> ----- Original Message ----
>> From: Steve Loughran <ste...@apache.org>
>> To: common-user@hadoop.apache.org
>> Sent: Friday, October 2, 2009 7:22:45 AM
>> Subject: Re: NameNode high availability
>>
>> Stas Oskin wrote:
>> > Hi.
>> >
>> > The HA service (heartbeat) is running on Dom0, and when the primary
>> > node is down, it basically just starts the VM on the other node. So
>> > there not supposed to be any time issues.
>> >
>> > Can you explain a bit more about your approach, how to automate it for
>> > example?
>>
>> * You need to have something "a resource manager" keeping an eye on the NN
>> from somewhere. Needless to say, that needs to be fairly HA too.
>>
>> * your NN image has to be ready to go
>>
>> * when the deployed NA goes away, bring up a new machine with the same image,
>> hostname *and IP Address*. You can't always pull the latter off, it depends
>> on the infrastructure. Without that, you'd need to bring up all the nodes
>> with DNS caching set to a short time and update a DNS entry.
>>
>> This isn't real HA, its recovery.
Stas, I think your setup does work, but there are some inherent complexities that are not accounted for. For instance, NameNode metadata is not written to disk in a transactional fashion. Thus, even though you have block-level replication, you can not be sure that the underlying NameNode data is in a consistent state. (This should not be an issue with live migration, though.)

I have worked with Linux-HA, DRBD, OCFS2, and many other HA technologies, so let me summarize my experiences. The basic concept: a normal standalone system has components that fail, but the mean time to failure is high, say 99.9% availability. Disk components normally fail most often; with solid state or a RAID1, call it 99.99%.

If you look really hard at what the NameNode does, a massive Hadoop file system may amount to only a few hundred MB to a few GB of NameNode metadata. Assuming you have a hot spare, restoring a few GB of data would not take very long. (Large volumes of data are trickier because they take longer to restore.)

If Xen LiveMigrate works, it is a slam dunk and very cool! You might not even miss a ping. But let's say you're NOT doing a live migration: shut down the Xen instance and bring it up on the other node. The failover might take 15 seconds, but it may work more reliably.

From experience, I will tell you that on a different project I went way cutting edge: DRBD, Linux-HA, and OCFS2 for mounting the same file system simultaneously on two nodes. It worked OK, but I sunk weeks of research into it and it was very complex: OCFS2 conf files, HA conf files, DRBD conf files, kernel modules, and no matter how much documentation I wrote, no one could follow it but me. My lesson learned was that I really did not need that sub-second failover, nor did I need to be able to dual-mount the drive. With those things stripped out I had better results. So you might want to consider: do you need live-migrate? Do you need Xen?
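To put the "restoring a few GB would not take very long" point in perspective, here is a rough back-of-envelope sketch. The metadata size and throughput figures are illustrative assumptions, not measurements from any real cluster:

```python
# Back-of-envelope estimate of how long it takes to copy NameNode
# metadata (fsimage + edits) to a hot spare.
# All numbers below are illustrative assumptions.

def restore_seconds(metadata_gb, throughput_mb_per_s):
    """Seconds to transfer metadata_gb at a sustained throughput."""
    return (metadata_gb * 1024.0) / throughput_mb_per_s

# e.g. 2 GB of NameNode metadata over a link sustaining 50 MB/s
print(round(restore_seconds(2, 50)))  # roughly 41 seconds
```

So even a generously sized NameNode image copies in well under a minute, which is why a warm-spare recovery scheme can be good enough without sub-second failover machinery.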
This doc uses fewer moving parts: http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/ ContextWeb presented at Hadoop World NYC; they mentioned that with this setup they had 6 failures, 3 planned and 3 unplanned, and it worked each time. FYI - I did imply earlier that the DRBD/Xend approach should work. Sorry for being misleading. I find people have mixed results with some of these HA tools; I have them working in some cases, and then in other edge cases they do not. Your mileage may vary.
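For reference, that stripped-down DRBD + Heartbeat approach (no OCFS2, no dual mounts) boils down to roughly this shape of a Heartbeat v1 resource line. The hostnames, device names, mount point, IP, and the NameNode init script name here are placeholders of my own, not taken from the Cloudera article:

```
# /etc/ha.d/haresources -- illustrative sketch only; all names are placeholders.
# On failover, Heartbeat promotes the DRBD resource, mounts the filesystem,
# takes over the service IP, and starts the NameNode on the standby node.
nn-primary drbddisk::r0 \
           Filesystem::/dev/drbd0::/hadoop/name::ext3 \
           IPaddr::192.168.1.100/24 \
           hadoop-namenode
```

Everything here is a standard Heartbeat v1 resource agent except `hadoop-namenode`, which stands in for whatever init script starts the NameNode on your boxes. One writable mount, one floating IP, one process to start; that is the whole failover story, and it is a lot easier to document and hand off than the OCFS2 stack.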