Related (though it doesn't help with the immediate question): China Telecom developed something they call HyperDFS. They modified Hadoop to make it possible to run a cluster of NNs, thus eliminating the SPOF.
I don't have the details. The presenter at Hadoop World (last round of sessions, 2nd floor) mentioned it, but didn't give a clear answer when asked about contributing it back.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

----- Original Message ----
> From: Steve Loughran <ste...@apache.org>
> To: common-user@hadoop.apache.org
> Sent: Friday, October 2, 2009 7:22:45 AM
> Subject: Re: NameNode high availability
>
> Stas Oskin wrote:
> > Hi.
> >
> > The HA service (heartbeat) is running on Dom0, and when the primary
> > node is down, it basically just starts the VM on the other node. So
> > there are not supposed to be any time issues.
> >
> > Can you explain a bit more about your approach, how to automate it for
> > example?
>
> * You need to have something, "a resource manager", keeping an eye on the NN
> from somewhere. Needless to say, that needs to be fairly HA too.
>
> * Your NN image has to be ready to go.
>
> * When the deployed NN goes away, bring up a new machine with the same image,
> hostname *and IP address*. You can't always pull the latter off; it depends
> on the infrastructure. Without that, you'd need to bring up all the nodes with
> DNS caching set to a short time and update a DNS entry.
>
> This isn't real HA, it's recovery.
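
For what it's worth, the "something keeping an eye on the NN" part of Steve's list could be sketched roughly as below. This is only a minimal illustration, not anything Hadoop ships: the names `namenode_is_up` and `watch` are made up for this example, the probe is just a plain TCP connect to whatever host and port your NN listens on, and `recover` is a hypothetical hook where you would plug in your own "bring up a machine with the same image, hostname and IP" step. The watchdog itself would also need to be fairly HA, which this sketch does not address.

```python
import socket
import time
from typing import Callable, Optional


def namenode_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This only shows that something is listening; it does not prove the
    NameNode is healthy. It is a stand-in for a real health check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(
    probe: Callable[[], bool],
    recover: Callable[[], None],
    max_failures: int = 3,
    interval: float = 5.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `probe`; after `max_failures` consecutive failures call `recover`.

    `recover` is a hypothetical hook (e.g. start the standby VM from the
    prepared NN image). Returns how many times recovery was triggered.
    `max_checks=None` means run forever.
    """
    failures = 0
    recoveries = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0  # a healthy check resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                recover()
                recoveries += 1
                failures = 0  # start counting again after recovery
        sleep(interval)
    return recoveries
```

You would call it with something like `watch(lambda: namenode_is_up(nn_host, nn_port), start_replacement_nn)`, where `nn_host`, `nn_port` and `start_replacement_nn` are placeholders for your own setup. Requiring several consecutive failures before acting is a deliberate choice here, so that a single dropped connection does not trigger a failover. As Steve says, this is recovery, not real HA.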