On Sat, Oct 3, 2009 at 1:07 AM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Related (but not helping the immediate question).  China Telecom developed 
> something they call HyperDFS.  They modified Hadoop and made it possible to 
> run a cluster of NNs, thus eliminating the SPOF.
>
> I don't have the details - the presenter at Hadoop World (last round of 
> sessions, 2nd floor) mentioned that.  Didn't give a clear answer when asked 
> about contributing it back.
>
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Steve Loughran <ste...@apache.org>
>> To: common-user@hadoop.apache.org
>> Sent: Friday, October 2, 2009 7:22:45 AM
>> Subject: Re: NameNode high availability
>>
>> Stas Oskin wrote:
>> > Hi.
>> >
>> > The HA service (heartbeat) is running on Dom0, and when the primary
>> > node is down, it basically just starts the VM on the other node. So
>> > there not supposed to be any time issues.
>> >
>> > Can you explain a bit more about your approach, how to automate it for
>> example?
>>
>> * You need to have something " a resource manager" keeping an eye on the NN 
>> from
>> somewhere. Needless to say, that needs to be fairly HA too.
>>
>> * your NN image has to be ready to go
>>
>> * when the deployed NA goes away, bring up a new machine with the same image,
>> hostname *and IP Address*. You can't always pull the latter off, it depends 
>> on
>> the infrastructure. Without that, you'd need to bring up all the nodes with 
>> DNS
>> caching set to a short time and update a DNS entry.
>>
>> This isn't real HA, its recovery.
>
>

Stas,

I think your setup does work, but there are some inherent complexities
that are not accounted for. For instance, NameNode metadata is not
written to disk in a transactional fashion. Thus, even though you have
block-level replication, you cannot be sure that the underlying
NameNode data is in a consistent state. (This should not be an issue
with live migration, though.)
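
To make that concrete, here is a minimal sketch of the kind of sanity
check I mean: checksum the files in both copies of the metadata
directory before trusting a failover. The paths are hypothetical, and
it assumes both copies are quiesced while you compare them.

# nn_meta_check.py -- hedged sketch: compare two copies of dfs.name.dir
import hashlib
import os
import sys

def dir_checksums(root):
    """Return {relative_path: md5} for every file under root."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.md5()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

if __name__ == '__main__':
    # e.g. python nn_meta_check.py /data/dfs/name /mnt/replica/dfs/name
    a, b = dir_checksums(sys.argv[1]), dir_checksums(sys.argv[2])
    if a == b:
        print('metadata copies match')
    else:
        print('MISMATCH -- do not fail over blindly')
        sys.exit(1)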

I have worked with Linux-HA, DRBD, OCFS2, and many other HA
technologies, so let me summarize my experience. The concept is that a
normal standalone system has components that fail, but the mean time
to failure is high: say 99.9% uptime, with disks normally failing most
often; with solid state or RAID1, say 99.99%. If you look really hard
at what the NameNode does, even a massive Hadoop file system may be
only a few hundred MB to a few GB of NameNode data. Assuming you have
a hot spare, restoring a few GB of data would not take very long.
(Large volumes of data are tricky because they take longer to
restore.)
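
For example, a dumb hot-spare scheme can be as simple as taking
timestamped copies of the metadata directory on a schedule. A sketch
only: the paths are made up, and you would want the copy taken from a
quiesced NN or from a SecondaryNameNode checkpoint.

# nn_meta_backup.py -- hedged sketch: rolling copies of dfs.name.dir
import os
import shutil
import time

NAME_DIR = '/data/dfs/name'        # hypothetical dfs.name.dir
BACKUP_ROOT = '/spare/nn-backups'  # hypothetical spare-node mount
KEEP = 24                          # keep the last 24 copies

def backup_once():
    dest = os.path.join(BACKUP_ROOT, time.strftime('%Y%m%d-%H%M%S'))
    shutil.copytree(NAME_DIR, dest)  # a few GB copies quickly on a LAN
    for old in sorted(os.listdir(BACKUP_ROOT))[:-KEEP]:
        shutil.rmtree(os.path.join(BACKUP_ROOT, old))

if __name__ == '__main__':
    backup_once()  # run from cron, say every hour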

If Xen live migration works, it is a slam dunk and very cool! You
might not even miss a ping. But let's say you're NOT doing a live
migration: shut down the Xen instance and bring it up on the other
node. The failover might take 15 seconds, but it may work more
reliably.
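
The cold case needs very little "failover" logic. Here is a rough
sketch; the hostnames, domain name, config path, and NN port are all
assumptions, and xm is the standard Xen 3 management tool.

# nn_cold_failover.py -- hedged sketch: cold failover of a NameNode domU
import socket
import subprocess
import sys

PRIMARY_HOST = 'xen1'               # hypothetical dom0 hosts
STANDBY_HOST = 'xen2'
NN_ADDR = ('namenode', 8020)        # hypothetical NN host and RPC port
DOMU_CFG = '/etc/xen/namenode.cfg'  # hypothetical domU config

def nn_alive(addr, timeout=5):
    """Crude liveness test: can we open a TCP connection to the NN?"""
    try:
        socket.create_connection(addr, timeout).close()
        return True
    except socket.error:
        return False

if __name__ == '__main__':
    if nn_alive(NN_ADDR):
        sys.exit(0)  # primary is fine, nothing to do
    # Make sure the old domU is really gone before starting the new one,
    # or two NNs may scribble on the same (DRBD-replicated) metadata.
    subprocess.call(['ssh', PRIMARY_HOST, 'xm', 'destroy', 'namenode'])
    subprocess.call(['ssh', STANDBY_HOST, 'xm', 'create', DOMU_CFG])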

>From experience, I will tell you on a different project I went way
cutting edge DRBD, Linux-HA, OCFS2 for multiple mounting the same file
system simultaneously on two nodes. It worked ok, but I sunk weeks of
research into it and it was very complex. OCFS2 conf files, HA Conf
Files, DRDB Conf files, kernel modules, and no matter how much docs I
wrote no one could follow it but me.

My lesson learned was that I really did not need that sub-second
failover, nor did I need to be able to dual-mount the drive. With
those things stripped out I had better results. So you might want to
consider: do you need live migration? Do you need Xen?

This doc takes an approach with fewer moving parts:
http://www.cloudera.com/blog/2009/07/22/hadoop-ha-configuration/
ContextWeb presented at Hadoop World NYC; they mentioned that with
this setup they had 6 failures, 3 planned and 3 unplanned, and the
failover worked each time.
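
With a Heartbeat-style setup like that, the cluster manager mostly
just needs an init-style start/stop/status script for the NameNode.
A rough sketch of the idea; HADOOP_HOME and the crude status check
are my assumptions, not from the doc.

# hadoop-namenode-ra.py -- hedged sketch: LSB-style wrapper Heartbeat
# can call as a resource script.
import subprocess
import sys

HADOOP_HOME = '/usr/lib/hadoop'  # hypothetical install location
DAEMON = HADOOP_HOME + '/bin/hadoop-daemon.sh'

def run(action):
    if action in ('start', 'stop'):
        return subprocess.call([DAEMON, action, 'namenode'])
    if action == 'status':
        # crude: look for the NameNode JVM; real agents check a pid file
        return subprocess.call('ps ax | grep -v grep | grep -q NameNode',
                               shell=True)
    sys.stderr.write('usage: %s start|stop|status\n' % sys.argv[0])
    return 2

if __name__ == '__main__':
    sys.exit(run(sys.argv[1] if len(sys.argv) > 1 else ''))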

FYI: I did imply earlier that the DRBD + Xen approach should work.
Sorry for being misleading. I find people have mixed results with some
of these HA tools: I have them working in some cases, and then in
other edge cases they do not. Your mileage may vary.
