I'm currently running your option B setup and so far it has been reliable for me. I use a combination of DRBD and various Heartbeat/Linux-HA scripts that handle the failover process, including a virtual IP for the namenode. I haven't had any real-world unexpected failures to deal with yet, but all of my manual failover testing has given consistent and reliable results.
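In case it helps, here is a rough sketch of the kind of configuration I mean. The hostnames, IPs, device names, mount point, and the "hadoop-namenode" init script are just placeholders for illustration, so adjust everything for your own environment:

    # /etc/drbd.conf -- mirror the namenode metadata partition between two boxes
    resource r0 {
      protocol C;                      # synchronous replication
      on namenode1 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.1:7788;
        meta-disk internal;
      }
      on namenode2 {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   192.168.1.2:7788;
        meta-disk internal;
      }
    }

    # /etc/ha.d/haresources -- Heartbeat (v1 style) starts these on the active node,
    # left to right: promote DRBD, mount it, claim the virtual IP, start the namenode.
    # drbddisk, Filesystem and IPaddr are standard resource scripts; hadoop-namenode
    # is a placeholder for whatever init script you use to start the namenode.
    namenode1 drbddisk::r0 \
              Filesystem::/dev/drbd0::/data/dfs/name::ext3 \
              IPaddr::192.168.1.10/24/eth0 \
              hadoop-namenode

    # hadoop-site.xml on every node -- point clients and datanodes at a hostname that
    # resolves to the virtual IP, and keep the namenode image/edits on the DRBD mount.
    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-vip:9000</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/data/dfs/name</value>
    </property>

The main idea is that only the active node has the DRBD device promoted and mounted, so the standby can take over the same metadata and the same virtual IP when Heartbeat detects a failure, and the rest of the cluster never needs to be reconfigured.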
-paul

On Tue, Jul 29, 2008 at 1:54 PM, Ryan Shih <[EMAIL PROTECTED]> wrote:
> Dear Hadoop Community --
>
> I am wondering if it is already possible, or in the plans, to add support
> for multiple master nodes. I'm in a situation where my master node may be
> running in a less than ideal execution and networking environment. For
> that reason, the master node could die at any time, yet the application
> must always be available. I have other machines available to me, but I'm
> still unclear on the best way to add reliability.
>
> Here are a few options that I'm exploring:
> a) Create a completely separate secondary Hadoop cluster that we can flip
> to when we detect that the master node has died. This doubles the hardware
> cost: if we originally have a 5 node cluster, we would need to find 5 more
> machines from somewhere. This is not the preferable choice.
> b) Mirror the master node with other always-available software, such as
> DRBD for real-time synchronization. On detecting a failure we could swap
> to the alternate node.
> c) If Hadoop already had some functionality in place for this, it would be
> fantastic to take advantage of it. I don't know if anything like this
> exists, but I could not find anything so far. It seems to me, however,
> that having multiple master nodes is the direction Hadoop needs to go if
> it is to be useful in high-availability applications. I was told there are
> some papers on Amazon's Elastic Computing that follow this approach, which
> I'm about to look for.
>
> In any case, could someone with experience in solving this type of problem
> share how they approached it?
>
> Thanks!