Hi Jeff,

Check out HDFS-3077. We'll probably need the most help when it comes time to
do testing. Any testing you can do on the current HA solution, non-ideal as
it may be, is also immensely valuable. For example, if you can reproduce the
case where the NN didn't exit upon loss of shared edits, that would be a real
find: a bug like that would likely hit the quorum-based solution as well.
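For a rough idea of where that work is headed: the quorum design replaces the
NFS path in dfs.namenode.shared.edits.dir with a qjournal:// URI listing
several journal daemons. This is only a sketch based on the current direction
of HDFS-3077; the hostnames and "mycluster" ID below are placeholders, and the
property names could still change before it ships:

  <!-- hdfs-site.xml: shared edits served by a quorum of journal daemons.
       jn1/jn2/jn3 and "mycluster" are placeholder names. -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <property>
    <!-- local disk on each journal daemon where it keeps its copy of the edits -->
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/jn/edits</value>
  </property>

The upshot is that an edit is acknowledged once a majority of the journal
daemons have it, so no single box is a point of failure.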
Thanks
-Todd

On Tue, May 8, 2012 at 4:20 PM, Jeff Whiting <je...@qualtrics.com> wrote:
> Thanks for being patient and listening to my rants. I'm excited to see HDFS
> continue to move forward. If the organization I'm working for were willing
> to spend some resources to help speed this process up, where should we start
> looking? I'm sure there are quite a few JIRAs on these issues.
>
> Most of what we've done with the Hadoop ecosystem has been ZooKeeper and
> HBase related.
>
> Thanks,
> ~Jeff
>
>
> On 5/8/2012 2:46 PM, Todd Lipcon wrote:
>>
>> On Tue, May 8, 2012 at 12:38 PM, Jeff Whiting <je...@qualtrics.com> wrote:
>>>
>>> It seems the NN was originally written with the assumption that disks
>>> fail and stuff happens. Hence the ability to have multiple directories
>>> store your NN data, even though each directory is most likely already
>>> redundant / HA.
>>>
>>> [start rant]
>>>
>>> My opinion is that it is a step backwards that the shared edits dir
>>> wasn't written with the same assumptions. If any one problem can take
>>> out your cluster, then it isn't HA. Allowing a single NFS failure to
>>> take down your cluster and saying "make NFS HA" just seems to move the
>>> HA problem, not solve it. I would expect a true HA solution to be
>>> completely self-contained within the Hadoop ecosystem. All machines
>>> fail... eventually, and that needs to be planned for. At a minimum, a
>>> failure of the shared edits dir should only disable failover and
>>> provide a recovery mechanism; ideally, the NN would have been rewritten
>>> as a cluster (similar to ZooKeeper or Ceph) to enable HA.
>>>
>>> [end rant]
>>
>> Like I said earlier in the thread, work is already under way on this
>> and should be complete within a number of months.
>>
>> In many practical deployments, what we have already can provide
>> complete HA. In others, like the AWS example you mentioned, we need a
>> bit more, and we're working on it. Hang on a bit longer and it will be
>> good to go.
>>
>> -Todd
>>
>>> Sorry for the rant. I just really want to see HDFS become a complete HA
>>> system without caveats.
>>>
>>> ~Jeff
>>>
>>>
>>> On 5/8/2012 11:44 AM, Todd Lipcon wrote:
>>>>
>>>> On Tue, May 8, 2012 at 10:33 AM, Nathaniel Cook
>>>> <nathani...@qualtrics.com> wrote:
>>>>>
>>>>> We ran the initializeSharedEdits command and it didn't have any
>>>>> effect, but that may be because of the weird state we got it in.
>>>>>
>>>>> So help me understand: I was under the assumption that if shared
>>>>> edits went away, you would lose the ability to fail over and that is
>>>>> it. The active namenode would still function but would not fail over,
>>>>> and all standby namenodes would not try to become active. Is this
>>>>> correct?
>>>>
>>>> Unfortunately that's not the case. If you lose shared edits, your
>>>> cluster should shut down. We currently require the NFS directory to be
>>>> highly available itself. This is achievable with even pretty
>>>> inexpensive NAS devices from your vendor of choice.
>>>>
>>>> The reason for this behavior is as follows: if the active node loses
>>>> access to the mount, it's unable to distinguish whether the mount
>>>> itself died or the node just had a local issue which broke the mount.
>>>> Imagine, for example, that the NFS client had a bug which caused the
>>>> mount to go away. Then you'd continue running for quite some time
>>>> without writing to shared edits. If your NN then crashed, a failover
>>>> would cause you to revert to an old version of the namespace, and
>>>> you'd have a case of permanent data loss due to divergence of the
>>>> image before and after failover.
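>>>>
>>>> That's also why the NFS mount options matter: the NN can only abort if
>>>> the client surfaces an error instead of hanging. In general, a hard
>>>> mount retries indefinitely and can wedge the NN, while a soft mount
>>>> returns an I/O error once its retries are exhausted, which lets the NN
>>>> bail out. As a sketch only (the server name, export path, and timeout
>>>> values below are placeholders, not a vetted recommendation):
>>>>
>>>>   # /etc/fstab on both NN machines. "filer" and the paths are made up;
>>>>   # soft + bounded retries means a dead share produces an error the NN
>>>>   # can act on, rather than a hang.
>>>>   filer:/export/hdfs-shared  /mnt/nn-shared-edits  nfs  tcp,soft,intr,timeo=100,retrans=5  0  0
>>>>
>>>>   <!-- hdfs-site.xml: point shared edits at the mounted directory -->
>>>>   <property>
>>>>     <name>dfs.namenode.shared.edits.dir</name>
>>>>     <value>file:///mnt/nn-shared-edits</value>
>>>>   </property>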
>>>>
>>>> There's work under way to remove this restriction, which should be
>>>> available for general use some time this summer or early fall, if I
>>>> had to take a guess on timeline.
>>>>
>>>>> If it is the case that namenodes quit when they lose connection to
>>>>> the shared edits dir, then doesn't the shared edits dir become the
>>>>> new single point of failure?
>>>>
>>>> Yes, but it's an easy one to resolve. Most of our customers already
>>>> have a NAS device in their datacenter, which has dual heads, dual
>>>> PDUs, etc., and at least five 9s of uptime. This HA setup is basically
>>>> the same as you see in most enterprise HA systems which rely on shared
>>>> storage.
>>>>
>>>>> Unfortunately we have cleared the logs from this test, but we could
>>>>> try to reproduce it.
>>>>
>>>> That would be great, thanks!
>>>>
>>>> -Todd
>>>>
>>>>> On Tue, May 8, 2012 at 10:28 AM, Todd Lipcon <t...@cloudera.com>
>>>>> wrote:
>>>>>>
>>>>>> On Tue, May 8, 2012 at 7:46 AM, Nathaniel Cook
>>>>>> <nathani...@qualtrics.com> wrote:
>>>>>>>
>>>>>>> We have been working with an HA HDFS cluster, testing several
>>>>>>> failover scenarios. We have a small cluster of 4 machines spun up
>>>>>>> for testing. We run a namenode on two of the machines and host an
>>>>>>> NFS share on the third for the shared edits directory. The fourth
>>>>>>> machine is just a datanode. We configured the cluster for automatic
>>>>>>> failover using ZKFC. We can start and stop the namenodes with no
>>>>>>> problems; failover happens as expected. Then we tested breaking the
>>>>>>> shared edits directory: we stopped the NFS share and then
>>>>>>> re-enabled it. This caused the loss of a few edits.
>>>>>>
>>>>>> Really? What mount options are you using on your NFS mount?
>>>>>>
>>>>>> The active NN should abort immediately if the shared edits dir
>>>>>> disappears. Do you have logs available from your NNs during this
>>>>>> time?
>>>>>>
>>>>>>> This had no effect, as expected, on the namenodes, and the cluster
>>>>>>> functioned normally.
>>>>>>
>>>>>> On the contrary, I'd expect the NN to bail out on the next edit
>>>>>> (since it has no place to reliably fsync it).
>>>>>>
>>>>>>> We stopped the standby namenode and tried to start it again; it
>>>>>>> would not start because of the missing edits. No matter what we
>>>>>>> tried, we could not rebuild the shared edits directory and thus get
>>>>>>> the second namenode back online. In this state the HDFS cluster
>>>>>>> continued to function, but it was no longer an HA cluster. To get
>>>>>>> the cluster back into HA mode we had to reformat the namenode data
>>>>>>> along with the shared edits. In this case, how do you rebuild the
>>>>>>> shared edits data so you can get the cluster back to HA mode?
>>>>>>
>>>>>> It sounds like something went wrong with the facility that's
>>>>>> supposed to make the active NN crash if shared edits go away. The
>>>>>> logs will help.
>>>>>>
>>>>>> To answer your question, though: you can run the
>>>>>> "initializeSharedEdits" process again to re-initialize that edits
>>>>>> dir.
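>>>>>>
>>>>>> As a sketch, assuming the stock 2.0-style HA commands (exact flags
>>>>>> may differ slightly by version), the recovery would look something
>>>>>> like:
>>>>>>
>>>>>>   # With both NNs stopped and the NFS share mounted again, rebuild
>>>>>>   # the shared dir from the first NN's local edits, then start it:
>>>>>>   hdfs namenode -initializeSharedEdits
>>>>>>   hadoop-daemon.sh start namenode
>>>>>>
>>>>>>   # On the other NN, re-sync its namespace image from the now-active
>>>>>>   # node, then bring it back up:
>>>>>>   hdfs namenode -bootstrapStandby
>>>>>>   hadoop-daemon.sh start namenode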
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>
>>>>> --
>>>>> -Nathaniel Cook
>>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> je...@qualtrics.com
>>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> je...@qualtrics.com

--
Todd Lipcon
Software Engineer, Cloudera