Hi Jeff,

Check out HDFS-3077. We'll probably need the most help when it comes time to
do testing. Any testing you can do on the current HA solution, non-ideal as
it may be, is also immensely valuable. For example, if you can reproduce the
case where the NN didn't exit upon loss of shared edits, that would be a real
find: a bug like that would likely hit the quorum-based solution as well.
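For a rough idea of where that work is headed: the quorum design replaces the
NFS path in dfs.namenode.shared.edits.dir with a qjournal:// URI listing
several journal daemons. This is only a sketch based on the current direction
of HDFS-3077; the hostnames and "mycluster" ID below are placeholders, and the
property names could still change before it ships:

  <!-- hdfs-site.xml: shared edits served by a quorum of journal daemons.
       jn1/jn2/jn3 and "mycluster" are placeholder names. -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
  </property>
  <property>
    <!-- local disk on each journal daemon where it keeps its copy of the edits -->
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/jn/edits</value>
  </property>

The upshot is that an edit is acknowledged once a majority of the journal
daemons have it, so no single box is a point of failure.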
Thanks
-Todd

On Tue, May 8, 2012 at 4:20 PM, Jeff Whiting <je...@qualtrics.com> wrote:
> Thanks for being patient and listening to my rants. I'm excited to see HDFS
> continue to move forward. If the organization I'm working for were willing
> to spend some resources to help speed this process up, where should we start
> looking? I'm sure there are quite a few JIRAs on these issues.
>
> Most of what we've done with the Hadoop ecosystem has been ZooKeeper and
> HBase related.
>
> Thanks,
> ~Jeff
>
>
> On 5/8/2012 2:46 PM, Todd Lipcon wrote:
>>
>> On Tue, May 8, 2012 at 12:38 PM, Jeff Whiting <je...@qualtrics.com> wrote:
>>>
>>> It seems the NN was originally written with the assumption that disks
>>> fail and stuff happens. Hence the ability to have multiple directories
>>> store your NN data, even though each directory is most likely already
>>> redundant / HA.
>>>
>>> [start rant]
>>>
>>> My opinion is that it is a step backwards that the shared edits dir
>>> wasn't written with the same assumptions. If any one problem can take
>>> out your cluster, then it isn't HA. Allowing a single NFS failure to
>>> take down your cluster and saying "make NFS HA" just seems to move the
>>> HA problem, not solve it. I would expect a true HA solution to be
>>> completely self-contained within the Hadoop ecosystem. All machines
>>> fail... eventually, and that needs to be planned for. At a minimum, a
>>> failure of the shared edits dir should only disable failover and
>>> provide a recovery mechanism; ideally, the NN would have been rewritten
>>> as a cluster (similar to ZooKeeper or Ceph) to enable HA.
>>>
>>> [end rant]
>>
>> Like I said earlier in the thread, work is already under way on this
>> and should be complete within a number of months.
>>
>> In many practical deployments, what we have already can provide
>> complete HA. In others, like the AWS example you mentioned, we need a
>> bit more, and we're working on it. Hang on a bit longer and it will be
>> good to go.
>>
>> -Todd
>>
>>> Sorry for the rant. I just really want to see HDFS become a complete HA
>>> system without caveats.
>>>
>>> ~Jeff
>>>
>>>
>>> On 5/8/2012 11:44 AM, Todd Lipcon wrote:
>>>>
>>>> On Tue, May 8, 2012 at 10:33 AM, Nathaniel Cook
>>>> <nathani...@qualtrics.com> wrote:
>>>>>
>>>>> We ran the initializeSharedEdits command and it didn't have any
>>>>> effect, but that may be because of the weird state we got it in.
>>>>>
>>>>> So help me understand: I was under the assumption that if shared
>>>>> edits went away, you would lose the ability to fail over and that is
>>>>> it. The active namenode would still function but would not fail over,
>>>>> and all standby namenodes would not try to become active. Is this
>>>>> correct?
>>>>
>>>> Unfortunately that's not the case. If you lose shared edits, your
>>>> cluster should shut down. We currently require the NFS directory to be
>>>> highly available itself. This is achievable with even pretty
>>>> inexpensive NAS devices from your vendor of choice.
>>>>
>>>> The reason for this behavior is as follows: if the active node loses
>>>> access to the mount, it's unable to distinguish whether the mount
>>>> itself died or the node just had a local issue which broke the mount.
>>>> Imagine, for example, that the NFS client had a bug which caused the
>>>> mount to go away. Then you'd continue running for quite some time
>>>> without writing to shared edits. If your NN then crashed, a failover
>>>> would cause you to revert to an old version of the namespace, and
>>>> you'd have a case of permanent data loss due to divergence of the
>>>> image before and after failover.
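>>>>
>>>> That's also why the NFS mount options matter: the NN can only abort if
>>>> the client surfaces an error instead of hanging. In general, a hard
>>>> mount retries indefinitely and can wedge the NN, while a soft mount
>>>> returns an I/O error once its retries are exhausted, which lets the NN
>>>> bail out. As a sketch only (the server name, export path, and timeout
>>>> values below are placeholders, not a vetted recommendation):
>>>>
>>>>   # /etc/fstab on both NN machines. "filer" and the paths are made up;
>>>>   # soft + bounded retries means a dead share produces an error the NN
>>>>   # can act on, rather than a hang.
>>>>   filer:/export/hdfs-shared  /mnt/nn-shared-edits  nfs  tcp,soft,intr,timeo=100,retrans=5  0  0
>>>>
>>>>   <!-- hdfs-site.xml: point shared edits at the mounted directory -->
>>>>   <property>
>>>>     <name>dfs.namenode.shared.edits.dir</name>
>>>>     <value>file:///mnt/nn-shared-edits</value>
>>>>   </property>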
>>>>
>>>> There's work under way to remove this restriction, which should be
>>>> available for general use some time this summer or early fall, if I
>>>> had to take a guess on timeline.
>>>>
>>>>> If it is the case that namenodes quit when they lose connection to
>>>>> the shared edits dir, then doesn't the shared edits dir become the
>>>>> new single point of failure?
>>>>
>>>> Yes, but it's an easy one to resolve. Most of our customers already
>>>> have a NAS device in their datacenter, which has dual heads, dual
>>>> PDUs, etc., and at least five 9s of uptime. This HA setup is basically
>>>> the same as you see in most enterprise HA systems which rely on shared
>>>> storage.
>>>>
>>>>> Unfortunately we have cleared the logs from this test, but we could
>>>>> try to reproduce it.
>>>>
>>>> That would be great, thanks!
>>>>
>>>> -Todd
>>>>
>>>>> On Tue, May 8, 2012 at 10:28 AM, Todd Lipcon <t...@cloudera.com>
>>>>> wrote:
>>>>>>
>>>>>> On Tue, May 8, 2012 at 7:46 AM, Nathaniel Cook
>>>>>> <nathani...@qualtrics.com> wrote:
>>>>>>>
>>>>>>> We have been working with an HA HDFS cluster, testing several
>>>>>>> failover scenarios. We have a small cluster of 4 machines spun up
>>>>>>> for testing. We run a namenode on two of the machines and host an
>>>>>>> NFS share on the third for the shared edits directory. The fourth
>>>>>>> machine is just a datanode. We configured the cluster for automatic
>>>>>>> failover using ZKFC. We can start and stop the namenodes with no
>>>>>>> problems; failover happens as expected. Then we tested breaking the
>>>>>>> shared edits directory: we stopped the NFS share and then
>>>>>>> re-enabled it. This caused the loss of a few edits.
>>>>>>
>>>>>> Really? What mount options are you using on your NFS mount?
>>>>>>
>>>>>> The active NN should abort immediately if the shared edits dir
>>>>>> disappears. Do you have logs available from your NNs during this
>>>>>> time?
>>>>>>
>>>>>>> This had no effect, as expected, on the namenodes, and the cluster
>>>>>>> functioned normally.
>>>>>>
>>>>>> On the contrary, I'd expect the NN to bail out on the next edit
>>>>>> (since it has no place to reliably fsync it).
>>>>>>
>>>>>>> We stopped the standby namenode and tried to start it again; it
>>>>>>> would not start because of the missing edits. No matter what we
>>>>>>> tried, we could not rebuild the shared edits directory and thus get
>>>>>>> the second namenode back online. In this state the HDFS cluster
>>>>>>> continued to function, but it was no longer an HA cluster. To get
>>>>>>> the cluster back into HA mode we had to reformat the namenode data
>>>>>>> along with the shared edits. In this case, how do you rebuild the
>>>>>>> shared edits data so you can get the cluster back to HA mode?
>>>>>>
>>>>>> It sounds like something went wrong with the facility that's
>>>>>> supposed to make the active NN crash if shared edits go away. The
>>>>>> logs will help.
>>>>>>
>>>>>> To answer your question, though: you can run the
>>>>>> "initializeSharedEdits" process again to re-initialize that edits
>>>>>> dir.
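>>>>>>
>>>>>> As a sketch, assuming the stock 2.0-style HA commands (exact flags
>>>>>> may differ slightly by version), the recovery would look something
>>>>>> like:
>>>>>>
>>>>>>   # With both NNs stopped and the NFS share mounted again, rebuild
>>>>>>   # the shared dir from the first NN's local edits, then start it:
>>>>>>   hdfs namenode -initializeSharedEdits
>>>>>>   hadoop-daemon.sh start namenode
>>>>>>
>>>>>>   # On the other NN, re-sync its namespace image from the now-active
>>>>>>   # node, then bring it back up:
>>>>>>   hdfs namenode -bootstrapStandby
>>>>>>   hadoop-daemon.sh start namenode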
>>>>>>
>>>>>> Thanks
>>>>>> -Todd
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>
>>>>> --
>>>>> -Nathaniel Cook
>>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> je...@qualtrics.com
>>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> je...@qualtrics.com

--
Todd Lipcon
Software Engineer, Cloudera