Hi Adam,

I've tried to reproduce the issue a few more times but have still been
unable to. I have an open task internally to look into this more in the
coming months, but no JIRA has been filed externally that I'm aware of,
since we haven't been able to repro.

-Todd

On Tue, Feb 22, 2011 at 10:56 AM, Adam Phelps <a...@opendns.com> wrote:

> I was going back to see if a bug ticket had been opened against this
> problem, but I'm not seeing one.  Before I go and open one, can anyone let
> me know if I just failed to find it?
>
> - Adam
>
>
> On 1/12/11 1:13 PM, Todd Lipcon wrote:
>
>> Hi guys,
>>
>> After Friso's issue a few weeks ago, I tried to reproduce this problem
>> by running multiple secondary namenodes but wasn't able to.
>>
>> Now that two people seem to have had the issue, I'll give it another go.
>>
>> Has anyone else in the wild seen this issue?
>>
>> -Todd
>>
>> On Wed, Jan 12, 2011 at 1:05 PM, Friso van Vollenhoven
>> <fvanvollenho...@xebia.com <mailto:fvanvollenho...@xebia.com>> wrote:
>>
>>    Hi Adam,
>>
>>    We have probably had the same problem on CDH3b3. Running two
>>    secondary NNs corrupts the edits.new, even though doing so should
>>    not cause any trouble in theory. Everything runs fine as long as the
>>    NN stays up, but restarting the NN will not work because of the
>>    corruption. We have reproduced this once more to verify. With only
>>    one secondary NN running, restarting works fine (also after a couple
>>    of days of operation).
>>
>>    If I am correct, your proposed solution would set you back to an
>>    image from about 15-30 minutes before the crash. Whether that will
>>    work out depends, I think, on what you do with your HDFS (HBase,
>>    append-only things, etc.). In our case we are running HBase, and
>>    going back in time with the NN image is not very helpful then,
>>    because splits and compactions are removing and adding files all the
>>    time. For append-only workloads where you have the option of redoing
>>    whatever you did just before the crash, this could work. But please
>>    verify with someone who has a better understanding of HDFS
>>    internals.
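>>
>>    For what it's worth, here is a rough sketch of what that rollback
>>    could look like. The paths below are assumptions for a typical CDH3
>>    layout (dfs.name.dir and fs.checkpoint.dir), not something from this
>>    thread, and the supported route is probably to stop the NN and start
>>    it with -importCheckpoint, so treat this only as an illustration of
>>    which files are involved:
>>
>>        #!/usr/bin/env python
>>        # Rough sketch only: stage a NameNode rollback to the last
>>        # secondary NN checkpoint. Stop the NameNode first, and adjust
>>        # the assumed paths to match your own hdfs-site.xml.
>>        import os, shutil, time
>>
>>        NAME_DIR = "/data/dfs/name"                # assumed dfs.name.dir
>>        CHECKPOINT_DIR = "/data/dfs/namesecondary" # assumed fs.checkpoint.dir
>>
>>        stamp = time.strftime("%Y%m%d-%H%M%S")
>>        backup = "%s.corrupt-%s" % (NAME_DIR, stamp)
>>
>>        # Keep the corrupt image/edits around in case they are needed later.
>>        shutil.move(NAME_DIR, backup)
>>        # -importCheckpoint refuses to run if dfs.name.dir already contains
>>        # a legal image, so the old dir is moved aside and an empty one made.
>>        os.makedirs(NAME_DIR)
>>
>>        img = os.path.join(CHECKPOINT_DIR, "current", "fsimage")
>>        print("Checkpoint image from: %s" % time.ctime(os.path.getmtime(img)))
>>        print("Now run: hadoop namenode -importCheckpoint")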
>>
>>    Also, there apparently is a way of healing a corrupt edits file
>>    using your favorite hex editor. There is a thread here:
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3caanlktinbhmn1x8dlir-c4ibhja9nh46tns588cqcn...@mail.gmail.com%3E
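>>
>>    In case that link goes stale: if I remember the trick correctly, the
>>    edits file is preallocated and padded with the OP_INVALID opcode
>>    (0xff), and the NN stops replaying when it hits it, so the fix is to
>>    overwrite everything from the first corrupt record onward with 0xff
>>    bytes. A rough, untested sketch of that patching step (the offset and
>>    file name are placeholders you would have to determine yourself, e.g.
>>    from the replay stack trace and a hexdump; only ever work on a copy):
>>
>>        #!/usr/bin/env python
>>        # Rough sketch: pad the corrupt tail of an edits file with 0xff
>>        # (OP_INVALID) so the NameNode treats it as end-of-log and stops
>>        # replaying just before the broken record. OFFSET and PATH are
>>        # placeholders; never run this against the original file.
>>        OFFSET = 123456          # placeholder: first byte of the bad record
>>        PATH = "edits.new.copy"  # placeholder: a copy of the corrupt file
>>
>>        with open(PATH, "r+b") as f:
>>            f.seek(0, 2)
>>            size = f.tell()
>>            f.seek(OFFSET)
>>            f.write(b"\xff" * (size - OFFSET))  # keep the original length
>>        print("Padded %d bytes starting at offset %d" % (size - OFFSET, OFFSET))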
>>
>>    There is a thread about this (our) problem on the cdh-user Google
>>    group. You could also try to post there.
>>
>>
>>    Friso
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
