I went back to check whether a bug ticket had been opened for this problem, but I am not seeing one. Before I open one, can anyone tell me whether I just failed to find it?

- Adam

On 1/12/11 1:13 PM, Todd Lipcon wrote:
Hi guys,

After Friso's issue a few weeks ago I tried to reproduce this problem
running multiple secondary namenodes but wasn't able to.

Now that two people seem to have had the issue, I'll give it another go.

Has anyone else in the wild seen this issue?

-Todd

On Wed, Jan 12, 2011 at 1:05 PM, Friso van Vollenhoven
<fvanvollenho...@xebia.com> wrote:

    Hi Adam,

    We have probably had the same problem on CDH3b3. Running two
    secondary NNs corrupts the edits.new, although it should not give
    any trouble. Everything runs fine as long as it stays up, but
    restarting the NN will not work because of the corruption. We have
    reproduced this once more to verify. With only one secondary NN
    running, restarting works fine (also after a couple of days of
    operation).

    If I am correct, your proposed solution would set you back to an image
    from about 15-30 minutes before the crash. Whether that will work out
    depends, I think, on what you do with your HDFS (HBase, append-only
    workloads, etc.). In our case we are running HBase, and going back in
    time with the NN image is not very helpful, because splits and
    compactions remove and add files all the time. On append-only
    workloads, where you have the option of redoing whatever you did just
    before the crash, this could work. But please verify this with
    someone who has a better understanding of HDFS internals.

    Also, there apparently is a way of healing a corrupt edits file
    using your favorite hex editor. There is a thread here:
    
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3caanlktinbhmn1x8dlir-c4ibhja9nh46tns588cqcn...@mail.gmail.com%3E

    There is a thread about this (our) problem on the cdh-user Google
    group. You could also try to post there.


    Friso
