I was going back to see if a bug ticket had been opened against this
problem, but I am not seeing one. Before I go and open one, can anyone let
me know whether I just failed to find it?
- Adam
On 1/12/11 1:13 PM, Todd Lipcon wrote:
Hi guys,
After Friso's issue a few weeks ago I tried to reproduce this problem
running multiple secondary namenodes but wasn't able to.
Now that two people seem to have had the issue, I'll give it another go.
Has anyone else in the wild seen this issue?
-Todd
On Wed, Jan 12, 2011 at 1:05 PM, Friso van Vollenhoven
<fvanvollenho...@xebia.com> wrote:
Hi Adam,
We have probably had the same problem on CDH3b3. Running two
secondary NNs corrupts the edits.new file, even though in theory it
should not cause any trouble. Everything runs fine as long as the NN
stays up, but restarting the NN will not work because of the
corruption. We have reproduced this once more to verify. With only one
secondary NN running, restarting works fine (also after a couple of
days of operation).
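As a quick sanity check for this situation, one could count how many SecondaryNameNode JVMs are running across the master hosts. A minimal sketch using `jps` output; the sample below is made up for illustration (in practice you would run `jps | grep -c SecondaryNameNode` on each candidate host):

```shell
# Count SecondaryNameNode entries in jps-style output read from stdin.
# More than one such process across the cluster is the risky setup
# discussed in this thread.
count_snn() { grep -c 'SecondaryNameNode'; }

# Illustrative sample of `jps` output (PIDs are made up):
sample='1234 NameNode
2345 SecondaryNameNode
3456 SecondaryNameNode'

n=$(printf '%s\n' "$sample" | count_snn)
echo "SecondaryNameNode processes: $n"
```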
If I am correct, your proposed solution would set you back to an image
from about 15-30 minutes before the crash. Whether that will work out
depends on what you do with your HDFS (HBase, append-only things, ?).
In our case we are running HBase, and going back in time with the NN
image is not very helpful then, because splits and compactions remove
and add files all the time. On append-only workloads where you have
the option of redoing whatever it is that you did just before the time
of the crash, this could work. But please verify with someone with a
better understanding of HDFS internals.
Also, there apparently is a way of healing a corrupt edits file
using your favorite hex editor. There is a thread here:
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201010.mbox/%3caanlktinbhmn1x8dlir-c4ibhja9nh46tns588cqcn...@mail.gmail.com%3E
There is also a thread about this (our) problem on the cdh-user Google
group; you could try posting there as well.
Friso