Ryan Smith wrote:
> but you don't want to be the one trying to write something just after your
> production cluster lost its namenode data.

Steve,

I wasn't planning on trying to solve something like this in production. I
would assume everyone here is a professional and wouldn't even think of
something like this, but then again maybe not. I was asking here so that I
knew the limitations before I started prototyping failure recovery logic.

-Ryan


That's good to know. Just worrying, that's all.

The common failure mode people tend to hit is that their edit log, the list of pending operations, gets truncated when the NN runs out of disk space. When the NN comes back up, it tries to replay this log, but the file is truncated and the replay fails, which means the NN doesn't come back up.
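To make that concrete, here is a rough sketch of why the replay dies. The opcode-plus-length record layout and the class name are invented for illustration; this is not the real FSEditLog code, just the strict-replay pattern:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical, simplified replay loop; the record layout (opcode byte +
// length-prefixed payload) is made up and is NOT Hadoop's on-disk format.
public class StrictEditLogReplay {

    public static int replay(String editLogPath) throws IOException {
        int applied = 0;
        DataInputStream in =
            new DataInputStream(new FileInputStream(editLogPath));
        try {
            while (in.available() > 0) {
                int opCode = in.readByte();       // which operation
                int length = in.readInt();        // payload length
                byte[] payload = new byte[length];
                in.readFully(payload);            // throws EOFException if the
                                                  // file was cut off mid-record
                applyOperation(opCode, payload);  // placeholder apply step
                applied++;
            }
        } finally {
            in.close();
        }
        // Any EOFException propagates up: startup aborts, the NN stays down.
        return applied;
    }

    private static void applyOperation(int opCode, byte[] payload) {
        // stand-in for applying the edit to the in-memory namespace
    }
}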

1. Secondary namenodes help here: they periodically checkpoint the edit log into the fsimage, so there is far less to replay after a failure.

2. We really do need Hadoop to recover from this more gracefully, perhaps by not crashing at this point and instead halting the replay once it has gone as far as the intact log allows. You will lose some data, but you don't end up having to manually edit the binary edit log to get to the same state. Code and tests would be valued; something along the lines of the sketch below.
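For what it's worth, a rough sketch of that more forgiving behaviour, against the same invented record layout as above rather than the real edit log code: treat a mid-record EOF as the end of the usable log, keep whatever replayed cleanly, and let the NN come up without the truncated tail.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

// Sketch only: tolerant replay that stops at the first truncated record
// instead of aborting NameNode startup. Same invented record layout as above.
public class TolerantEditLogReplay {

    public static int replay(String editLogPath) throws IOException {
        int applied = 0;
        DataInputStream in =
            new DataInputStream(new FileInputStream(editLogPath));
        try {
            while (true) {
                int opCode;
                try {
                    opCode = in.readByte();   // clean end of log lands here
                } catch (EOFException endOfLog) {
                    break;
                }
                try {
                    int length = in.readInt();
                    byte[] payload = new byte[length];
                    in.readFully(payload);
                    applyOperation(opCode, payload);
                    applied++;
                } catch (EOFException truncated) {
                    // Mid-record EOF: the tail was cut off (e.g. disk full).
                    // Keep what replayed cleanly and let startup continue.
                    System.err.println("Edit log truncated after " + applied
                        + " operations; discarding the incomplete tail");
                    break;
                }
            }
        } finally {
            in.close();
        }
        return applied;
    }

    private static void applyOperation(int opCode, byte[] payload) {
        // stand-in for applying the edit to the in-memory namespace
    }
}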

-steve
