Ryan Smith wrote:
> but you don't want to be the one trying to write something just after your
> production cluster lost its namenode data.
Steve,
I wasn't planning on trying to solve something like this in production. I
would assume everyone here is a professional and wouldn't even think of
something like this, but then again, maybe not. I was asking here so I knew
the limitations before I started prototyping failure-recovery logic.
-Ryan
That's good to know. Just worrying, that's all.

The common failure mode people tend to hit is that their editLog, the
list of pending operations, gets truncated when the NN runs out of disk
space. When the NN comes back up, it tries to replay this log, but the file
is truncated and the replay fails, which means the NN doesn't come back up.
1. Secondary namenodes help here, since they periodically checkpoint the
edit log into the image, shrinking how much has to be replayed.
2. We really do need Hadoop to recover from this more gracefully,
perhaps by not crashing at this point and instead halting when the
replay finishes. You will lose some data, but you don't end up having to
manually edit the binary edit log to get back to the same state. Code and
tests would be valued.
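The recovery behaviour in point 2 can be sketched roughly like this: an
illustrative Java replay loop that halts at a truncated tail instead of
crashing. The record format here (a hypothetical 4-byte length prefix plus
payload) is an assumption for illustration, not HDFS's actual editLog
encoding, and `TolerantReplay` is not a real Hadoop class.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT the real HDFS edit log format.
// Records are a hypothetical 4-byte length prefix plus payload.
public class TolerantReplay {

    // Replay ops from the log, treating a truncated final record as
    // end-of-log instead of a fatal error.
    public static List<byte[]> replay(String path) throws IOException {
        List<byte[]> ops = new ArrayList<byte[]>();
        DataInputStream in = new DataInputStream(new FileInputStream(path));
        try {
            while (true) {
                int len;
                try {
                    len = in.readInt();        // record length prefix
                } catch (EOFException e) {
                    break;                     // clean end of log
                }
                byte[] payload = new byte[len];
                try {
                    in.readFully(payload);     // record body
                } catch (EOFException e) {
                    // Truncated tail: drop the partial record, keep the
                    // ops replayed so far, and halt instead of crashing.
                    System.err.println("edit log truncated; dropping last record");
                    break;
                }
                ops.add(payload);
            }
        } finally {
            in.close();
        }
        return ops;
    }

    // Build a sample log: two complete records, then one cut short.
    public static String makeSampleLog() throws IOException {
        File f = File.createTempFile("editlog", ".bin");
        f.deleteOnExit();
        DataOutputStream out = new DataOutputStream(new FileOutputStream(f));
        out.writeInt(3); out.write(new byte[]{1, 2, 3});
        out.writeInt(2); out.write(new byte[]{4, 5});
        out.writeInt(10); out.write(new byte[]{6});   // truncated record
        out.close();
        return f.getPath();
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> ops = replay(makeSampleLog());
        System.out.println("replayed " + ops.size() + " ops");  // replayed 2 ops
    }
}
```

The real fix would be more involved (the NN applies ops to in-memory state
as it replays), but the shape is the same: stop cleanly at the damage rather
than dying mid-replay.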
-steve