Stas Oskin wrote:
Actually, we discovered today an annoying bug in our test app, which might
have removed some of the HDFS files from the cluster, including the metadata
files.

oops! presumably it could have removed the image file itself.

I presume this could be the reason for such behavior? :)

Certainly. It could lead to many different failures. If you had the stack trace of the exception, it would be clearer what the error was this time.

Raghu.

2009/5/5 Stas Oskin <stas.os...@gmail.com>

Hi Raghu.

The only lead I have is that my root mount filled up completely.

This in itself should not have caused the metadata corruption, as the metadata
was stored on another mount point, which had plenty of space.

But perhaps the fact that the NameNode/SecondaryNameNode didn't have enough space
for logs caused this?

Unfortunately I was pressed for time to get the cluster up and running, and
didn't preserve the logs or the image.
If this happens again, I will surely do so.

Regards.

2009/5/5 Raghu Angadi <rang...@yahoo-inc.com>


Stas,

This is indeed a serious issue.

Did you happen to store the corrupt image? Can this be reproduced
using the image?

Usually you can recover manually from a corrupt or truncated image. But
more importantly, we want to find out how it got into this state.
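
For reference, one manual path is roughly the following (a sketch only, assuming the SecondaryNameNode checkpoint under fs.checkpoint.dir is still intact; paths are examples, not your real dfs.name.dir / fs.checkpoint.dir):

  # Stop HDFS and back up whatever is left of the current image first.
  bin/stop-dfs.sh
  cp -r /data/dfs/name /data/dfs/name.corrupt.bak

  # Either let the NameNode import the secondary's checkpoint into an
  # empty dfs.name.dir (if your release has the -importCheckpoint option) ...
  bin/hadoop namenode -importCheckpoint

  # ... or copy the checkpoint files over by hand and start normally.
  cp /data/dfs/namesecondary/current/* /data/dfs/name/current/
  bin/start-dfs.sh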

Raghu.


Stas Oskin wrote:

Hi.

This is quite a worrisome issue.

Can anyone advise on this? I'm really concerned it could appear in
production and cause huge data loss.

Is there any way to recover from this?

Regards.

2009/5/5 Tamir Kamara <tamirkam...@gmail.com>

I didn't have a space problem which led to it (I think). The corruption
started after I bounced the cluster.
At the time, I tried to investigate what led to the corruption but didn't
find anything useful in the logs besides this line:
saveLeases found path /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002 but no matching entry in namespace

I also tried to recover from the SecondaryNameNode files, but the
corruption was too widespread and I had to format.

Tamir

On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:

 Hi.
Same conditions, where the space ran out and the fs got corrupted?

Or did it get corrupted by itself (which is even more worrying)?

Regards.

2009/5/4 Tamir Kamara <tamirkam...@gmail.com>

I had the same problem a couple of weeks ago with 0.19.1. Had to reformat the cluster too...
On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com> wrote:
Hi.

After rebooting the NameNode server, I found out the NameNode doesn't start anymore.

The logs contained this error:
"FSNamesystem initialization failed"


I suspected filesystem corruption, so I tried to recover from the
SecondaryNameNode. Problem is, it was completely empty!
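
(For anyone checking the same thing, a quick sanity check, with example paths in place of fs.checkpoint.dir and dfs.name.dir: a usable checkpoint normally shows up as a current/ directory holding fsimage, edits, fstime and VERSION.)

  # On the SecondaryNameNode host (example path for fs.checkpoint.dir):
  ls -l /data/dfs/namesecondary/current

  # Compare against the NameNode's own image directory (dfs.name.dir):
  ls -l /data/dfs/name/current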

I had an issue that might have caused this - the root mount ran out of
space. But both the NameNode and the SecondaryNameNode directories were on
another mount point with plenty of space there, so it's very strange that
they were impacted in any way.

Perhaps the logs, which were located on the root mount and as a result
could not be written, caused this?
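
(One mitigation I'm considering, sketched below and untested on 0.18.3: point dfs.name.dir at two directories on separate mounts, so the NameNode keeps redundant copies of the image and edits, and keep fs.checkpoint.dir on yet another mount. Paths are examples only.)

  <!-- hadoop-site.xml -->
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/dfs/name,/data2/dfs/name</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data3/dfs/namesecondary</value>
  </property>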

To get HDFS running again, I had to format the HDFS (including manually
erasing the files from the DataNodes). While this is reasonable in a test
environment, production-wise it would be very bad.
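
(Roughly what that amounted to, with example paths in place of my actual dfs.data.dir:)

  # On the NameNode: wipe the namespace and create a fresh, empty image.
  bin/stop-dfs.sh
  bin/hadoop namenode -format

  # On every DataNode: remove the old block data, since blocks written under
  # the old namespace ID would clash with the freshly formatted namespace.
  rm -rf /data/dfs/data/*

  bin/start-dfs.sh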

Any idea why it happened, and what can be done to prevent it in the future?

I'm using the stable 0.18.3 version of Hadoop.

Thanks in advance!



