Actually, we discovered today an annoying bug in our test-app, which might
have moved some of the HDFS files on the cluster, including the metadata
files.

I presume this could be the reason for such behavior? :)

2009/5/5 Stas Oskin <stas.os...@gmail.com>

> Hi Raghu.
>
> The only lead I have is that my root mount filled up completely.
>
> This in itself should not have caused the metadata corruption, as the
> metadata was stored on another mount point, which had plenty of space.
>
> But perhaps the fact that the NameNode/SecNameNode didn't have enough space
> for logs caused this?
>
> Unfortunately I was pressed for time to get the cluster up and running, and
> didn't preserve the logs or the image.
> If this happens again - I will surely do so.
>
> Regards.
>
> 2009/5/5 Raghu Angadi <rang...@yahoo-inc.com>
>
>
>> Stas,
>>
>> This is indeed a serious issue.
>>
>> Did you happen to store the corrupt image? Can this be reproduced
>> using the image?
>>
>> Usually you can recover manually from a corrupt or truncated image. But
>> more importantly, we want to find out how it got into this state.
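>>
>> One manual route (just a rough sketch - the paths below are examples, use
>> whatever dfs.name.dir and fs.checkpoint.dir actually point to on your
>> nodes, and back everything up first) is to restore the last checkpoint
>> taken by the secondary:
>>
>>   # stop HDFS and set aside the damaged name directory
>>   bin/stop-dfs.sh
>>   mv /data/dfs/name/current /data/dfs/name/current.corrupt
>>
>>   # copy the secondary's last checkpoint into dfs.name.dir
>>   cp -r /data/dfs/namesecondary/current /data/dfs/name/current
>>
>>   # restart and verify the namespace
>>   bin/start-dfs.sh
>>   bin/hadoop fsck /
>>
>> Anything written to the namespace after that checkpoint is lost, of
>> course.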
>>
>> Raghu.
>>
>>
>> Stas Oskin wrote:
>>
>>> Hi.
>>>
>>> This is quite a worrisome issue.
>>>
>>> Can anyone advise on this? I'm really concerned it could appear in
>>> production, and cause a huge data loss.
>>>
>>> Is there any way to recover from this?
>>>
>>> Regards.
>>>
>>> 2009/5/5 Tamir Kamara <tamirkam...@gmail.com>
>>>
>>>> I didn't have a space problem which led to it (I think). The corruption
>>>> started after I bounced the cluster.
>>>> At the time, I tried to investigate what led to the corruption but didn't
>>>> find anything useful in the logs besides this line:
>>>>
>>>> saveLeases found path
>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>> but no matching entry in namespace
>>>>
>>>> I also tried to recover from the secondary name node files but the
>>>> corruption was too widespread and I had to format.
>>>>
>>>> Tamir
>>>>
>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>>
>>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>>
>>>>> Regards.
>>>>>
>>>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>>>
>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>> reformat the cluster too...
>>>>>>
>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com>
>>>>>> wrote:
>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> After rebooting the NameNode server, I found out the NameNode doesn't
>>>>>>> start anymore.
>>>>>>>
>>>>>>> The logs contained this error:
>>>>>>> "FSNamesystem initialization failed"
>>>>>>>
>>>>>>>
>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>> SecondaryNameNode. The problem is, it was completely empty!
>>>>>>>
>>>>>>> I had an issue that might have caused this - the root mount had run
>>>>>>> out of space. But both the NameNode and the SecondaryNameNode
>>>>>>> directories were on another mount point with plenty of space there -
>>>>>>> so it's very strange that they were impacted in any way.
>>>>>>>
>>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>>> result could not be written, have caused this?
>>>>>>>
>>>>>>>
>>>>>>> To get HDFS running again, I had to format the HDFS (including
>>>>>>> manually erasing the files from the DataNodes). While this is
>>>>>>> reasonable in a test environment, production-wise it would be very
>>>>>>> bad.
>>>>>>>
>>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>>> future?
>>>>>>>
>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
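>>>>>>>
>>>>>>> Would listing more than one directory in dfs.name.dir in
>>>>>>> hadoop-site.xml have protected against this, so the image and edits
>>>>>>> get written to two places (e.g. a second disk or an NFS mount)?
>>>>>>> Something like this (the paths here are only examples):
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>dfs.name.dir</name>
>>>>>>>     <value>/data/dfs/name,/mnt/nfs/dfs/name</value>
>>>>>>>   </property>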
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>>
>>>
>>
