Actually, we discovered today an annoying bug in our test app, which might have moved some of the HDFS files around on the cluster, including the metadata files.
I presume this could be the reason for such behavior? :)

2009/5/5 Stas Oskin <stas.os...@gmail.com>

> Hi Raghu.
>
> The only lead I have is that my root mount filled up completely.
>
> This in itself should not have caused the metadata corruption, as the
> metadata is stored on another mount point, which had plenty of space.
>
> But perhaps the fact that the NameNode/SecondaryNameNode didn't have
> enough space for logs caused this?
>
> Unfortunately I was pressed for time to get the cluster up and running,
> and didn't preserve the logs or the image.
> If this happens again, I will surely do so.
>
> Regards.
>
> 2009/5/5 Raghu Angadi <rang...@yahoo-inc.com>
>
>> Stas,
>>
>> This is indeed a serious issue.
>>
>> Did you happen to store the corrupt image? Can this be reproduced
>> using the image?
>>
>> Usually you can recover manually from a corrupt or truncated image. But
>> more importantly we want to find out how it got into this state.
>>
>> Raghu.
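For reference, a rough sketch of the manual recovery Raghu describes, assuming
the 0.18 defaults and that the SecondaryNameNode holds a recent checkpoint; the
paths below are examples only, not the actual values from this cluster:

    # Example path only - substitute your real dfs.name.dir value.
    NAME_DIR=/data/hadoop/dfs/name

    bin/stop-dfs.sh

    # Preserve the corrupt image first, on a partition with free space,
    # so it can still be examined (or attached to a JIRA) later.
    tar czf /data/backup/corrupt-name-dir.tar.gz -C "$NAME_DIR" .

    # Rebuild the name directory from the last secondary checkpoint:
    # empty dfs.name.dir and start the NameNode with -importCheckpoint,
    # which reads the image back in from fs.checkpoint.dir.
    rm -rf "$NAME_DIR"/*
    bin/hadoop namenode -importCheckpoint

In this particular case the secondary's directory was itself empty, so this
would not have helped, but it is worth trying before resorting to a format.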
>>
>> Stas Oskin wrote:
>>
>>> Hi.
>>>
>>> This is quite a worrisome issue.
>>>
>>> Can anyone advise on this? I'm really concerned it could appear in
>>> production, and cause a huge data loss.
>>>
>>> Is there any way to recover from this?
>>>
>>> Regards.
>>>
>>> 2009/5/5 Tamir Kamara <tamirkam...@gmail.com>
>>>
>>>> I didn't have a space problem which led to it (I think). The corruption
>>>> started after I bounced the cluster.
>>>> At the time, I tried to investigate what led to the corruption but
>>>> didn't find anything useful in the logs besides this line:
>>>> saveLeases found path
>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>> but no matching entry in namespace
>>>>
>>>> I also tried to recover from the secondary name node files, but the
>>>> corruption was too widespread and I had to format.
>>>>
>>>> Tamir
>>>>
>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>>
>>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>>
>>>>> Regards.
>>>>>
>>>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>>>
>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>> reformat the cluster too...
>>>>>>
>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> After rebooting the NameNode server, I found out the NameNode doesn't
>>>>>>> start anymore.
>>>>>>>
>>>>>>> The logs contained this error:
>>>>>>> "FSNamesystem initialization failed"
>>>>>>>
>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>
>>>>>>> I had an issue that might have caused this - the root mount had run
>>>>>>> out of space. But both the NameNode and the SecondaryNameNode
>>>>>>> directories were on another mount point with plenty of space there,
>>>>>>> so it's very strange that they were impacted in any way.
>>>>>>>
>>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>>> result could not be written, caused this?
>>>>>>>
>>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>>> erasing the files from the DataNodes). While this is reasonable in a
>>>>>>> test environment, production-wise it would be very bad.
>>>>>>>
>>>>>>> Any idea why it happened, and what can be done to prevent it in the
>>>>>>> future?
>>>>>>>
>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>>
>>>>>>> Thanks in advance!
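Regarding prevention, a common approach (a sketch only; the directory values
below are placeholders and depend on your layout) is to give the NameNode
several metadata directories on separate disks in conf/hadoop-site.xml, and to
point the secondary's checkpoint directory somewhere you verify is actually
being written:

    <!-- conf/hadoop-site.xml (0.18.x); paths are placeholders -->
    <property>
      <name>dfs.name.dir</name>
      <!-- comma-separated list: the image is written to every directory,
           so losing one disk does not lose the namespace -->
      <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/namenode/dfs/name</value>
    </property>
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data1/dfs/namesecondary</value>
    </property>

Setting HADOOP_LOG_DIR in conf/hadoop-env.sh to a partition with enough space,
and monitoring free space on all of the mounts involved, also avoids the
situation where the daemons cannot write their logs, which is the main suspect
in this thread.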