Filed HADOOP-5798.
On Wed, May 6, 2009 at 9:53 PM, Raghu Angadi <rang...@yahoo-inc.com> wrote:

> Tamir Kamara wrote:
>
>> Hi Raghu,
>>
>> The thread you posted is my original post, written when this problem
>> first happened on my cluster. I can file a JIRA, but I wouldn't be able
>> to provide information other than what I already posted, and I don't
>> have the logs from that time. Should I still file?
>
> Yes. Jira is a better place for tracking and fixing bugs. I am pretty
> sure what you saw is a bug (either one already filed or one that needs
> to be fixed).
>
> Raghu.
>
>> Thanks,
>> Tamir
>>
>> On Tue, May 5, 2009 at 9:14 PM, Raghu Angadi <rang...@yahoo-inc.com> wrote:
>>
>>> Tamir,
>>>
>>> Please file a jira on the problem you are seeing with 'saveLeases'. In
>>> the past there have been multiple fixes in this area (HADOOP-3418,
>>> HADOOP-3724, and more mentioned in HADOOP-3724).
>>>
>>> Also refer to the thread you started:
>>> http://www.mail-archive.com/core-user@hadoop.apache.org/msg09397.html
>>>
>>> I think another user reported the same problem recently.
>>>
>>> These are indeed very serious and very annoying bugs.
>>>
>>> Raghu.
>>>
>>> Tamir Kamara wrote:
>>>
>>>> I didn't have a space problem which led to it (I think). The
>>>> corruption started after I bounced the cluster.
>>>> At the time, I tried to investigate what led to the corruption, but
>>>> didn't find anything useful in the logs besides this line:
>>>>
>>>> saveLeases found path
>>>> /tmp/temp623789763/tmp659456056/_temporary_attempt_200904211331_0010_r_000002_0/part-00002
>>>> but no matching entry in namespace
>>>>
>>>> I also tried to recover from the secondary name node files, but the
>>>> corruption was too wide-spread and I had to format.
>>>>
>>>> Tamir
>>>>
>>>> On Mon, May 4, 2009 at 4:48 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> Same conditions - where the space ran out and the fs got corrupted?
>>>>>
>>>>> Or did it get corrupted by itself (which is even more worrying)?
>>>>>
>>>>> Regards.
>>>>>
>>>>> 2009/5/4 Tamir Kamara <tamirkam...@gmail.com>
>>>>>
>>>>>> I had the same problem a couple of weeks ago with 0.19.1. Had to
>>>>>> reformat the cluster too...
>>>>>>
>>>>>> On Mon, May 4, 2009 at 3:50 PM, Stas Oskin <stas.os...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> After rebooting the NameNode server, I found out the NameNode
>>>>>>> doesn't start anymore.
>>>>>>>
>>>>>>> The logs contained this error:
>>>>>>> "FSNamesystem initialization failed"
>>>>>>>
>>>>>>> I suspected filesystem corruption, so I tried to recover from the
>>>>>>> SecondaryNameNode. Problem is, it was completely empty!
>>>>>>>
>>>>>>> I had an issue that might have caused this - the root mount had
>>>>>>> run out of space. But both the NameNode and the SecondaryNameNode
>>>>>>> directories were on another mount point with plenty of space
>>>>>>> there - so it's very strange that they were impacted in any way.
>>>>>>>
>>>>>>> Perhaps the logs, which were located on the root mount and as a
>>>>>>> result could not be written, have caused this?
>>>>>>>
>>>>>>> To get HDFS running again, I had to format it (including manually
>>>>>>> erasing the files from the DataNodes). While this is reasonable in
>>>>>>> a test environment, production-wise it would be very bad.
>>>>>>>
>>>>>>> Any idea why it happened, and what can be done to prevent it in
>>>>>>> the future?
>>>>>>>
>>>>>>> I'm using the stable 0.18.3 version of Hadoop.
>>>>>>>
>>>>>>> Thanks in advance!
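A common safeguard against losing the namespace to a single bad disk or mount, in Hadoop of this era, is to list several directories for the NameNode image/edits so every copy is written in parallel. A minimal sketch for an 0.18.x-style hadoop-site.xml - the paths here are illustrative, not from the thread:

```xml
<!-- hadoop-site.xml fragment (example paths; adjust to your mounts) -->
<property>
  <name>dfs.name.dir</name>
  <!-- Comma-separated list: the NameNode writes its fsimage and edits
       to every directory, so one surviving copy is enough to recover.
       Putting one copy on a separate mount (e.g. NFS) guards against
       a single full or failed disk. -->
  <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/dfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <!-- Where the SecondaryNameNode keeps its periodic checkpoints;
       also best kept off the root mount. -->
  <value>/data1/dfs/namesecondary</value>
</property>
```

With more than one dfs.name.dir, a corruption like the one above can often be repaired from an intact copy instead of reformatting the whole cluster.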