Re: ERROR dfs.NameNode - java.io.EOFException
) task_200806101759_0370_m_76_0:     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.run(MapTask.java:439)
task_200806101759_0370_m_76_0: Caused by: java.io.IOException: No space left on device
task_200806101759_0370_m_76_0:     at java.io.FileOutputStream.writeBytes(Native Method)
task_200806101759_0370_m_76_0:     at java.io.FileOutputStream.write(FileOutputStream.java:260)
task_200806101759_0370_m_76_0:     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:169)
task_200806101759_0370_m_76_0:     ... 17 more

The weird thing is that none of the nodes in the cluster are out of disk space now! (Maybe one of the nodes really did run out of disk temporarily, some files got cleaned up afterwards, and so the disks are no longer full.)

$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h df -h | grep mnt; done;
localhost                                          // this is the NN
/dev/sdb              414G  384G  9.1G  98% /mnt
/dev/sdc              414G   91G  302G  24% /mnt2
10.252.93.155
/dev/sdb              414G  389G  3.7G 100% /mnt   // but it has 3.7GB free!
/dev/sdc              414G   93G  300G  24% /mnt2
10.252.239.63
/dev/sdb              414G  388G  4.5G  99% /mnt
/dev/sdc              414G   90G  303G  23% /mnt2
10.251.235.224
/dev/sdb              414G  362G   32G  93% /mnt
/dev/sdc              414G   92G  301G  24% /mnt2
10.252.230.32
/dev/sdb              414G  189G  205G  48% /mnt
/dev/sdc              414G  183G  210G  47% /mnt2

The error when I try starting the NN now is an EOFException. Is there something that tells Hadoop how many records to read from the edits file? If so, and that number is greater than the number of records actually present in the edits file, then I don't think I'll be able to fix the problem by removing lines from the edits file. No?

I don't mind losing *some* data in HDFS. Can I just remove the edits file completely and assume the NN will start (even though it won't know about some portion of the data in HDFS, which I assume I would then have to find and remove/clean up somehow myself)?

I've been running Hadoop for several months now, yet the first records in the "edits" file seem to be from 2008-06-10, and most of the records seem to be from June 10, while I started seeing errors in the logs on June 23. Here are some details:

## 74K lines
[EMAIL PROTECTED] logs]$ wc -l /mnt/nutch/filesystem/name/current/edits
74403 /mnt/nutch/filesystem/name/current/edits

## 454K lines of "strings"
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | wc -l
454271

## Nothing from before June 10
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 2008060 | wc -l
0

## 139K lines from June (nothing from before June 10)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | wc -l
139524

## Most of the records are from June 10; seems related to those problematic tasks from 20080610
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | grep -c 20080610
130519

## Not much from June 11 (12th, 13th, and so on)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 20080611 | wc -l
1834

## The last few non-June-10 lines with the string "200806" in them. I think the 27th is when Hadoop completely died.
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | grep 200806 | grep -v 20080610 | tail
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00012
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00013
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00014
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00011
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-5
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-4
F/user/otis/crawl/segments/20080627171831/crawl_generate/_temporary
;/user/otis/crawl/segments/20080627171831/crawl_generate
;/user/otis/crawl/segments/20080627171831/crawl_generate
7/user/otis/crawl/segments/20080627171831/_temporary

## The last few lines
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | tail
67108864
otis
supergroup
f/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch
1214687074224
otis
supergroup
q/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch/part-4
1214687074224
otis

Sorry for the long email. Maybe you will see something in all this. Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:36:37 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
>
> Hm
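For anyone following this recovery later: the edits file is binary, so the line counts from wc -l and strings above are only rough proxies for record counts. Before any surgery on it, a whole-directory backup and a byte-level look at the tail are cheap. A minimal sketch -- the path comes from the transcripts above, the backup file name is an assumption:

$ NAME_DIR=/mnt/nutch/filesystem/name
# Back up the entire current directory before touching anything.
$ tar czf ~/nn-current-backup-$(date +%Y%m%d).tar.gz -C $NAME_DIR current
# Look at the last bytes of edits; a record cut off mid-write by the
# full disk should show up as a short, garbage tail.
$ tail -c 256 $NAME_DIR/current/edits | hexdump -C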
Re: ERROR dfs.NameNode - java.io.EOFException
Hm, I tried it (simply edited with vi and removed the last line). I did it with both the edits file and the fsimage file (I see references to FSEditLog.java and FSImage.java in the stack trace below), but that didn't seem to help. The NameNode just doesn't start at all. I can't see any errors in the logs from the failed startup; I just see the exception below when I actually try using Hadoop.

What's the damage if I simply remove all of their content? (a la "> edits" and "> fsimage")

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:08:57 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
>
> I remember dhruba telling me about this once.
> Yes, take a backup of the whole current directory.
> As you have seen, remove the last line from edits and try to start the
> NameNode.
> If it starts, then run fsck to find out which file had the problem.
> Thanks,
> Lohit
>
> ----- Original Message ----
> From: Otis Gospodnetic
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 4:46:57 PM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
>
> Hi,
>
> If it helps with the problem below -- I don't mind losing some data.
> For instance, I see my "edits" file has about 74K lines.
> Can I just nuke the edits file or remove the last N lines?
>
> I am looking at the edits file with vi and I see the very last line is very
> short - it looks like it was cut off, incomplete, and some of the logs do
> mention running out of disk space (even though the NN machine has some more
> free space).
>
> Could I simply remove this last incomplete line?
>
> Any help would be greatly appreciated.
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> > ----- Original Message ----
> > From: Otis Gospodnetic
> > To: core-user@hadoop.apache.org
> > Sent: Friday, July 4, 2008 2:00:58 AM
> > Subject: ERROR dfs.NameNode - java.io.EOFException
> >
> > Hi,
> >
> > Using Hadoop 0.16.2, I am seeing the following in the NN log:
> >
> > 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> >     at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >     at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> >     at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> >     at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> >     at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> >     at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> >     at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> >     at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> >     at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> >     at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> >     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> >     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> >     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> >     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> >     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> >
> > The exception doesn't include the name and location of the file whose
> > reading is failing and causing the EOFException :(
> > But it looks like it's the fsedit log (the "edits" file, I think).
> >
> > There is no secondary NN in the cluster.
> >
> > Is there any way I can revive this NN? Any way to "fix" the corrupt
> > "edits" file?
> >
> > Thanks,
> > Otis
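Editing a binary file like edits with vi, as tried above, can itself mangle it: vi may add a trailing newline or re-encode bytes, and a "line" has no meaning in a binary record stream. A byte-level cut is safer. A rough sketch, assuming the offset of the start of the truncated final record has been found by eyeballing a hexdump (the record layout is version-specific, and the offset below is purely a placeholder):

$ NAME_DIR=/mnt/nutch/filesystem/name
$ cp $NAME_DIR/current/edits $NAME_DIR/current/edits.orig
# Find a plausible cut point by inspecting the tail by hand.
$ hexdump -C $NAME_DIR/current/edits.orig | tail -40
# OFFSET = byte position where the partial final record begins;
# 12345 is only a placeholder found from the hexdump above.
$ OFFSET=12345
# Keep the first OFFSET bytes and drop the rest (bs=1 is slow but exact).
$ dd if=$NAME_DIR/current/edits.orig of=$NAME_DIR/current/edits bs=1 count=$OFFSET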
Re: ERROR dfs.NameNode - java.io.EOFException
I remember dhruba telling me about this once.
Yes, take a backup of the whole current directory.
As you have seen, remove the last line from edits and try to start the NameNode.
If it starts, then run fsck to find out which file had the problem.

Thanks,
Lohit

----- Original Message ----
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, July 4, 2008 4:46:57 PM
Subject: Re: ERROR dfs.NameNode - java.io.EOFException

Hi,

If it helps with the problem below -- I don't mind losing some data.
For instance, I see my "edits" file has about 74K lines.
Can I just nuke the edits file or remove the last N lines?

I am looking at the edits file with vi and I see the very last line is very
short - it looks like it was cut off, incomplete, and some of the logs do
mention running out of disk space (even though the NN machine has some more
free space).

Could I simply remove this last incomplete line?

Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 2:00:58 AM
> Subject: ERROR dfs.NameNode - java.io.EOFException
>
> Hi,
>
> Using Hadoop 0.16.2, I am seeing the following in the NN log:
>
> 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>     at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>     at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
>     at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
>     at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
>     at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
>     at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
>     at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
>     at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
>     at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
>     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
>     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
>     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
>
> The exception doesn't include the name and location of the file whose
> reading is failing and causing the EOFException :(
> But it looks like it's the fsedit log (the "edits" file, I think).
>
> There is no secondary NN in the cluster.
>
> Is there any way I can revive this NN? Any way to "fix" the corrupt
> "edits" file?
>
> Thanks,
> Otis
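Spelled out as a rough shell sketch, Lohit's procedure might look like the following. stop-dfs.sh, start-dfs.sh, and "hadoop fsck" are the standard bin/ scripts of this era; the install path and backup location are assumptions:

$ cd /usr/local/hadoop          # hypothetical install directory
$ bin/stop-dfs.sh

# Back up the whole current directory before anything else.
$ cp -a /mnt/nutch/filesystem/name/current /mnt/nutch/filesystem/name/current.bak

# ... trim the partial final record from edits (see the dd sketch above) ...

$ bin/start-dfs.sh

# If the NameNode comes up, find the file(s) affected by the trimmed record.
$ bin/hadoop fsck /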
Re: ERROR dfs.NameNode - java.io.EOFException
Hi,

If it helps with the problem below -- I don't mind losing some data.
For instance, I see my "edits" file has about 74K lines.
Can I just nuke the edits file or remove the last N lines?

I am looking at the edits file with vi and I see the very last line is very
short - it looks like it was cut off, incomplete, and some of the logs do
mention running out of disk space (even though the NN machine has some more
free space).

Could I simply remove this last incomplete line?

Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 2:00:58 AM
> Subject: ERROR dfs.NameNode - java.io.EOFException
>
> Hi,
>
> Using Hadoop 0.16.2, I am seeing the following in the NN log:
>
> 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>     at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
>     at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
>     at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
>     at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
>     at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
>     at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
>     at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
>     at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
>     at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
>     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
>     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
>     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
>
> The exception doesn't include the name and location of the file whose
> reading is failing and causing the EOFException :(
> But it looks like it's the fsedit log (the "edits" file, I think).
>
> There is no secondary NN in the cluster.
>
> Is there any way I can revive this NN? Any way to "fix" the corrupt
> "edits" file?
>
> Thanks,
> Otis
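The truncated final record almost certainly came from the disk filling up mid-write, so a guard against a repeat is cheap insurance. A trivial cron-able sketch; the threshold, mount point, script name, and recipient address are all arbitrary assumptions:

#!/bin/sh
# warn-disk-full.sh -- mail a warning when the volume holding the
# NameNode's name directory is nearly full.
THRESHOLD=90      # percent used; arbitrary
MOUNT=/mnt        # volume holding /mnt/nutch/filesystem/name
USED=`df -P $MOUNT | awk 'NR==2 {sub(/%/,"",$5); print $5}'`
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "$MOUNT is ${USED}% full on `hostname`" \
        | mail -s "NameNode disk space warning" admin@example.com
fi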
ERROR dfs.NameNode - java.io.EOFException
Hi,

Using Hadoop 0.16.2, I am seeing the following in the NN log:

2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

The exception doesn't include the name and location of the file whose reading is failing and causing the EOFException :(
But it looks like it's the fsedit log (the "edits" file, I think).

There is no secondary NN in the cluster.

Is there any way I can revive this NN? Any way to "fix" the corrupt "edits" file?

Thanks,
Otis
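The "no secondary NN" detail matters here: without a SecondaryNameNode, edits are never periodically merged back into fsimage, so the edits file just grows until something (in this thread, a full disk) corrupts it. A rough sketch of turning one on in Hadoop of this vintage; the host name is a placeholder and the exact mechanics vary by version:

# List the host that should run the SecondaryNameNode;
# bin/start-dfs.sh reads conf/masters to launch it there.
$ echo nn-checkpoint-host > conf/masters    # placeholder host name

# Or start the daemon by hand on that machine:
$ bin/hadoop-daemon.sh start secondarynamenode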