Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-06 Thread Otis Gospodnetic
task_200806101759_0370_m_76_0:  at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$1.run(MapTask.java:439)
task_200806101759_0370_m_76_0: Caused by: java.io.IOException: No space left on device
task_200806101759_0370_m_76_0:  at java.io.FileOutputStream.writeBytes(Native Method)
task_200806101759_0370_m_76_0:  at java.io.FileOutputStream.write(FileOutputStream.java:260)
task_200806101759_0370_m_76_0:  at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:169)
task_200806101759_0370_m_76_0:  ... 17 more

The weird thing is that none of the nodes in the cluster are out of disk space now!
(Though maybe one of the nodes really did run out of disk temporarily, some files
got cleaned up afterwards, and that's why the disks no longer look full.)

$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h df -h | grep mnt; done;
localhost   // this is the NN
/dev/sdb  414G  384G  9.1G  98% /mnt
/dev/sdc  414G   91G  302G  24% /mnt2
10.252.93.155
/dev/sdb  414G  389G  3.7G 100% /mnt  // but it has 3.7GB free!
/dev/sdc  414G   93G  300G  24% /mnt2
10.252.239.63
/dev/sdb  414G  388G  4.5G  99% /mnt
/dev/sdc  414G   90G  303G  23% /mnt2
10.251.235.224
/dev/sdb  414G  362G   32G  93% /mnt
/dev/sdc  414G   92G  301G  24% /mnt2
10.252.230.32
/dev/sdb  414G  189G  205G  48% /mnt
/dev/sdc  414G  183G  210G  47% /mnt2
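
(It also occurs to me that "No space left on device" can mean the disk ran out of inodes rather
than blocks.  I haven't checked that yet, but the same loop as above with df -i instead of df -h
should show it:)

$ for h in `cat ~/nutch/conf/slaves`; do echo $h; ssh $h df -i | grep mnt; done;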


The error when I try starting the NN now is an EOFException.  Is there
something that tells Hadoop how many records to read from the edits file?
If so, and that number is greater than the number of records actually left in
the edits file, then I don't think I'll be able to fix the problem by
removing lines from the edits file.  No?

I don't mind losing *some* data in HDFS.  Can I just remove the edits file
completely and assume that the NN will start (even though it won't know about
some portion of the data in HDFS, which I assume I would then have to find and
remove/clean up somehow myself)?
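
If that's a sane route, here is roughly what I'd try -- paths are from my setup, and I honestly
don't know yet whether the NN will accept a zero-length edits file, so this is just a sketch:

$ cp -rp /mnt/nutch/filesystem/name/current /mnt/nutch/filesystem/name/current.bak.20080706
$ > /mnt/nutch/filesystem/name/current/edits    # truncate edits to zero length
$ bin/hadoop-daemon.sh start namenode           # from the Hadoop install dir
$ bin/hadoop fsck / | tail                      # then see what ended up missing/corrupt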

I've been running Hadoop for several months now.  The first records in the
"edits" file seem to be from 2008-06-10 and most of the records seem to be from
June 10, while I started seeing errors in the logs on June 23.  Here are some details:

## 74K lines
[EMAIL PROTECTED] logs]$ wc -l /mnt/nutch/filesystem/name/current/edits
74403 /mnt/nutch/filesystem/name/current/edits

## 454K lines of "strings"
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | wc 
-l
454271

## Nothing from before June 10
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 2008060 | wc -l
0

## 139K lines from June (nothing from before June 10)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | wc -l
139524

## most of the records are from June 10, seems related to those problematic 
tasks from 20080610
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | grep -c 20080610
130519

## not much from June 11 (12th, 13th, and so on)
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 20080611 | wc -l
1834

## the last few non-June-10 lines containing the string "200806".  I think the
27th is when Hadoop completely died.
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | 
grep 200806 | grep -v 20080610 | tail
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00012
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00013
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00014
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-00011
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-5
F/user/otis/crawl/segments/20080627171831/crawl_generate/part-4
F/user/otis/crawl/segments/20080627171831/crawl_generate/_temporary
;/user/otis/crawl/segments/20080627171831/crawl_generate
;/user/otis/crawl/segments/20080627171831/crawl_generate
7/user/otis/crawl/segments/20080627171831/_temporary

## the last few lines
[EMAIL PROTECTED] logs]$ strings /mnt/nutch/filesystem/name/current/edits | tail
67108864
otis
supergroup
f/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch
1214687074224
otis
supergroup
q/user/otis/crawl/segments/20080627171831/_temporary/_task_200806101759_0407_r_04_0/crawl_fetch/part-4
1214687074224
otis

Sorry for the long email.  Maybe you will see something in all this.  Any help 
would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:36:37 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> Hm

Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-05 Thread Otis Gospodnetic
Hm, tried it (simply edited with vi, removed the last line).  I did it with
both the edits file and the fsimage file (I see references to FSEditLog.java and
FSImage.java in the stack trace below), but that didn't seem to help.  The NameNode
just doesn't start at all.  I can't see any errors from the failed startup in the
logs; I just see the exception below when I actually try using Hadoop.
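
(The NN log I keep checking is the usual one under logs/ on the master, i.e.:

$ tail -100 logs/hadoop-*-namenode-*.log )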

What's the damage if I simply remove all of their content? (a la > edits and > fsimage)


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: lohit <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Saturday, July 5, 2008 10:08:57 AM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> I remember dhruba telling me about this once.
> Yes, take a backup of the whole current directory.
> As you have seen, remove the last line from edits and try to start the NameNode.
> 
> If it starts, then run fsck to find out which file had the problem. 
> Thanks,
> Lohit
> 
> - Original Message 
> From: Otis Gospodnetic 
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 4:46:57 PM
> Subject: Re: ERROR dfs.NameNode - java.io.EOFException
> 
> Hi,
> 
> If it helps with the problem below -- I don't mind losing some data.
> For instance, I see my "edits" file has about 74K lines.
> Can I just nuke the edits file or remove the last N lines?
> 
> I am looking at the edits file with vi and I see the very last line is very 
> short - it looks like it was cut off, incomplete, and some of the logs do 
> mention running out of disk space (even though the NN machine has some more free space).
> 
> Could I simply remove this last incomplete line?
> 
> Any help would be greatly appreciated.
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
> > From: Otis Gospodnetic 
> > To: core-user@hadoop.apache.org
> > Sent: Friday, July 4, 2008 2:00:58 AM
> > Subject: ERROR dfs.NameNode - java.io.EOFException
> > 
> > Hi,
> > 
> > Using Hadoop 0.16.2, I am seeing the following in the NN log:
> > 
> > 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> > at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> > at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> > at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> > at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> > at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> > at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> > at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> > at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> > at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> > at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> > at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> > at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> > at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> > at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> > 
> > The exception doesn't include the name and location of the file whose reading is failing and causing EOFException :(
> > But it looks like it's the fsedit log (the "edits" file, I think).
> > 
> > There is no secondary NN in the cluster.
> > 
> > Is there any way I can revive this NN?  Any way to "fix" the corrupt 
> > "edits" 
> > file?
> > 
> > Thanks,
> > Otis



Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-05 Thread lohit
I remember dhruba telling me about this once.
Yes, take a backup of the whole current directory.
As you have seen, remove the last line from edits and try to start the NameNode.
If it starts, then run fsck to find out which file had the problem. 
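
Something along these lines (just a sketch -- /path/to/dfs/name is a placeholder, use whatever
dfs.name.dir points at on your NN):

$ cp -rp /path/to/dfs/name/current /path/to/dfs/name/current.bak
  ... trim the truncated last record from current/edits ...
$ bin/hadoop-daemon.sh start namenode
$ bin/hadoop fsck / | less
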
Thanks,
Lohit

- Original Message 
From: Otis Gospodnetic <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, July 4, 2008 4:46:57 PM
Subject: Re: ERROR dfs.NameNode - java.io.EOFException

Hi,

If it helps with the problem below -- I don't mind losing some data.
For instance, I see my "edits" file has about 74K lines.
Can I just nuke the edits file or remove the last N lines?

I am looking at the edits file with vi and I see the very last line is very 
short - it looks like it was cut off, incomplete, and some of the logs do 
mention running out of disk space (even though the NN machine has some more 
free space).

Could I simply remove this last incomplete line?

Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 2:00:58 AM
> Subject: ERROR dfs.NameNode - java.io.EOFException
> 
> Hi,
> 
> Using Hadoop 0.16.2, I am seeing the following in the NN log:
> 
> 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> 
> The exception doesn't include the name and location of the file whose reading is failing and causing EOFException :(
> But it looks like it's the fsedit log (the "edits" file, I think).
> 
> There is no secondary NN in the cluster.
> 
> Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" 
> file?
> 
> Thanks,
> Otis


Re: ERROR dfs.NameNode - java.io.EOFException

2008-07-04 Thread Otis Gospodnetic
Hi,

If it helps with the problem below -- I don't mind losing some data.
For instance, I see my "edits" file has about 74K lines.
Can I just nuke the edits file or remove the last N lines?

I am looking at the edits file with vi and I see the very last line is very 
short - it looks like it was cut off, incomplete, and some of the logs do 
mention running out of disk space (even though the NN machine has some more 
free space).

Could I simply remove this last incomplete line?
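
(Since edits looks like a binary log, I suppose vi isn't the best lens; I could also dump the
tail as raw bytes before deciding what to chop -- path is from my setup:)

$ od -c /mnt/nutch/filesystem/name/current/edits | tail -20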

Any help would be greatly appreciated.

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 4, 2008 2:00:58 AM
> Subject: ERROR dfs.NameNode - java.io.EOFException
> 
> Hi,
> 
> Using Hadoop 0.16.2, I am seeing the following in the NN log:
> 
> 2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
> at java.io.DataInputStream.readFully(DataInputStream.java:180)
> at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
> at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
> at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
> at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
> at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
> at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
> at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
> at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
> at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
> at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
> at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
> at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
> at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
> 
> The exception doesn't include the name and location of the file whose reading is failing and causing EOFException :(
> But it looks like it's the fsedit log (the "edits" file, I think).
> 
> There is no secondary NN in the cluster.
> 
> Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" 
> file?
> 
> Thanks,
> Otis



ERROR dfs.NameNode - java.io.EOFException

2008-07-03 Thread Otis Gospodnetic
Hi,

Using Hadoop 0.16.2, I am seeing the following in the NN log:

2008-07-03 19:46:26,715 ERROR dfs.NameNode - java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:180)
at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:756)
at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

The exception doesn't include the name and location of the file whose reading 
is failing and causing EOFException :(
But it looks like it's the fsedit log (the "edits" file, I think).
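
(For reference, dfs.name.dir in my hadoop-site.xml points at the directory below, so I'm
assuming the file in question is current/edits under it:)

$ grep -A 1 dfs.name.dir conf/hadoop-site.xml
  <name>dfs.name.dir</name>
  <value>/mnt/nutch/filesystem/name</value>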

There is no secondary NN in the cluster.

Is there any way I can revive this NN?  Any way to "fix" the corrupt "edits" 
file?

Thanks,
Otis