I did some more poking around in the NameNode code to find out what might
have caused it to deem the storage directories inaccessible. Apparently it
does this when it catches an IOException while trying to write to or append
to the edit log.

I guess that makes sense, but unfortunately it doesn't log the actual
exception:

try {
    EditLogFileOutputStream eStream =
        new EditLogFileOutputStream(getEditNewFile(sd));
    eStream.create();
    editStreams.add(eStream);
} catch (IOException e) {
    // remove stream and this storage directory from list
    processIOError(sd);
    it.remove();
}
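
If I wanted to see the real cause, I imagine a one-line change along these
lines would do it (untested sketch; I'm assuming the enclosing class has the
usual commons-logging LOG handle that the other namenode classes use):

} catch (IOException e) {
    // log the underlying cause before discarding this storage directory
    LOG.warn("Unable to create edit log file " + getEditNewFile(sd), e);
    // remove stream and this storage directory from list
    processIOError(sd);
    it.remove();
}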

-- Stefan


> From: Stefan Will <stefan.w...@gmx.net>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Thu, 08 Jan 2009 11:16:29 -0800
> To: <core-user@hadoop.apache.org>
> Subject: Re: NameNode fatal crash - 0.18.1
> 
> Jonathan,
> 
> It looks like the same thing happened to us today (on Hadoop 0.19.0). We
> were running a nightly backup, and at some point, the namenode said:
> 
> <SNIP>
> 2009-01-08 05:57:28,021 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_2140680350762285267_117754, newgenerationstamp=117757, newlength=44866560, newtargets=[10.1.20.116:50010, 10.1.20.111:50010])
> 2009-01-08 05:57:30,270 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log. Fatal Error.
> 2009-01-08 05:57:30,882 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
> 2009-01-08 05:57:31,072 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> </SNIP>
> Now when I try and start up the namenode again, I get an EOFException:
> 
> <SNIP>
> 2009-01-08 10:41:45,465 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:549)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:540)
>         at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:227)
>         at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:216)
>         at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:289)
>         at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> </SNIP>
> 
> Did you ever figure out why your backup caused this to happen? Our backup
> wasn't even touching the Hadoop partitions on the master. Were you able to
> recover your DFS state?
> 
> -- Stefan
> 
> 
>> From: Jonathan Gray <jl...@streamy.com>
>> Reply-To: <core-user@hadoop.apache.org>
>> Date: Mon, 15 Dec 2008 12:35:39 -0800
>> To: <core-user@hadoop.apache.org>
>> Subject: RE: NameNode fatal crash - 0.18.1
>> 
>> I have fixed the issue with the SecondaryNameNode not contacting primary
>> with the 'dfs.http.address' config option.
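
(For anyone else hitting the secondary-to-primary connection problem: I
assume the setting Jonathan is referring to goes into the secondary's
hadoop-site.xml and points at the primary's HTTP server, roughly like the
snippet below. The hostname here is just a placeholder; 50070 is the default
NameNode HTTP port.)

<property>
  <name>dfs.http.address</name>
  <value>namenode.example.com:50070</value>
</property>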
>> 
>> Other issues still unsolved.
>> 
>>> -----Original Message-----
>>> From: Jonathan Gray [mailto:jl...@streamy.com]
>>> Sent: Monday, December 15, 2008 10:55 AM
>>> To: core-user@hadoop.apache.org
>>> Subject: NameNode fatal crash - 0.18.1
>>> 
>>> I have a 10+1 node cluster, with each slave running
>>> DataNode/TaskTracker/HBase RegionServer.
>>> 
>>> At the time of this crash, the NameNode and SecondaryNameNode were both
>>> hosted on the same master.
>>> 
>>> We do a nightly backup, and about 95% of the way through, HDFS crashed
>>> with...
>>> 
>>> NameNode shows:
>>> 
>>> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
>>> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
>>> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
>>> 
>>> Every single DataNode shows:
>>> 
>>> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode:
>>> java.io.IOException: Call failed on local exception
>>>         at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>>         at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>>>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>>>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>>>         at java.lang.Thread.run(Thread.java:636)
>>> Caused by: java.io.EOFException
>>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>> 
>>> 
>>> This is virtually all of the information I have.  At the same time as the
>>> backup, we have normal HBase traffic and our hourly batch MR jobs.  So the
>>> slave nodes were pretty heavily loaded, but I don't see anything in the DN
>>> logs besides this Call failed.  There are no space issues or anything else.
>>> Ganglia shows high CPU load around this time, which has been typical every
>>> night, but I don't see any issues in the DNs or NN about expired leases/no
>>> heartbeats/etc.
>>> 
>>> Is there a way to prevent this failure from happening in the first place?  I
>>> guess just reduce the total load across the cluster?
>>> 
>>> Second question is about how to recover once NameNode does fail...
>>> 
>>> When trying to bring HDFS back up, we get hundreds of:
>>> 
>>> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths
>>> 
>>> And then
>>> 
>>> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem:
>>> FSNamesystem initialization failed.
>>> 
>>> 
>>> Is there a way to recover from this?  As of the time of this crash, we had
>>> the SecondaryNameNode on the same node.  I'm moving it to another node with
>>> sufficient memory now, but would that even prevent this kind of FS botching?
>>> 
>>> Also, my SecondaryNameNode is telling me it cannot connect when trying to
>>> do a checkpoint:
>>> 
>>> 2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
>>> 2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
>>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>> 
>>> I changed my masters file to just contain the hostname of the
>>> secondarynamenode.  This seems to have properly started the NameNode where I
>>> launched ./bin/start-dfs.sh from, and started the SecondaryNameNode on the
>>> correct node as well.  But it seems to be unable to connect back to the
>>> primary.  I have hadoop-site.xml pointing fs.default.name to the primary,
>>> but otherwise there are no links back.  Where would I specify to the
>>> secondary where the primary is located?
>>> 
>>> We're also upgrading to Hadoop 0.19.0 at this time.
>>> 
>>> Thank you for any help.
>>> 
>>> Jonathan Gray
> 
