Jonathan,

It looks like the same thing happened to us today (on Hadoop 0.19.0). We were running a nightly backup, and at some point the namenode said:
<SNIP>
2009-01-08 05:57:28,021 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_2140680350762285267_117754, newgenerationstamp=117757, newlength=44866560, newtargets=[10.1.20.116:50010, 10.1.20.111:50010])
2009-01-08 05:57:30,270 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log. Fatal Error.
2009-01-08 05:57:30,882 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2009-01-08 05:57:31,072 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
</SNIP>

Now when I try and start up the namenode again, I get an EOFException:

<SNIP>
2009-01-08 10:41:45,465 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:549)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:540)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:227)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:216)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:289)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
</SNIP>

Did you ever figure out why your backup caused this to happen? Our backup wasn't even touching the Hadoop partitions on the master. Were you able to recover your DFS state?
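A couple of notes in case they help anyone else who finds this thread. Since the fatal error in both our cases is "All storage directories are inaccessible", we're going to list more than one directory in dfs.name.dir (one local disk plus an NFS mount), so the namenode still has a good copy of the image and edit log if one directory goes bad. Something like this in hadoop-site.xml (the paths are just examples from our setup):

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
  <!-- comma-separated list; the fsimage and edits are written to every
       directory, so any one surviving copy can be used for recovery -->
</property>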
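On recovery: my understanding (not yet tested here) is that if the secondary has a recent checkpoint sitting in fs.checkpoint.dir, you can point dfs.name.dir at an empty directory and start the namenode with the -importCheckpoint option, which loads the namespace from the checkpoint:

$ bin/hadoop namenode -importCheckpoint

You would lose any edits made after the last checkpoint, but at least the namespace comes back. Can anyone confirm this is the right procedure on 0.18/0.19?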
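Also, for the secondary-not-contacting-primary problem Jonathan describes in his reply below: as far as I can tell, the fix he mentions is to set dfs.http.address in the secondary's hadoop-site.xml to the primary namenode's HTTP address, since that is what the secondary uses to fetch the image and edits at checkpoint time. Hostname and port here are examples (50070 is the default):

<property>
  <name>dfs.http.address</name>
  <value>namenode.example.com:50070</value>
</property>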
--
Stefan

> From: Jonathan Gray <jl...@streamy.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Mon, 15 Dec 2008 12:35:39 -0800
> To: <core-user@hadoop.apache.org>
> Subject: RE: NameNode fatal crash - 0.18.1
>
> I have fixed the issue with the SecondaryNameNode not contacting the primary
> with the 'dfs.http.address' config option.
>
> Other issues still unsolved.
>
>> -----Original Message-----
>> From: Jonathan Gray [mailto:jl...@streamy.com]
>> Sent: Monday, December 15, 2008 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: NameNode fatal crash - 0.18.1
>>
>> I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase
>> RegionServer.
>>
>> At the time of this crash, the NameNode and SecondaryNameNode were both
>> hosted on the same master.
>>
>> We do a nightly backup, and about 95% of the way through, HDFS crashed
>> with...
>>
>> NameNode shows:
>>
>> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
>> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
>> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
>>
>> Every single DataNode shows:
>>
>> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
>>         at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>         at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>>         at java.lang.Thread.run(Thread.java:636)
>> Caused by: java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>
>> This is virtually all of the information I have. At the same time as the
>> backup, we have normal HBase traffic and our hourly batch MR jobs, so the
>> slave nodes were pretty heavily loaded, but I don't see anything in the DN
>> logs besides this "Call failed". There are no space issues or anything
>> else; Ganglia shows high CPU load around this time, which has been typical
>> every night, but I don't see any issues in the DNs or NN about expired
>> leases, missed heartbeats, etc.
>>
>> Is there a way to prevent this failure from happening in the first place?
>> I guess just reduce total load across the cluster?
>>
>> Second question is about how to recover once the NameNode does fail...
>>
>> When trying to bring HDFS back up, we get hundreds of:
>>
>> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths
>>
>> And then:
>>
>> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.
>>
>> Is there a way to recover from this? As of the time of this crash, we had
>> the SecondaryNameNode on the same node. We're moving it to another node
>> with sufficient memory now, but would that even prevent this kind of FS
>> botching?
>>
>> Also, my SecondaryNameNode is telling me it cannot connect when trying to
>> do a checkpoint:
>>
>> 2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
>> 2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>> I changed my masters file to just contain the hostname of the
>> SecondaryNameNode; this seems to have properly started the NameNode on the
>> node where I launched ./bin/start-dfs.sh, and started the
>> SecondaryNameNode on the correct node as well. But it seems to be unable
>> to connect back to the primary. I have hadoop-site.xml pointing
>> fs.default.name at the primary, but otherwise there are no links back.
>> Where would I specify to the secondary where the primary is located?
>>
>> We're also upgrading to Hadoop 0.19.0 at this time.
>>
>> Thank you for any help.
>>
>> Jonathan Gray