I have a 10+1 node cluster, with each slave running a DataNode, TaskTracker, and HBase RegionServer.

At the time of this crash, the NameNode and SecondaryNameNode were both hosted on the same master. We run a nightly backup, and about 95% of the way through it HDFS crashed.

The NameNode shows:

2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:

Every single DataNode shows:

2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
        at org.apache.hadoop.ipc.Client.call(Client.java:718)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
        at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

This is virtually all of the information I have. At the same time as the backup we have normal HBase traffic plus our hourly batch MR jobs, so the slave nodes were fairly heavily loaded, but I don't see anything in the DataNode logs besides this "Call failed". There are no space issues or anything else like that. Ganglia shows high CPU load around this time, which is typical every night, and I don't see anything in the DN or NN logs about expired leases, missed heartbeats, etc.

Is there a way to prevent this failure from happening in the first place? Or is the answer just to reduce the total load across the cluster?
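One thing I'm wondering about on the prevention side: would listing more than one storage directory in dfs.name.dir (say, a local disk plus an NFS mount) have kept a usable copy of the image and edit log when the first directory became inaccessible? This is only a sketch of what I mean, not our actual config; the paths and the NFS mount below are made up:

<property>
  <name>dfs.name.dir</name>
  <!-- Hypothetical paths: a local directory plus an NFS mount, comma-separated,
       so the NameNode writes its image and edit log to both -->
  <value>/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>

Would that have made a difference here, or would whatever made the storage directories inaccessible likely have taken out both copies anyway?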
The second question is about how to recover once the NameNode does fail. When trying to bring HDFS back up, we get hundreds of:

2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths

and then:

2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.

Is there a way to recover from this? As of the time of this crash we had the SecondaryNameNode on the same node as the NameNode. I'm moving it to another node with sufficient memory now, but would that even prevent this kind of filesystem corruption?

Also, my SecondaryNameNode is telling me it cannot connect when trying to do a checkpoint:

2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)

I changed my masters file to contain only the hostname of the SecondaryNameNode. That properly started the NameNode on the node where I launched ./bin/start-dfs.sh and started the SecondaryNameNode on the correct node as well, but the secondary seems unable to connect back to the primary. My hadoop-site.xml points fs.default.name at the primary, but otherwise there are no links back. Where would I specify to the secondary where the primary is located? (I've pasted the relevant bit of my config in a P.S. below.)

We're also upgrading to Hadoop 0.19.0 at this time.

Thank you for any help.

Jonathan Gray
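P.S. For reference, this is roughly what the relevant part of my hadoop-site.xml looks like on both the primary and the secondary (the hostname and ports here are placeholders, not our real ones):

<property>
  <name>fs.default.name</name>
  <!-- Primary NameNode's RPC address -->
  <value>hdfs://namenode-host:9000</value>
</property>

I'm guessing the secondary also wants something like dfs.http.address pointed at the primary's HTTP port so it can pull the image and edits during a checkpoint, but that is only a guess on my part:

<property>
  <name>dfs.http.address</name>
  <!-- Guess: the primary NameNode's HTTP address, which I believe the
       SecondaryNameNode uses when fetching fsimage/edits -->
  <value>namenode-host:50070</value>
</property>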