I have a 10+1 node cluster, with each slave running a DataNode, TaskTracker, and HBase RegionServer.

At the time of this crash, the NameNode and SecondaryNameNode were both hosted on the same master. We run a nightly backup, and about 95% of the way through it HDFS crashed.

The NameNode shows:

2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:

Every single DataNode shows:

2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
        at org.apache.hadoop.ipc.Client.call(Client.java:718)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
        at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)

This is virtually all of the information I have. At the same time as the backup we have normal HBase traffic plus our hourly batch MR jobs, so the slave nodes were fairly heavily loaded, but I don't see anything in the DataNode logs besides this "Call failed". There are no space issues or anything else like that. Ganglia shows high CPU load around this time, which is typical every night, and I don't see anything in the DN or NN logs about expired leases, missed heartbeats, etc.

Is there a way to prevent this failure from happening in the first place? Or is the answer just to reduce the total load across the cluster?
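One thing I'm wondering about on the prevention side: would listing more than one storage directory in dfs.name.dir (say, a local disk plus an NFS mount) have kept a usable copy of the image and edit log when the first directory became inaccessible? This is only a sketch of what I mean, not our actual config; the paths and the NFS mount below are made up:

<property>
  <name>dfs.name.dir</name>
  <!-- Hypothetical paths: a local directory plus an NFS mount, comma-separated,
       so the NameNode writes its image and edit log to both -->
  <value>/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>

Would that have made a difference here, or would whatever made the storage directories inaccessible likely have taken out both copies anyway?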
The second question is about how to recover once the NameNode does fail. When trying to bring HDFS back up, we get hundreds of:

2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths

and then:

2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.

Is there a way to recover from this? As of the time of this crash we had the SecondaryNameNode on the same node as the NameNode. I'm moving it to another node with sufficient memory now, but would that even prevent this kind of filesystem corruption?

Also, my SecondaryNameNode is telling me it cannot connect when trying to do a checkpoint:

2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)

I changed my masters file to contain only the hostname of the SecondaryNameNode. That properly started the NameNode on the node where I launched ./bin/start-dfs.sh and started the SecondaryNameNode on the correct node as well, but the secondary seems unable to connect back to the primary. My hadoop-site.xml points fs.default.name at the primary, but otherwise there are no links back. Where would I specify to the secondary where the primary is located? (I've pasted the relevant bit of my config in a P.S. below.)

We're also upgrading to Hadoop 0.19.0 at this time.

Thank you for any help.

Jonathan Gray
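P.S. For reference, this is roughly what the relevant part of my hadoop-site.xml looks like on both the primary and the secondary (the hostname and ports here are placeholders, not our real ones):

<property>
  <name>fs.default.name</name>
  <!-- Primary NameNode's RPC address -->
  <value>hdfs://namenode-host:9000</value>
</property>

I'm guessing the secondary also wants something like dfs.http.address pointed at the primary's HTTP port so it can pull the image and edits during a checkpoint, but that is only a guess on my part:

<property>
  <name>dfs.http.address</name>
  <!-- Guess: the primary NameNode's HTTP address, which I believe the
       SecondaryNameNode uses when fetching fsimage/edits -->
  <value>namenode-host:50070</value>
</property>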