Jonathan,

It looks like the same thing happened to us today (on Hadoop 0.19.0). We were running a nightly backup, and at some point the namenode said:
<SNIP>
2009-01-08 05:57:28,021 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_2140680350762285267_117754, newgenerationstamp=117757, newlength=44866560, newtargets=[10.1.20.116:50010, 10.1.20.111:50010])
2009-01-08 05:57:30,270 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log. Fatal Error.
2009-01-08 05:57:30,882 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
2009-01-08 05:57:31,072 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
</SNIP>

Now when I try and start up the namenode again, I get an EOFException:

<SNIP>
2009-01-08 10:41:45,465 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:549)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:540)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:227)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:216)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:289)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
</SNIP>

Did you ever figure out why your backup caused this to happen? Our backup wasn't even touching the Hadoop partitions on the master. Were you able to recover your DFS state?
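A couple of notes in case they help anyone else who finds this thread. Since the fatal error in both our cases is "All storage directories are inaccessible", we're going to list more than one directory in dfs.name.dir (one local disk plus an NFS mount), so the namenode still has a good copy of the image and edit log if one directory goes bad. Something like this in hadoop-site.xml (the paths are just examples from our setup):

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/name,/mnt/nfs/hadoop/name</value>
  <!-- comma-separated list; the fsimage and edits are written to every
       directory, so any one surviving copy can be used for recovery -->
</property>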
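On recovery: my understanding (not yet tested here) is that if the secondary has a recent checkpoint sitting in fs.checkpoint.dir, you can point dfs.name.dir at an empty directory and start the namenode with the -importCheckpoint option, which loads the namespace from the checkpoint:

$ bin/hadoop namenode -importCheckpoint

You would lose any edits made after the last checkpoint, but at least the namespace comes back. Can anyone confirm this is the right procedure on 0.18/0.19?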
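Also, for the secondary-not-contacting-primary problem Jonathan describes in his reply below: as far as I can tell, the fix he mentions is to set dfs.http.address in the secondary's hadoop-site.xml to the primary namenode's HTTP address, since that is what the secondary uses to fetch the image and edits at checkpoint time. Hostname and port here are examples (50070 is the default):

<property>
  <name>dfs.http.address</name>
  <value>namenode.example.com:50070</value>
</property>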
--
Stefan

> From: Jonathan Gray <jl...@streamy.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Mon, 15 Dec 2008 12:35:39 -0800
> To: <core-user@hadoop.apache.org>
> Subject: RE: NameNode fatal crash - 0.18.1
>
> I have fixed the issue with the SecondaryNameNode not contacting the primary
> with the 'dfs.http.address' config option.
>
> Other issues still unsolved.
>
>> -----Original Message-----
>> From: Jonathan Gray [mailto:jl...@streamy.com]
>> Sent: Monday, December 15, 2008 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: NameNode fatal crash - 0.18.1
>>
>> I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase
>> RegionServer.
>>
>> At the time of this crash, the NameNode and SecondaryNameNode were both
>> hosted on the same master.
>>
>> We do a nightly backup, and about 95% of the way through, HDFS crashed
>> with...
>>
>> NameNode shows:
>>
>> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
>> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error : All storage directories are inaccessible.
>> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
>>
>> Every single DataNode shows:
>>
>> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call failed on local exception
>>         at org.apache.hadoop.ipc.Client.call(Client.java:718)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>>         at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>>         at java.lang.Thread.run(Thread.java:636)
>> Caused by: java.io.EOFException
>>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>>
>> This is virtually all of the information I have. At the same time as the
>> backup, we have normal HBase traffic and our hourly batch MR jobs, so the
>> slave nodes were pretty heavily loaded, but I don't see anything in the DN
>> logs besides this "Call failed". There are no space issues or anything
>> else; Ganglia shows high CPU load around this time, which has been typical
>> every night, but I don't see any issues in the DNs or NN about expired
>> leases, missed heartbeats, etc.
>>
>> Is there a way to prevent this failure from happening in the first place?
>> I guess just reduce total load across the cluster?
>>
>> Second question is about how to recover once the NameNode does fail...
>>
>> When trying to bring HDFS back up, we get hundreds of:
>>
>> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths
>>
>> And then:
>>
>> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem: FSNamesystem initialization failed.
>>
>> Is there a way to recover from this? As of the time of this crash, we had
>> the SecondaryNameNode on the same node. We're moving it to another node
>> with sufficient memory now, but would that even prevent this kind of FS
>> botching?
>>
>> Also, my SecondaryNameNode is telling me it cannot connect when trying to
>> do a checkpoint:
>>
>> 2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
>> 2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>> I changed my masters file to just contain the hostname of the
>> SecondaryNameNode; this seems to have properly started the NameNode on the
>> node where I launched ./bin/start-dfs.sh, and started the
>> SecondaryNameNode on the correct node as well. But it seems to be unable
>> to connect back to the primary. I have hadoop-site.xml pointing
>> fs.default.name at the primary, but otherwise there are no links back.
>> Where would I specify to the secondary where the primary is located?
>>
>> We're also upgrading to Hadoop 0.19.0 at this time.
>>
>> Thank you for any help.
>>
>> Jonathan Gray