I have fixed the issue of the SecondaryNameNode not contacting the primary by setting the 'dfs.http.address' config option.
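For the archives, here is roughly what that looks like. This is only a sketch of the hadoop-site.xml read by the SecondaryNameNode; "namenode-host" and the ports are placeholders for the actual master (50070 is just the default NameNode HTTP port, 9000 a commonly used filesystem port):

  <!-- hadoop-site.xml on the SecondaryNameNode host.
       Sketch only: hostname and ports below are placeholders. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:9000</value>
  </property>
  <property>
    <!-- Tells the secondary where to reach the primary's HTTP
         interface to fetch the image/edits for checkpointing. -->
    <name>dfs.http.address</name>
    <value>namenode-host:50070</value>
  </property>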
Other issues still unsolved.

> -----Original Message-----
> From: Jonathan Gray [mailto:jl...@streamy.com]
> Sent: Monday, December 15, 2008 10:55 AM
> To: core-user@hadoop.apache.org
> Subject: NameNode fatal crash - 0.18.1
>
> I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase
> RegionServer.
>
> At the time of this crash, NameNode and SecondaryNameNode were both hosted
> on the same master.
>
> We do a nightly backup, and about 95% of the way through, HDFS crashed
> with...
>
> NameNode shows:
>
> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to
> sync edit log. Fatal Error.
> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal
> Error: All storage directories are inaccessible.
> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
>
> Every single DataNode shows:
>
> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Call failed on local exception
>     at org.apache.hadoop.ipc.Client.call(Client.java:718)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>     at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>     at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>     at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>     at java.lang.Thread.run(Thread.java:636)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
>
> This is virtually all of the information I have. At the same time as the
> backup, we have normal HBase traffic and our hourly batch MR jobs, so slave
> nodes were pretty heavily loaded, but I don't see anything in the DN logs
> besides this "Call failed". There are no space issues or anything else.
> Ganglia shows high CPU load around this time, which has been typical every
> night, but I don't see any issues in the DNs or the NN about expired
> leases/no heartbeats/etc.
>
> Is there a way to prevent this failure from happening in the first place?
> I guess just reduce total load across the cluster?
>
> The second question is about how to recover once the NameNode does fail...
>
> When trying to bring HDFS back up, we get hundreds of:
>
> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not
> found in lease.paths
>
> And then:
>
> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem:
> FSNamesystem initialization failed.
>
> Is there a way to recover from this? As of the time of this crash, we had
> the SecondaryNameNode on the same node. We are moving it to another node
> with sufficient memory now, but would that even prevent this kind of FS
> botching?
>
> Also, my SecondaryNameNode is telling me it cannot connect when trying to
> do a checkpoint:
>
> 2008-12-15 09:59:48,017 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in
> doCheckpoint:
> 2008-12-15 09:59:48,018 ERROR
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode:
> java.net.ConnectException: Connection refused
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>
> I changed my masters file to contain just the hostname of the
> SecondaryNameNode. This seems to have properly started the NameNode where
> I launched ./bin/start-dfs.sh from, and started the SecondaryNameNode on
> the correct node as well. But it seems to be unable to connect back to the
> primary.
> I have hadoop-site.xml pointing fs.default.name to the primary, but
> otherwise there are no links back. Where would I specify to the secondary
> where the primary is located?
>
> We're also upgrading to Hadoop 0.19.0 at this time.
>
> Thank you for any help.
>
> Jonathan Gray
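P.S. On the original "All storage directories are inaccessible" crash: one mitigation we are looking at (not something we have verified yet) is giving dfs.name.dir a comma-separated list of directories on separate disks or an NFS mount. The NameNode writes the image and edit log to every directory listed, and that fatal error only fires when all of them become inaccessible, so a second directory on independent storage should let it survive losing one. A sketch, with placeholder paths:

  <!-- hadoop-site.xml on the NameNode host.
       Sketch only: both paths below are placeholders. -->
  <property>
    <name>dfs.name.dir</name>
    <!-- Image and edits are written to each directory listed;
         losing a single directory is then survivable. -->
    <value>/hadoop/dfs/name,/mnt/remote-backup/dfs/name</value>
  </property>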