I have fixed the issue with the SecondaryNameNode not contacting the primary
by setting the 'dfs.http.address' config option.
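For anyone hitting the same checkpoint failure, the fix amounts to telling the secondary where the primary NameNode's HTTP interface lives, since the secondary fetches the image and edits over HTTP during a checkpoint. A minimal sketch of the relevant hadoop-site.xml entry on the SecondaryNameNode host; "master.example.com" is a placeholder for your primary's hostname, and 50070 is the default NameNode HTTP port:

```xml
<!-- hadoop-site.xml on the SecondaryNameNode host -->
<!-- Points the secondary at the primary NameNode's HTTP interface, -->
<!-- which it uses to fetch fsimage/edits when doing a checkpoint. -->
<!-- "master.example.com" is a placeholder; 50070 is the default port. -->
<property>
  <name>dfs.http.address</name>
  <value>master.example.com:50070</value>
</property>
```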

Other issues still unsolved.

> -----Original Message-----
> From: Jonathan Gray [mailto:jl...@streamy.com]
> Sent: Monday, December 15, 2008 10:55 AM
> To: core-user@hadoop.apache.org
> Subject: NameNode fatal crash - 0.18.1
> 
> I have a 10+1 node cluster, each slave running DataNode/TaskTracker/HBase
> RegionServer.
> 
> At the time of this crash, NameNode and SecondaryNameNode were both hosted
> on the same master.
> 
> We do a nightly backup and about 95% of the way through, HDFS crashed
> with...
> 
> NameNode shows:
> 
> 2008-12-15 01:49:31,178 ERROR org.apache.hadoop.fs.FSNamesystem: Unable to sync edit log. Fatal Error.
> 2008-12-15 01:49:31,178 FATAL org.apache.hadoop.fs.FSNamesystem: Fatal Error: All storage directories are inaccessible.
> 2008-12-15 01:49:31,179 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
> 
> Every single DataNode shows:
> 
> 2008-12-15 01:49:32,340 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Call failed on local exception
>         at org.apache.hadoop.ipc.Client.call(Client.java:718)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>         at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:655)
>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2888)
>         at java.lang.Thread.run(Thread.java:636)
> Caused by: java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:499)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:441)
> 
> 
> This is virtually all of the information I have.  At the same time as the
> backup, we have normal HBase traffic and our hourly batch MR jobs.  So the
> slave nodes were pretty heavily loaded, but I don't see anything in the DN
> logs besides this "Call failed".  There are no space issues or anything
> else.  Ganglia shows high CPU load around this time, which has been typical
> every night, but I don't see any issues in the DNs or NN about expired
> leases/no heartbeats/etc.
> 
> Is there a way to prevent this failure from happening in the first place?
> I guess just reduce total load across the cluster?
> 
> Second question is about how to recover once NameNode does fail...
> 
> When trying to bring HDFS back up, we get hundreds of:
> 
> 2008-12-15 07:54:13,265 ERROR org.apache.hadoop.dfs.LeaseManager: XXX not found in lease.paths
> 
> And then
> 
> 2008-12-15 07:54:13,267 ERROR org.apache.hadoop.fs.FSNamesystem:
> FSNamesystem initialization failed.
> 
> 
> Is there a way to recover from this?  As of the time of this crash, we had
> the SecondaryNameNode on the same node.  I am moving it to another node
> with sufficient memory now, but would that even prevent this kind of FS
> botching?
> 
> Also, my SecondaryNameNode is telling me it cannot connect when trying to
> do a checkpoint:
> 
> 2008-12-15 09:59:48,017 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint:
> 2008-12-15 09:59:48,018 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
> 
> I changed my masters file to just contain the hostname of the
> SecondaryNameNode.  This seems to have properly started the NameNode on
> the node where I launched ./bin/start-dfs.sh, and started the
> SecondaryNameNode on the correct node as well.  But the secondary seems to
> be unable to connect back to the primary.  I have hadoop-site.xml pointing
> fs.default.name at the primary, but otherwise there are no links back.
> Where would I specify to the secondary where the primary is located?
> 
> We're also upgrading to Hadoop 0.19.0 at this time.
> 
> Thank you for any help.
> 
> Jonathan Gray
