[ https://issues.apache.org/jira/browse/HDFS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl resolved HDFS-4455.
---------------------------------

    Resolution: Implemented

> Datanode sometimes gives up permanently on Namenode in HA setup
> ---------------------------------------------------------------
>
>                 Key: HDFS-4455
>                 URL: https://issues.apache.org/jira/browse/HDFS-4455
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 2.0.2-alpha
>            Reporter: Lars Hofhansl
>            Assignee: Juan Yu
>            Priority: Critical
>
> Today we got ourselves into a situation where we hard killed the cluster 
> (kill -9 across the board on all processes) and upon restarting all DNs would 
> permanently give up on one of the NNs in our two NN HA setup (using QJM).
> The HA setup is correct (prior to this we failed over the NNs many times for 
> testing). Bouncing the DNs resolved the problem.
> In the logs I see this exception:
> {code}
> 2013-01-29 23:32:49,461 FATAL datanode.DataNode - Initialization failed for block pool Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020; 
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1164)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:149)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:619)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:661)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:885)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
> 2013-01-29 23:32:49,463 WARN  datanode.DataNode - Ending block pool service for: Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> {code}
> So somehow in BPServiceActor.connectToNNAndHandshake() we made it all the way 
> to register(). It then failed in bpNamenode.registerDatanode(bpRegistration) 
> with an IOException, which is not caught and causes the block pool service to 
> fail as a whole.
> No doubt that was caused by one of the NNs being in a weird state. While that 
> happened the active NN claimed that the FS was corrupted and stayed in safe 
> mode, and DNs only registered with the standby NN. Failing over to the 2nd NN 
> and then restarting the first NN and failing back did not change that.
> No amount of bouncing/failing over the HA NNs would have the DNs reconnect to 
> one of the NNs.
> In BPServiceActor.register(), should we catch IOException instead of 
> SocketTimeoutException? That way it would continue to retry and eventually 
> connect to the NN.
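
The retry behavior proposed in the description can be sketched roughly as follows. This is only an illustration of catching IOException (rather than just SocketTimeoutException) around the registration call, not the actual BPServiceActor code or the committed patch. The names RegisterRetrySketch, NameNodeStub and registerWithRetry are hypothetical, and the bounded attempt count is an assumption made to keep the example testable.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class RegisterRetrySketch {

    // Hypothetical stand-in for the NameNode RPC proxy; not the real Hadoop interface.
    interface NameNodeStub {
        void registerDatanode() throws IOException;
    }

    /**
     * Calls registerDatanode() until it succeeds or maxAttempts is reached.
     * Catching IOException also covers SocketTimeoutException, which is a subclass,
     * so errors such as "Response is null." no longer end the loop.
     * Returns the number of attempts that were needed.
     */
    static int registerWithRetry(NameNodeStub nn, int maxAttempts, long sleepMillis) {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                nn.registerDatanode();
                return attempt;
            } catch (IOException e) {
                // Remember the failure and try again after a short pause.
                last = e;
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while retrying", ie);
                }
            }
        }
        throw new IllegalStateException(
            "registration failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stub NN that fails twice with a plain IOException, then accepts the registration.
        NameNodeStub flaky = () -> {
            if (++calls[0] <= 2) {
                throw new IOException("Response is null.");
            }
        };
        int attempts = registerWithRetry(flaky, 5, 10L);
        System.out.println("registered after " + attempts + " attempts");
    }
}
```

With the stub above, the first two calls fail with a plain IOException and the third succeeds. A loop that only caught SocketTimeoutException would have let the first failure propagate and end the block pool service, as seen in the log.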



--
This message was sent by Atlassian JIRA
(v6.2#6252)
