[ https://issues.apache.org/jira/browse/HDFS-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lars Hofhansl resolved HDFS-4455.
---------------------------------
    Resolution: Implemented

> Datanode sometimes gives up permanently on Namenode in HA setup
> ---------------------------------------------------------------
>
>                 Key: HDFS-4455
>                 URL: https://issues.apache.org/jira/browse/HDFS-4455
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 2.0.2-alpha
>            Reporter: Lars Hofhansl
>            Assignee: Juan Yu
>            Priority: Critical
>
> Today we got ourselves into a situation where we hard-killed the cluster (kill -9 across the board on all processes), and upon restarting, all DNs would permanently give up on one of the NNs in our two-NN HA setup (using QJM).
> The HA setup is correct (prior to this we had failed over the NNs many times for testing). Bouncing the DNs resolved the problem.
> In the logs I see this exception:
> {code}
> 2013-01-29 23:32:49,461 FATAL datanode.DataNode - Initialization failed for block pool Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "<host>/<ip>"; destination host is: "<host>":8020;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1164)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.registerDatanode(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.registerDatanode(DatanodeProtocolClientSideTranslatorPB.java:149)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.register(BPServiceActor.java:619)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:661)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Response is null.
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:885)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
> 2013-01-29 23:32:49,463 WARN datanode.DataNode - Ending block pool service for: Block pool BP-1852726028-<ip>-1358813649047 (storage id DS-60505003-<ip>-50010-1353106051747) service to <host>/<ip>:8020
> {code}
> So somehow in BPServiceActor.connectToNNAndHandshake() we made it all the way to register(). We then failed in bpNamenode.registerDatanode(bpRegistration) with an IOException, which is not caught and causes the block pool service to fail as a whole.
> No doubt that was caused by one of the NNs being in a weird state. While that happened, the active NN claimed that the FS was corrupted and stayed in safe mode, and DNs only registered with the standby NN. Failing over to the 2nd NN and then restarting the first NN and failing back did not change that.
> No amount of bouncing/failing over the HA NNs would get the DNs to reconnect to one of the NNs.
> In BPServiceActor.register(), should we catch IOException instead of SocketTimeoutException? That way it would continue to retry and eventually connect to the NN.
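> A minimal sketch of that proposed change, assuming the retry loop in BPServiceActor.register() currently narrows its catch clause to SocketTimeoutException (the surrounding helper names here are illustrative, not necessarily the actual Hadoop source):
> {code}
> void register() throws IOException {
>   // Keep retrying the registration handshake instead of giving up on the
>   // first non-timeout failure.
>   while (shouldRun()) {
>     try {
>       // The "Response is null." failure above surfaces here as a plain
>       // IOException, not a SocketTimeoutException.
>       bpRegistration = bpNamenode.registerDatanode(bpRegistration);
>       break;
>     } catch (IOException e) {
>       // Previously only SocketTimeoutException was caught, so any other
>       // IOException propagated up and permanently ended the block pool
>       // service. Catching IOException keeps the actor retrying until the
>       // NN recovers or a failover completes.
>       LOG.info("Problem connecting to server: " + nnAddr, e);
>       sleepAndLogInterrupts(1000, "connecting to server");
>     }
>   }
> }
> {code}

-- This message was sent by Atlassian JIRA (v6.2#6252)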