[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13898885#comment-13898885 ]
Hudson commented on HDFS-4858:
------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #5152 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5152/])
HDFS-4858. HDFS DataNode to NameNode RPC should timeout. Contributed by Henry Wang. (shv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1567535)
* /hadoop/common/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/protocolPB/DatanodeProtocolClientSideTranslatorPB.java

> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
>                 Key: HDFS-4858
>                 URL: https://issues.apache.org/jira/browse/HDFS-4858
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
>         Environment: Redhat/CentOS 6.4 64 bit Linux
>            Reporter: Jagane Sundar
>            Priority: Minor
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: HDFS-4858.patch, HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval
> 14000. This configuration means that the IPC Client (DataNode, in this case)
> should time out after 14000 milliseconds (14 seconds) if the Standby NameNode
> does not respond to a sendHeartbeat.
> What we observe is this: If the Standby NameNode happens to reboot for any
> reason, the DataNodes that are heartbeating to this Standby get stuck forever
> while trying to sendHeartbeat. See the stack trace included below. When the
> Standby NameNode comes back up, we find that the DataNode never re-registers
> with the Standby NameNode. Thereafter failover completely fails.
> The desired behavior is that the DataNode's sendHeartbeat should time out in
> 14 seconds, and keep retrying till the Standby NameNode comes back up.
> When it does, the DataNode should reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the
> method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to
> create the DatanodeProtocolPB object.
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
>   State: WAITING
>   Blocked count: 23843
>   Waited count: 45676
>   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client.call(Client.java:1220)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>     java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out.
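
For readers who want to see the desired behavior in isolation, the following is a minimal sketch in plain Java and not the committed patch. It assumes nothing about Hadoop's internals: the class HeartbeatTimeoutSketch and the methods callWithTimeout and heartbeatUntilAnswered are hypothetical names invented for this illustration, and the heartbeat is modeled as an ordinary Callable. It contrasts with the unbounded Object.wait() in the stack trace above by bounding each wait and retrying after a timeout.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; none of these names exist in Hadoop.
class HeartbeatTimeoutSketch {

    /** Runs one RPC-like call and gives up after timeoutMs instead of waiting forever. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> rpc, long timeoutMs)
            throws SocketTimeoutException, InterruptedException, ExecutionException {
        Future<T> pending = pool.submit(rpc);
        try {
            // Bounded wait: the caller regains control even if the peer never answers.
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // interrupt the stuck call
            throw new SocketTimeoutException("no response within " + timeoutMs + " ms");
        }
    }

    /**
     * Retries the call after each timeout.
     * Returns the 1-based attempt that succeeded, or -1 if maxAttempts all timed out.
     */
    static int heartbeatUntilAnswered(Callable<Boolean> rpc, long timeoutMs, int maxAttempts)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try {
                    callWithTimeout(pool, rpc, timeoutMs);
                    return attempt;
                } catch (SocketTimeoutException e) {
                    // Peer is down or unresponsive; fall through and try again.
                }
            }
            return -1;
        } finally {
            pool.shutdownNow(); // interrupt any calls that are still blocked
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated peer: unresponsive for the first two calls, then answers.
        Callable<Boolean> flakyPeer = () -> {
            if (calls.incrementAndGet() < 3) {
                Thread.sleep(60_000);
            }
            return true;
        };
        int attempts = heartbeatUntilAnswered(flakyPeer, 200, 5);
        System.out.println("answered on attempt " + attempts);
    }
}
```

The sketch caps the number of attempts so that it terminates; the issue description asks for retries to continue until the Standby NameNode returns, which would correspond to an unbounded loop with the same per-call timeout.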