[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Henry Wang updated HDFS-4858:
-----------------------------
    Attachment: HDFS-4858.patch

> HDFS DataNode to NameNode RPC should timeout
> --------------------------------------------
>
>                 Key: HDFS-4858
>                 URL: https://issues.apache.org/jira/browse/HDFS-4858
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
>         Environment: Redhat/CentOS 6.4 64 bit Linux
>            Reporter: Jagane Sundar
>            Assignee: Jagane Sundar
>            Priority: Minor
>             Fix For: 3.0.0, 2.3.0
>
>         Attachments: HDFS-4858.patch, HDFS-4858.patch
>
>
> The DataNode is configured with ipc.client.ping false and ipc.ping.interval
> 14000. This configuration means that the IPC Client (the DataNode, in this
> case) should time out after 14000 milliseconds (14 seconds) if the Standby
> NameNode does not respond to a sendHeartbeat.
> What we observe is this: if the Standby NameNode happens to reboot for any
> reason, the DataNodes that are heartbeating to this Standby get stuck forever
> while trying to sendHeartbeat. See the stack trace included below. When the
> Standby NameNode comes back up, we find that the DataNode never re-registers
> with the Standby NameNode. Thereafter, failover fails completely.
> The desired behavior is that the DataNode's sendHeartbeat should time out in
> 14 seconds and keep retrying until the Standby NameNode comes back up. When
> it does, the DataNode should reconnect, re-register, and offer service.
> Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the
> method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to
> create the DatanodeProtocolPB object.
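The configuration rule described above (with ipc.client.ping disabled, ipc.ping.interval acts as the RPC timeout) can be sketched as a small standalone model. This is not Hadoop code: the class RpcTimeoutSketch, the method rpcTimeoutMs, the -1 "no timeout" sentinel, and the 60000 ms fallback are all assumptions made for illustration, and a plain Map stands in for the real configuration object.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone model of the rule in the issue description; NOT Hadoop code.
// All names here are hypothetical.
class RpcTimeoutSketch {
    static final String PING_KEY = "ipc.client.ping";
    static final String PING_INTERVAL_KEY = "ipc.ping.interval";
    // Assumed fallback when ipc.ping.interval is not set.
    static final int ASSUMED_DEFAULT_PING_INTERVAL_MS = 60000;

    // Returns the RPC timeout in milliseconds, or -1 (assumed sentinel)
    // when pings are enabled and the client waits without a timeout.
    static int rpcTimeoutMs(Map<String, String> conf) {
        boolean pingEnabled =
            Boolean.parseBoolean(conf.getOrDefault(PING_KEY, "true"));
        if (pingEnabled) {
            return -1;
        }
        return Integer.parseInt(conf.getOrDefault(
            PING_INTERVAL_KEY,
            String.valueOf(ASSUMED_DEFAULT_PING_INTERVAL_MS)));
    }

    public static void main(String[] args) {
        // The configuration from the issue description.
        Map<String, String> conf = new HashMap<>();
        conf.put(PING_KEY, "false");
        conf.put(PING_INTERVAL_KEY, "14000");
        System.out.println(rpcTimeoutMs(conf)); // prints 14000
    }
}
```

The value is in milliseconds, so 14000 corresponds to the 14-second timeout the reporter expects. The bug report is that this derived timeout never reaches the proxy used for sendHeartbeat.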
> Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
> Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
>   State: WAITING
>   Blocked count: 23843
>   Waited count: 45676
>   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     org.apache.hadoop.ipc.Client.call(Client.java:1220)
>     org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     java.lang.reflect.Method.invoke(Method.java:597)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
>     org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
>     org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
>     java.lang.Thread.run(Thread.java:662)
> DataNode RPC to Standby NameNode never times out.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
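The trace above shows the heartbeat thread parked in an untimed Object.wait(), which is why it never recovers. As a generic illustration of the desired behavior (bound each attempt, then retry), here is a minimal sketch using java.util.concurrent. It is not how Hadoop's IPC client implements timeouts; HeartbeatRetrySketch, heartbeatWithTimeout, and the maxAttempts cap are hypothetical names introduced for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Generic timeout-and-retry sketch; NOT Hadoop's IPC implementation.
class HeartbeatRetrySketch {
    // Runs `heartbeat` until it returns true, waiting at most timeoutMs
    // per attempt. Returns the 1-based attempt that succeeded, or -1 if
    // maxAttempts were used up or the caller was interrupted.
    static int heartbeatWithTimeout(Callable<Boolean> heartbeat,
                                    long timeoutMs, int maxAttempts) {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true); // do not keep the JVM alive for stuck calls
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Boolean> pending = pool.submit(heartbeat);
                try {
                    if (pending.get(timeoutMs, TimeUnit.MILLISECONDS)) {
                        return attempt;
                    }
                } catch (TimeoutException e) {
                    pending.cancel(true); // interrupt the stuck attempt
                } catch (ExecutionException e) {
                    // The attempt failed outright; fall through and retry.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a 14000 ms timeout, a heartbeat that blocks while the Standby NameNode is down would be abandoned after 14 seconds and retried, instead of waiting forever.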