[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry Wang updated HDFS-4858:
    Attachment: HDFS-4858.patch

HDFS DataNode to NameNode RPC should timeout

                 Key: HDFS-4858
                 URL: https://issues.apache.org/jira/browse/HDFS-4858
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
         Environment: Red Hat/CentOS 6.4 64-bit Linux
            Reporter: Jagane Sundar
            Assignee: Konstantin Boudnik
            Priority: Minor
             Fix For: 3.0.0, 2.3.0
         Attachments: HDFS-4858.patch, HDFS-4858.patch, HDFS-4858.patch

The DataNode is configured with ipc.client.ping set to false and ipc.ping.interval set to 14000. With this configuration, the IPC client (the DataNode, in this case) should time out after 14000 milliseconds (14 seconds) if the Standby NameNode does not respond to a sendHeartbeat.

What we observe instead: if the Standby NameNode reboots for any reason, the DataNodes heartbeating to that Standby get stuck forever in sendHeartbeat (see the stack trace below). When the Standby NameNode comes back up, the DataNode never re-registers with it, and failover thereafter fails completely.

The desired behavior is that the DataNode's sendHeartbeat should time out after 14 seconds and keep retrying until the Standby NameNode comes back up; the DataNode should then reconnect, re-register, and offer service. Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the method createNamenode should use RPC.getProtocolProxy rather than RPC.getProxy to create the DatanodeProtocolPB object, so that an RPC timeout is actually applied to the call.
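As a sketch of the setup described above (property names and values are taken from the report; file placement is the usual Hadoop convention), the relevant core-site.xml entries would look like this:

```xml
<!-- Disable the IPC keep-alive ping so that ipc.ping.interval
     acts as the client-side RPC timeout instead. -->
<property>
  <name>ipc.client.ping</name>
  <value>false</value>
</property>
<property>
  <name>ipc.ping.interval</name>
  <value>14000</value> <!-- milliseconds, i.e. a 14-second timeout -->
</property>
```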
Stack trace of the thread stuck in the DataNode after the Standby NN has rebooted:

Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
  State: WAITING
  Blocked count: 23843
  Waited count: 45676
  Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
  Stack:
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    org.apache.hadoop.ipc.Client.call(Client.java:1220)
    org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
    org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
    org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
    java.lang.Thread.run(Thread.java:662)

DataNode RPC to Standby NameNode never times out.

--
This message was sent by Atlassian JIRA (v6.1.5#6160)
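The trace above shows the heartbeat thread parked in an untimed Object.wait() inside Client.call, which is why no amount of waiting unsticks it once the peer is gone. A minimal, self-contained illustration of the difference a timed wait makes (this is generic Java, not Hadoop code; WaitTimeoutDemo, pendingCall, and waitWithTimeout are illustrative names):

```java
public class WaitTimeoutDemo {
    // Stands in for an RPC call object whose response never arrives.
    static final Object pendingCall = new Object();

    static boolean waitWithTimeout(long millis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + millis;
        synchronized (pendingCall) {
            long remaining;
            // The loop guards against spurious wakeups; nothing ever
            // notifies pendingCall here, so we always run out the clock.
            while ((remaining = deadline - System.currentTimeMillis()) > 0) {
                pendingCall.wait(remaining);
            }
        }
        return System.currentTimeMillis() >= deadline; // timed out, not notified
    }

    public static void main(String[] args) throws InterruptedException {
        // An untimed pendingCall.wait() here would block forever, exactly
        // like the heartbeat thread parked in Object.wait() above. With a
        // timeout, the caller regains control and can retry or re-register.
        System.out.println(waitWithTimeout(100)); // prints: true
    }
}
```

This is the behavior the report asks for: a bounded wait that returns control to the DataNode so it can retry the heartbeat until the Standby comes back.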
Konstantin Shvachko updated HDFS-4858:
    Assignee: (was: Konstantin Boudnik)
Konstantin Shvachko updated HDFS-4858:
    Resolution: Fixed
    Fix Version/s: (was: 2.3.0) (was: 3.0.0) 2.4.0
    Hadoop Flags: Reviewed
    Status: Resolved (was: Patch Available)

I just committed this. Thank you Henry.
Konstantin Shvachko updated HDFS-4858:
    Assignee: Henry Wang
Konstantin Boudnik updated HDFS-4858:
    Status: Open (was: Patch Available)
Konstantin Boudnik updated HDFS-4858:
    Status: Patch Available (was: Open)

Retesting
Henry Wang updated HDFS-4858:
    Attachment: HDFS-4858.patch
Konstantin Boudnik updated HDFS-4858:
    Status: Open (was: Patch Available)
Konstantin Boudnik updated HDFS-4858:
    Status: Patch Available (was: Open)
Konstantin Boudnik updated HDFS-4858:
    Fix Version/s: 3.0.0
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Boudnik updated HDFS-4858: - Fix Version/s: 2.3.0
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Boudnik updated HDFS-4858: - Assignee: Jagane Sundar
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagane Sundar updated HDFS-4858: Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagane Sundar updated HDFS-4858: Attachment: HDFS-4858.patch
[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout
[ https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Konstantin Shvachko updated HDFS-4858: -- Target Version/s: 2.0.5-beta Fix Version/s: (was: 2.0.5-beta) (was: 3.0.0)
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira