[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-11 Thread Henry Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Wang updated HDFS-4858:
-

Attachment: HDFS-4858.patch

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Assignee: Konstantin Boudnik
Priority: Minor
 Fix For: 3.0.0, 2.3.0

 Attachments: HDFS-4858.patch, HDFS-4858.patch, HDFS-4858.patch


 The DataNode is configured with ipc.client.ping set to false and 
 ipc.ping.interval set to 14000. This configuration means that the IPC client 
 (the DataNode, in this case) should time out after 14000 milliseconds 
 (14 seconds) if the Standby NameNode does not respond to a sendHeartbeat call.
 What we observe is this: If the Standby NameNode happens to reboot for any 
 reason, the DataNodes that are heartbeating to this Standby get stuck forever 
 while trying to sendHeartbeat. See Stack trace included below. When the 
 Standby NameNode comes back up, we find that the DataNode never re-registers 
 with the Standby NameNode. Thereafter failover completely fails.
 The desired behavior is that the DataNode's sendHeartbeat should time out after 
 14 seconds and keep retrying until the Standby NameNode comes back up. When 
 it does, the DataNode should reconnect, re-register, and offer service.
 Specifically, in the class DatanodeProtocolClientSideTranslatorPB.java, the 
 method createNamenode should use RPC.getProtocolProxy and not RPC.getProxy to 
 create the DatanodeProtocolPB object.
 Stack trace of thread stuck in the DataNode after the Standby NN has rebooted:
 Thread 25 (DataNode: [file:///opt/hadoop/data] heartbeating to vmhost6-vm1/10.10.10.151:8020):
   State: WAITING
   Blocked count: 23843
   Waited count: 45676
   Waiting on org.apache.hadoop.ipc.Client$Call@305ab6c5
   Stack:
 java.lang.Object.wait(Native Method)
 java.lang.Object.wait(Object.java:485)
 org.apache.hadoop.ipc.Client.call(Client.java:1220)
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
 sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 java.lang.reflect.Method.invoke(Method.java:597)
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 sun.proxy.$Proxy10.sendHeartbeat(Unknown Source)
 org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:167)
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:445)
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:525)
 org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:676)
 java.lang.Thread.run(Thread.java:662)
 DataNode RPC to Standby NameNode never times out. 
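
 The desired behavior described above (time out, then retry until the peer 
 answers) can be sketched in plain Java. This is a minimal illustration under 
 stated assumptions, not the code from HDFS-4858.patch: PendingCall, Peer, and 
 heartbeatUntilAnswered are hypothetical names, and the Hadoop IPC client is 
 not used. The thread in the stack trace appears to be parked in an 
 Object.wait call with no limit; the sketch waits against a deadline instead.

```java
import java.util.concurrent.TimeoutException;

public class HeartbeatTimeoutSketch {
    /** Hypothetical stand-in for a pending RPC call; not Hadoop's Client.Call. */
    public static final class PendingCall {
        private boolean done;
        private String response;

        public synchronized void complete(String value) {
            response = value;
            done = true;
            notifyAll();
        }

        /** Waits at most timeoutMillis instead of calling wait() with no limit. */
        public synchronized String await(long timeoutMillis)
                throws InterruptedException, TimeoutException {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (!done) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    throw new TimeoutException("no response within " + timeoutMillis + " ms");
                }
                // Round up so we never call wait(0), which would block indefinitely.
                wait(remainingNanos / 1_000_000L + 1);
            }
            return response;
        }
    }

    /** Hypothetical heartbeat target, used to simulate a rebooting NameNode. */
    public interface Peer {
        PendingCall sendHeartbeat();
    }

    /** Retries until a heartbeat is answered; returns the attempt count, or -1. */
    public static int heartbeatUntilAnswered(Peer peer, long timeoutMillis, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                peer.sendHeartbeat().await(timeoutMillis);
                return attempt;
            } catch (TimeoutException e) {
                // Timed out: fall through and retry instead of blocking forever.
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The simulated peer ignores the first two heartbeats, then answers.
        Peer peer = () -> {
            PendingCall call = new PendingCall();
            if (++calls[0] >= 3) {
                call.complete("ack");
            }
            return call;
        };
        System.out.println(heartbeatUntilAnswered(peer, 50, 5));
    }
}
```

 With the simulated peer above, the first two heartbeats time out after 50 ms 
 each and the third is answered, so the heartbeat thread recovers once the peer 
 responds rather than staying blocked.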



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-11 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4858:
--

Assignee: (was: Konstantin Boudnik)



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-11 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4858:
--

   Resolution: Fixed
Fix Version/s: (was: 2.3.0)
   (was: 3.0.0)
   2.4.0
 Hadoop Flags: Reviewed
   Status: Resolved  (was: Patch Available)

I just committed this. Thank you Henry.



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-11 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4858:
--

Assignee: Henry Wang



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-05 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Status: Open  (was: Patch Available)



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-02-05 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Status: Patch Available  (was: Open)

Retesting



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-30 Thread Henry Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Wang updated HDFS-4858:
-

Attachment: HDFS-4858.patch



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-30 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Status: Open  (was: Patch Available)



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-30 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Status: Patch Available  (was: Open)



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-14 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Fix Version/s: 3.0.0



[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-14 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Fix Version/s: 2.3.0

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
 Fix For: 3.0.0, 2.3.0

 Attachments: HDFS-4858.patch




[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2014-01-14 Thread Konstantin Boudnik (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Boudnik updated HDFS-4858:
-

Assignee: Jagane Sundar

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Assignee: Jagane Sundar
Priority: Minor
 Fix For: 3.0.0, 2.3.0

 Attachments: HDFS-4858.patch




[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2013-07-24 Thread Jagane Sundar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagane Sundar updated HDFS-4858:


Status: Patch Available  (was: Open)

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 2.0.4-alpha, 3.0.0, 2.1.0-beta, 2.0.5-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
 Attachments: HDFS-4858.patch



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2013-07-24 Thread Jagane Sundar (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagane Sundar updated HDFS-4858:


Attachment: HDFS-4858.patch

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.1.0-beta, 2.0.4-alpha, 2.0.5-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
 Attachments: HDFS-4858.patch




[jira] [Updated] (HDFS-4858) HDFS DataNode to NameNode RPC should timeout

2013-05-28 Thread Konstantin Shvachko (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Shvachko updated HDFS-4858:
--

Target Version/s: 2.0.5-beta
   Fix Version/s: (was: 2.0.5-beta)
  (was: 3.0.0)

 HDFS DataNode to NameNode RPC should timeout
 

 Key: HDFS-4858
 URL: https://issues.apache.org/jira/browse/HDFS-4858
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.0.0, 2.0.5-beta, 2.0.4-alpha, 2.0.4.1-alpha
 Environment: Redhat/CentOS 6.4 64 bit Linux
Reporter: Jagane Sundar
Priority: Minor
