[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2015-05-01 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523667#comment-14523667
 ] 

Xuan Gong commented on YARN-1061:
-

Closing this as a duplicate. Let us track the issue in YARN-2578.

 NodeManager is indefinitely waiting for nodeHeartBeat() response from 
 ResouceManager.
 -

 Key: YARN-1061
 URL: https://issues.apache.org/jira/browse/YARN-1061
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith

 It was observed that, in one scenario, the NodeManager waits indefinitely 
 for the nodeHeartbeat response from the ResourceManager when the 
 ResourceManager is hung.
 The NodeManager should get a timeout exception instead of waiting indefinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2014-11-22 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14221906#comment-14221906
 ] 

Rohith commented on YARN-1061:
--

[~wilfreds] Thanks for pointing out the issue in RM HA. Yes, this is the same 
as YARN-2578. This issue was found before RM HA was implemented, but it has now 
become a very serious problem in RM HA deployments.




[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2014-10-06 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161160#comment-14161160
 ] 

Wilfred Spiegelenburg commented on YARN-1061:
-

This is a duplicate of YARN-2578. Writes do not time out, and they should.



[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748469#comment-13748469
 ] 

Rohith Sharma K S commented on YARN-1061:
-

I added all the IPC configurations to the log4j.properties file, but the same 
issue recurred.

bq. How can NM wait infinitely? I mean what is your connection timeout set to? 
When I debugged the issue, I found that the problem is in the IPC layer. The 
same problem also occurs in DataNode-to-NameNode communication.

When the process is in the T state (for a running process the state is S1; this 
can be seen with ps -p PID -o pid,stat), i.e. the process has been stopped using 
kill -STOP PID, the IPC proxy does not throw any timeout exception.
This is because, during proxy creation, the RPC timeout is set to zero 
(hardcoded) in the RPC.waitForProtocolProxy method. With the RPC timeout set to 
zero, the IPC call never throws a timeout exception. Instead, the client keeps 
retrying by sending a ping (sendPing) to the server (RM).
This can be seen in the Client.handleTimeout method:
{noformat}
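  private void handleTimeout(SocketTimeoutException e) throws IOException {
    // Propagate the timeout only if the connection is closing, the client
    // has stopped, or an RPC timeout is configured; otherwise ping and retry.
    if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
      throw e;
    } else {
      sendPing();
    }
  }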
{noformat}

I think RPC timeout should be taken from configurations instead of hardcoding 
to 0.
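
To make the effect of the hardcoded zero concrete, here is a minimal standalone sketch of the decision made in handleTimeout. It does not use any Hadoop classes. The names TimeoutPolicySketch, shouldRethrow and pingsBeforeGivingUp, and the maxSocketTimeouts cap, are hypothetical and exist only for this illustration. The cap is there solely so that the demo terminates.

```java
public class TimeoutPolicySketch {
    // Mirrors the condition quoted above: rethrow only when the connection
    // is closing, the client has stopped, or an RPC timeout is configured.
    public static boolean shouldRethrow(boolean closing, boolean running, int rpcTimeout) {
        return closing || !running || rpcTimeout > 0;
    }

    // Simulates a server that never answers: every read attempt ends in a
    // socket timeout. Returns the number of pings sent before the timeout
    // would be propagated to the caller, or -1 if it was never propagated
    // within maxSocketTimeouts attempts (the "indefinite wait" case).
    public static int pingsBeforeGivingUp(int rpcTimeout, int maxSocketTimeouts) {
        int pings = 0;
        for (int i = 0; i < maxSocketTimeouts; i++) {
            if (shouldRethrow(false, true, rpcTimeout)) {
                return pings; // the caller would see SocketTimeoutException here
            }
            pings++; // stands in for sendPing()
        }
        return -1; // still pinging: the caller is never told about the timeout
    }

    public static void main(String[] args) {
        System.out.println("rpcTimeout=0     -> " + pingsBeforeGivingUp(0, 1000));
        System.out.println("rpcTimeout=60000 -> " + pingsBeforeGivingUp(60000, 1000));
    }
}
```

With rpcTimeout set to 0 the sketch never reaches the rethrow branch and returns -1, which corresponds to the NodeManager waiting forever. With any positive value the first socket timeout is propagated and the result is 0 pings.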

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737990#comment-13737990
 ] 

Rohith Sharma K S commented on YARN-1061:
-

The thread dump extracted from the NodeManager is:

{noformat}
Node Status Updater prio=10 tid=0x414dc000 nid=0x1d754 in Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client.call(Client.java:1231)
    - locked 0xdef4f158 (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
    at $Proxy28.nodeHeartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
    at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
    at $Proxy30.nodeHeartbeat(Unknown Source)
    at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}



[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Omkar Vinit Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13738601#comment-13738601
 ] 

Omkar Vinit Joshi commented on YARN-1061:
-

Are you able to reproduce this scenario? Can you please enable DEBUG logs 
(HADOOP_ROOT_LOGGER and YARN_ROOT_LOGGER) and attach them to this JIRA? How 
big is your cluster? What is the frequency at which the NodeManagers are 
heartbeating? Can you also attach yarn-site.xml? Which version are you using?



[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

2013-08-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13739178#comment-13739178
 ] 

Rohith Sharma K S commented on YARN-1061:
-

I hit the actual issue in a 5-node cluster (1 RM and 5 NM). It is hard to 
reproduce a hung ResourceManager in a real cluster.

The same scenario can be simulated manually by bringing the ResourceManager 
into a hung state with the Linux command kill -STOP RM_PID. All NM-to-RM calls 
then wait indefinitely. Another case where the indefinite wait can be observed 
is adding a new NodeManager while the ResourceManager is hung.
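
The behaviour asked for in this issue (a timeout exception rather than an indefinite wait) can be sketched without a cluster. The following is only an illustration under stated assumptions and is not Hadoop code: a task blocked on a latch that is never released stands in for the stopped ResourceManager, and the caller bounds its wait with Future.get and a timeout. HungPeerSketch, callWithDeadline and timesOut are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HungPeerSketch {
    // Runs the call on a worker thread and waits at most timeoutMs for it.
    // Returns the result, or throws TimeoutException if no answer arrives.
    public static <T> T callWithDeadline(Callable<T> call, long timeoutMs)
            throws TimeoutException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            future.cancel(true); // interrupt the worker if it is still blocked
            pool.shutdownNow();
        }
    }

    // Convenience wrapper for the demo: true if the call timed out.
    public static boolean timesOut(Callable<String> call, long timeoutMs) {
        try {
            callWithDeadline(call, timeoutMs);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        Callable<String> hungHeartbeat = () -> {
            neverReleased.await(); // blocks like a call to a stopped peer
            return "response";
        };
        Callable<String> healthyHeartbeat = () -> "response";
        System.out.println("hung peer timed out: " + timesOut(hungHeartbeat, 200));
        System.out.println("healthy peer timed out: " + timesOut(healthyHeartbeat, 200));
    }
}
```

The design point is that the deadline is enforced on the caller's side, so the caller makes progress even when the peer never answers. The 200 ms value is arbitrary and chosen only to keep the demo short.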


