[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523667#comment-14523667 ]

Xuan Gong commented on YARN-1061:

Closing this as a duplicate. Let us track the issue in YARN-2578.

NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.

Key: YARN-1061
URL: https://issues.apache.org/jira/browse/YARN-1061
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.5-alpha
Reporter: Rohith

It is observed that, in one scenario, the NodeManager waits indefinitely for the nodeHeartbeat response from the ResourceManager while the ResourceManager is in a hung state. The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221906#comment-14221906 ]

Rohith commented on YARN-1061:

[~wilfreds] Thanks for pointing out the issue in RM HA. Yes, this is the same as YARN-2578. This issue was found before RM HA was implemented, but it has now become a very serious problem in RM HA cases.
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161160#comment-14161160 ]

Wilfred Spiegelenburg commented on YARN-1061:

This is a dupe of YARN-2578. Writes do not time out and they should.
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748469#comment-13748469 ]

Rohith Sharma K S commented on YARN-1061:

I added all the ipc configurations to the log4j.properties file; the same issue still recurred.

bq. How can NM wait infinitely? I mean what is your connection timeout set to?

When I debugged the issue, I found that it is a problem in the IPC layer. The same problem occurs in DataNode to NameNode communication.

When the process is in the T state (for a running process the state is S1; this can be seen with ps -p pid -o pid,stat), i.e. the process has been stopped using kill -STOP pid, the ipc proxy does not throw any timeout exception. This is because, during proxy creation, the RPC timeout is set to zero (hardcoded) in the RPC.waitForProtocolProxy method. With the rpc timeout set to zero, the ipc call never throws a timeout exception; instead the client keeps retrying sendPing to the server (RM). This can be seen in the Client.handleTimeout method:

{noformat}
private void handleTimeout(SocketTimeoutException e) throws IOException {
  if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
    throw e;
  } else {
    sendPing();
  }
}
{noformat}

I think the RPC timeout should be taken from configuration instead of being hardcoded to 0.
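To make the effect of the quoted branch concrete, here is a minimal, self-contained sketch. It is not Hadoop code: the class PingOnTimeoutSketch, the method surfacesTimeout, and the ping counter are hypothetical stand-ins, and only the condition inside handleTimeout is taken from the snippet in the comment above. It feeds a fixed number of simulated read timeouts into that branch and reports whether any of them reaches the caller.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the connection state consulted by handleTimeout. */
class PingOnTimeoutSketch {
    private final AtomicBoolean shouldCloseConnection = new AtomicBoolean(false);
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final int rpcTimeout; // milliseconds; 0 models the hardcoded value
    private int pingsSent = 0;

    PingOnTimeoutSketch(int rpcTimeout) {
        this.rpcTimeout = rpcTimeout;
    }

    // Same condition as the quoted snippet; sendPing() is replaced by a counter.
    void handleTimeout(SocketTimeoutException e) throws IOException {
        if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
            throw e;
        } else {
            pingsSent++;
        }
    }

    /** Feeds n simulated read timeouts; returns true if one reached the caller. */
    boolean surfacesTimeout(int n) {
        for (int i = 0; i < n; i++) {
            try {
                handleTimeout(new SocketTimeoutException("simulated read timeout"));
            } catch (IOException e) {
                return true;
            }
        }
        return false;
    }

    int getPingsSent() {
        return pingsSent;
    }

    public static void main(String[] args) {
        PingOnTimeoutSketch hardcodedZero = new PingOnTimeoutSketch(0);
        PingOnTimeoutSketch configured = new PingOnTimeoutSketch(60_000);
        System.out.println("rpcTimeout=0 surfaces timeout: "
                + hardcodedZero.surfacesTimeout(5)
                + ", pings sent: " + hardcodedZero.getPingsSent());
        System.out.println("rpcTimeout=60000 surfaces timeout: "
                + configured.surfacesTimeout(5)
                + ", pings sent: " + configured.getPingsSent());
    }
}
```

With rpcTimeout at 0 every simulated timeout is absorbed as a ping, which matches the observation that the caller never sees an exception while the server is stopped. The 60000 ms value is only an illustrative non-zero setting, not a recommended default.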
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737990#comment-13737990 ]

Rohith Sharma K S commented on YARN-1061:

The thread dump extracted from the NodeManager is:

{noformat}
"Node Status Updater" prio=10 tid=0x414dc000 nid=0x1d754 in Object.wait() [0x7fefa2dec000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.hadoop.ipc.Client.call(Client.java:1231)
	- locked <0xdef4f158> (a org.apache.hadoop.ipc.Client$Call)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
	at $Proxy28.nodeHeartbeat(Unknown Source)
	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:70)
	at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
	at $Proxy30.nodeHeartbeat(Unknown Source)
	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:348)
{noformat}
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738601#comment-13738601 ]

Omkar Vinit Joshi commented on YARN-1061:

Are you able to reproduce this scenario? Can you please enable DEBUG logs (HADOOP_ROOT_LOGGER and YARN_ROOT_LOGGER) and attach them to this jira? How big is your cluster? How frequently are the nodemanagers heartbeating? Can you also attach yarn-site.xml? Which version are you using?
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResouceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13739178#comment-13739178 ]

Rohith Sharma K S commented on YARN-1061:

The actual issue occurred in a 5-node cluster (1 RM and 5 NMs). It is hard to reproduce a hung ResourceManager on a real cluster. The same scenario can be simulated by manually bringing the ResourceManager into a hung state with the Linux command kill -STOP RM_PID. All NM-to-RM calls then wait indefinitely.

Another case where the indefinite wait can be observed is adding a new NodeManager while the ResourceManager is in a hung state.
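The hung-peer condition described above can also be approximated without a cluster. The following is a rough sketch using plain java.net sockets rather than Hadoop RPC; the class HungServerDemo and the method readTimesOut are hypothetical names. It assumes that a listening socket which never calls accept() behaves, from the client's point of view, like a process stopped with kill -STOP: the operating system still completes the connection from the listen backlog, but no response ever arrives. A read with a positive socket timeout then fails with SocketTimeoutException, whereas a timeout of 0 would block with no upper bound.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo: a client reading from a peer that never answers. */
class HungServerDemo {

    /**
     * Connects to the given local port and attempts one read.
     * Returns true if the read timed out, false if data or EOF arrived.
     * soTimeoutMillis must be positive here; 0 would mean "wait forever".
     */
    static boolean readTimesOut(int port, int soTimeoutMillis) throws IOException {
        try (Socket client = new Socket(InetAddress.getLoopbackAddress(), port)) {
            client.setSoTimeout(soTimeoutMillis);
            try {
                client.getInputStream().read();
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The server socket listens but never accepts, standing in for a stopped RM.
        try (ServerSocket hung = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            boolean timedOut = readTimesOut(hung.getLocalPort(), 200);
            System.out.println("read timed out: " + timedOut);
        }
    }
}
```

The 200 ms timeout is arbitrary and chosen only to keep the demo short. The point of the sketch is that a bounded wait has to be requested explicitly by the client; nothing on the stopped side will end the call.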