[ 
https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13748469#comment-13748469
 ] 

Rohith Sharma K S commented on YARN-1061:
-----------------------------------------

I added all the IPC configurations to the log4j.properties file, but the same 
issue still recurred.

bq. How can NM wait infinitely? I mean what is your connection timeout set to? 
When I debugged the issue, I found that it is a problem in the IPC layer. The 
same problem occurs in DataNode to NameNode communication as well.

When the server process is in the T state (for a running process the state is 
S1; this can be seen with "ps -p <pid> -o pid,stat"), i.e. the process has been 
stopped using "kill -stop <pid>", the IPC proxy does not throw any timeout 
exception.
This is because, during proxy creation, the RPC timeout is set to zero 
(hardcoded) in the RPC.waitForProtocolProxy method. With the RPC timeout set to 
zero, the IPC call never throws a timeout exception. Instead, each time the 
socket read times out, the client retries by calling sendPing to the server (RM).
This can be seen in the Client.handleTimeout method:
{noformat}
      private void handleTimeout(SocketTimeoutException e) throws IOException {
        if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
          throw e;
        } else {
          sendPing();
        }
      }
{noformat}

I think the RPC timeout should be taken from the configuration instead of being 
hardcoded to 0.
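
To illustrate the effect of the condition in handleTimeout, here is a minimal stand-alone sketch. It is not Hadoop code: the class TimeoutPolicyDemo, the pingsSent counter and the callerSeesTimeout helper are hypothetical names introduced only for this example, and sendPing() is replaced by a simple counter. It only models the decision shown in the snippet above, assuming the server never answers and every socket read times out.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical model of the timeout decision; not the actual Hadoop Client. */
class TimeoutPolicyDemo {
    private final AtomicBoolean shouldCloseConnection = new AtomicBoolean(false);
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final int rpcTimeout;
    private int pingsSent = 0; // stands in for real sendPing() calls

    TimeoutPolicyDemo(int rpcTimeout) {
        this.rpcTimeout = rpcTimeout;
    }

    /** Same condition as in the handleTimeout snippet quoted above. */
    void handleTimeout(SocketTimeoutException e) throws SocketTimeoutException {
        if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
            throw e;      // the caller sees the timeout
        } else {
            pingsSent++;  // rpcTimeout == 0: swallow the timeout and ping again
        }
    }

    /**
     * Simulates `reads` consecutive socket read timeouts against a server that
     * never answers. Returns true if the caller ever sees an exception.
     */
    static boolean callerSeesTimeout(int rpcTimeout, int reads) {
        TimeoutPolicyDemo client = new TimeoutPolicyDemo(rpcTimeout);
        for (int i = 0; i < reads; i++) {
            try {
                client.handleTimeout(new SocketTimeoutException("read timed out"));
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
        return false; // every timeout was turned into another ping
    }

    public static void main(String[] args) {
        System.out.println("rpcTimeout=0     -> exception seen: "
                + callerSeesTimeout(0, 1000));
        System.out.println("rpcTimeout=60000 -> exception seen: "
                + callerSeesTimeout(60000, 1000));
    }
}
```

In this model, a zero rpcTimeout means that no number of read timeouts ever reaches the caller, while any positive value surfaces the first one. That is why a configurable, non-zero timeout would let the NM fail instead of waiting forever.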
                
> NodeManager is indefinitely waiting for nodeHeartBeat() response from 
> ResouceManager.
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-1061
>                 URL: https://issues.apache.org/jira/browse/YARN-1061
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.0.5-alpha
>            Reporter: Rohith Sharma K S
>
> It is observed that, in one scenario, the NodeManager waits indefinitely 
> for the nodeHeartbeat response from the ResourceManager when the 
> ResourceManager is in a hung state.
> The NodeManager should get a timeout exception instead of waiting indefinitely.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
