[ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234235#comment-16234235
 ] 

Jason Lowe commented on YARN-7102:
----------------------------------

bq. it is indeed a race condition between node heartbeat vs node remove and 
add. The correct fix is for TestResourceTrackerService.testReconnect to create 
MockNM by calling MockRM.registerNode, in which a RM drain is called before 
return.

I do not follow the logic here.  This looks like a race condition that could 
happen outside the unit tests as well, so we need more than a unit test update 
to address it.  The problem is that both heartbeat processing and node 
reconnect processing can modify the response ID.  One of them is processed 
synchronously and the other isn't, so heartbeats can race ahead of the 
reconnect.  That needs to be fixed.

One way to address it is to move at least part of the reconnect logic to be 
processed synchronously in ResourceTrackerService.  Seems minimally we need to 
know which RMNodeImpl we're going with so we can get the right response ID 
tracked for the next heartbeat from the node.  That way even if the heartbeat 
arrives before the reconnect event asynchronously arrives at RMNodeImpl we have 
the proper response ID in place to handle the heartbeat correctly.
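
To illustrate the idea, here is a rough sketch using a simplified model rather 
than the actual ResourceTrackerService/RMNodeImpl code.  The class 
ReconnectSketch, its NodeState holder, the -1 "resync" return value and the 
wrap-around mask are all assumptions made for the example.  The point is only 
that the response ID is reset in the synchronous registration path, so a 
heartbeat that beats the asynchronous reconnect event still sees the right 
value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical model of the proposal; not the real YARN classes. */
public class ReconnectSketch {

    /** Stand-in for the per-node state the RM tracks (RMNodeImpl in YARN). */
    static final class NodeState {
        int lastResponseId; // guarded by the NodeState monitor

        NodeState(int lastResponseId) {
            this.lastResponseId = lastResponseId;
        }
    }

    private final Map<String, NodeState> nodes = new ConcurrentHashMap<>();

    /**
     * Synchronous part of register/reconnect: decide which node state we are
     * going with and reset its response ID before replying to the NM.
     * Anything else the reconnect needs could still be dispatched as an
     * asynchronous event afterwards (omitted here).
     */
    public void registerNode(String nodeId) {
        nodes.put(nodeId, new NodeState(0));
    }

    /**
     * Handles a heartbeat. Returns the next response ID, or -1 (an assumed
     * sentinel) when the node is unknown or out of sync and must resync.
     */
    public int heartbeat(String nodeId, int responseIdFromNode) {
        NodeState node = nodes.get(nodeId);
        if (node == null) {
            return -1;
        }
        synchronized (node) {
            if (responseIdFromNode != node.lastResponseId) {
                return -1;
            }
            // Assumed wrap-around policy: stay non-negative past MAX_INT.
            node.lastResponseId = (node.lastResponseId + 1) & Integer.MAX_VALUE;
            return node.lastResponseId;
        }
    }

    public static void main(String[] args) {
        ReconnectSketch rm = new ReconnectSketch();
        rm.registerNode("nm1");
        System.out.println(rm.heartbeat("nm1", 0)); // normal heartbeat
        System.out.println(rm.heartbeat("nm1", 1)); // normal heartbeat

        // The NM restarts and reconnects; its first heartbeat carries ID 0.
        rm.registerNode("nm1");
        System.out.println(rm.heartbeat("nm1", 0)); // accepted, not a resync
    }
}
```

If registerNode only queued an event and left the old state in place, the last 
heartbeat in main would be compared against the stale response ID and 
rejected, which is the race described above.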


> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch, 
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch, 
> YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch, 
> YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch, 
> YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM 
> heartbeat in YARN-6640, please refer to YARN-6640 for details. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
