[ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176421#comment-16176421
 ] 

Jason Lowe commented on YARN-7102:
----------------------------------

Not a fan of that approach either.  It has a corner case with the same issue, 
and it doesn't solve the following scenario which is probably already happening 
today even without this patch:
# NM heartbeats with response ID 1
# ResourceTrackerService responds with response ID 2, asynchronously posting 
the message with response ID 2 to RMNodeImpl
# RM is slow to process the asynchronous RMNodeStatusEvent containing the 
updated response with ID 2, so the RMNodeImpl still has the old lastResponse 
with response ID 1
# NM performs next heartbeat with response ID 2
# RM mistakenly believes this is a duplicate heartbeat message from the NM and 
*throws away the heartbeat update*
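
The staleness window in the steps above can be reproduced with a minimal 
sketch. It is not the actual ResourceTrackerService or RMNodeImpl code: the 
NodeState and Tracker classes, the event queue, and the "heartbeat ID + 1" 
response rule are hypothetical stand-ins taken from the numbers in the scenario, 
and overflow is ignored here.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class StaleLastResponseSketch {
    /** Hypothetical stand-in for the RM-side node state; not the real RMNodeImpl. */
    static final class NodeState {
        int lastResponseId;
        NodeState(int lastResponseId) { this.lastResponseId = lastResponseId; }
    }

    /** Hypothetical stand-in for the tracker service and its asynchronous event queue. */
    static final class Tracker {
        final NodeState node;
        final Queue<Runnable> pendingEvents = new ArrayDeque<>();

        Tracker(NodeState node) { this.node = node; }

        /** Answers a heartbeat and only queues the node update (assumed rule: ID + 1). */
        int heartbeat(int heartbeatResponseId) {
            int newId = heartbeatResponseId + 1;
            // The node state changes later, when the queued event is processed.
            pendingEvents.add(() -> node.lastResponseId = newId);
            return newId;
        }

        /** Simulates the dispatcher finally catching up on queued events. */
        void drainEvents() {
            Runnable event;
            while ((event = pendingEvents.poll()) != null) {
                event.run();
            }
        }
    }

    public static void main(String[] args) {
        NodeState node = new NodeState(1);
        Tracker tracker = new Tracker(node);

        int response = tracker.heartbeat(1);          // NM heartbeats with ID 1
        System.out.println("response ID sent to NM: " + response);
        // The next heartbeat (ID 2) can arrive here, before the event is processed.
        System.out.println("node state seen by next heartbeat: " + node.lastResponseId);

        tracker.drainEvents();
        System.out.println("node state after event processing: " + node.lastResponseId);
    }
}
```

Any duplicate check that runs between heartbeat() and drainEvents() compares 
the incoming ID against the stale value, which is the window described above.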

ResourceTrackerService is already calling RMNode synchronously in two places 
with the updated response, specifically 
RMNode#updateNodeHeartbeatResponseForCleanup and 
RMNode#updateNodeHeartbeatResponseForUpdatedContainers, so there are two 
opportunities for the RMNode to update its tracking of the proper last response 
ID before the heartbeat response is sent to the NM.  Therefore we have at least 
two existing opportunities to get this right even when the response ID wraps 
and also solve the race condition above.
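
As a rough sketch of that idea, the node state could record the new response 
ID synchronously, before the response is returned to the NM. None of this is 
the real RMNode API: NodeState, recordResponseId, and heartbeat are 
hypothetical names, and the wrap rule (mask with Integer.MAX_VALUE so the ID 
goes from MAX_INT back to 0) is an assumption made for illustration.

```java
public class SyncResponseIdSketch {
    /** Assumed wrap rule: after Integer.MAX_VALUE the next ID is 0, never negative. */
    static int nextResponseId(int id) {
        return (id + 1) & Integer.MAX_VALUE;
    }

    /** Hypothetical stand-in for the RM-side node state; not the real RMNode. */
    static final class NodeState {
        private int lastResponseId;

        NodeState(int lastResponseId) { this.lastResponseId = lastResponseId; }

        synchronized void recordResponseId(int id) { lastResponseId = id; }

        synchronized int getLastResponseId() { return lastResponseId; }
    }

    /** Computes the next response ID and records it before returning it. */
    static int heartbeat(NodeState node) {
        int newId = nextResponseId(node.getLastResponseId());
        // Synchronous update: the next heartbeat cannot observe the old ID.
        node.recordResponseId(newId);
        return newId;
    }

    public static void main(String[] args) {
        NodeState node = new NodeState(Integer.MAX_VALUE - 1);
        for (int i = 0; i < 3; i++) {
            int sent = heartbeat(node);
            System.out.println("sent " + sent + ", recorded " + node.getLastResponseId());
        }
    }
}
```

Because the recorded ID and the ID sent to the NM are always the same value, 
the comparison on the next heartbeat sees consistent data across the wrap as 
well.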


> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102.v1.patch, YARN-7102.v2.patch, 
> YARN-7102.v3.patch, YARN-7102.v4.patch, YARN-7102.v5.patch, YARN-7102.v6.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same as the 
> AM-RM heartbeat issue in YARN-6640; please refer to YARN-6640 for details. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
