[ 
https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227297#comment-16227297
 ] 

Botong Huang commented on YARN-7102:
------------------------------------

[~jlowe] you are right, it is indeed a race condition between the node 
heartbeat and the node remove and add. The correct fix is for 
{{TestResourceTrackerService.testReconnect}} to create the {{MockNM}} by 
calling {{MockRM.registerNode}}, which drains the RM before returning. Since 
this is also in trunk, I am uploading YARN-7102.v12.patch for trunk, fixing 
only this unit test. 

1. Existing code in MockRM assumes the sequence registerNode, drain, and only 
then start heartbeating. Besides many unit tests (discussed in 3 below), a real 
cluster can violate this assumption when the RM is very slow. I am not sure to 
what extent we need this assumption, or whether it is acceptable to remove it 
and let the RM process NM registration synchronously. 

2. If we keep assumption 1, then my patch makes it more important, because the 
RM no longer accepts a responseId bigger than expected. When the assumption is 
violated, heartbeats will trigger more NM resyncs. 

3. Many places in the existing test code call the MockNM constructor, then 
registerNode, and start heartbeating without draining MockRM, thus violating 
the assumption. Depending on the use case, most of them might still be fine 
with my patch in, but some become flaky (e.g. 
{{TestResourceTrackerService.testReconnect}}). I can make a more careful pass 
over them if needed. My worry is that such a nuance will result in new flaky 
unit tests in the future. Alternatively, we could enforce assumption 1 
everywhere somehow. 
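
To make point 2 concrete, here is a minimal, self-contained sketch of a strict 
responseId check with wraparound at MAX_INT. It is not the code from the patch: 
the class {{ResponseIdCheck}}, the {{Action}} enum and both method names are 
hypothetical and only illustrate the idea. A heartbeat whose id is neither the 
expected one nor exactly one behind (which includes any id bigger than 
expected) falls into the resync branch.

```java
// Hypothetical sketch; names are illustrative and not taken from the patch.
public class ResponseIdCheck {
  enum Action { PROCESS, RESEND_LAST_RESPONSE, RESYNC }

  // Next id, wrapping from Integer.MAX_VALUE back to 0 instead of going negative.
  static int nextResponseId(int responseId) {
    return (responseId + 1) & Integer.MAX_VALUE;
  }

  // lastResponseId: id of the last response the RM sent to this node.
  // receivedId: id the NM echoes back in its heartbeat.
  static Action check(int lastResponseId, int receivedId) {
    if (receivedId == lastResponseId) {
      return Action.PROCESS;               // in sync, handle the heartbeat
    }
    if (nextResponseId(receivedId) == lastResponseId) {
      return Action.RESEND_LAST_RESPONSE;  // NM is one behind: duplicate heartbeat
    }
    return Action.RESYNC;                  // anything else, including a bigger id
  }

  public static void main(String[] args) {
    System.out.println(check(5, 5));
    System.out.println(check(5, 4));
    System.out.println(check(5, 7));
    System.out.println(check(0, Integer.MAX_VALUE));
  }
}
```

In this sketch, a heartbeat that races ahead of the RM's view of the node 
lands in the last branch, which is why violating assumption 1 shows up as 
extra resyncs rather than as a silently accepted heartbeat.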

Please let me know what you think. Thanks! 

> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch, 
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch, 
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch, 
> YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch, 
> YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch, 
> YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in the NM-RM heartbeat. This is the same as the 
> AM-RM heartbeat problem in YARN-6640; please refer to YARN-6640 for details. 


