[ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16227297#comment-16227297 ]
Botong Huang commented on YARN-7102:
------------------------------------

[~jlowe] you are right, it is indeed a race condition between a node heartbeat and a node remove-and-add. The correct fix is for {{TestResourceTrackerService.testReconnect}} to create its {{MockNM}} by calling {{MockRM.registerNode}}, which drains the RM before returning. Since this is also in trunk, I am uploading YARN-7102.v12.patch for trunk, fixing only this unit test.

1. The existing code in MockRM assumes the order registerNode, drain, then start heartbeat. Besides many unit tests (discussed in 3 below), a real cluster can violate this assumption when the RM is very slow. I am not sure to what extent we need this assumption, or whether it is acceptable to remove it and let the RM process NM registration synchronously.

2. If we keep assumption 1, then my patch makes the assumption more important, because the RM now doesn't accept a responseId bigger than expected. When the assumption is violated, heartbeats will trigger more NM resyncs.

3. Many places in the existing test code call the MockNM constructor, then registerNode, and start heartbeating without draining MockRM, thus violating the assumption. Depending on the use case, most of them might still be fine with my patch in, but some become flaky (e.g. {{TestResourceTrackerService.testReconnect}}). I can make a more careful pass over them if needed. My worry is that such a nuance will result in new flaky unit tests in the future. Alternatively, we could somehow enforce assumption 1 everywhere.

Please let me know what you think. Thanks!
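To make the ordering assumption above concrete, here is a minimal, self-contained Java sketch. It is a toy model, not the Hadoop code: {{HeartbeatSketch}}, {{ToyTracker}}, {{Action}} and {{nextId}} are hypothetical names, registration is modelled as a simple queue that only {{drain}} processes, and the wrap-around rule (ids go back to 0 after Integer.MAX_VALUE) is an assumption made for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class HeartbeatSketch {
    enum Action { NORMAL, RESEND_LAST, RESYNC }

    // Assumed wrap-around rule: the id after Integer.MAX_VALUE is 0, never negative.
    static int nextId(int id) {
        return (id + 1) & Integer.MAX_VALUE;
    }

    /** Toy stand-in for the RM side; hypothetical, not ResourceTrackerService. */
    static class ToyTracker {
        final Map<String, Integer> expectedId = new HashMap<>();
        final Queue<String> pending = new ArrayDeque<>();

        // Registration is only queued, mimicking asynchronous processing in the RM.
        void registerNode(String node) {
            pending.add(node);
        }

        // Process every queued registration; each one resets the expected id to 0.
        void drain() {
            String node;
            while ((node = pending.poll()) != null) {
                expectedId.put(node, 0);
            }
        }

        Action heartbeat(String node, int responseId) {
            Integer expected = expectedId.get(node);
            if (expected == null) {
                return Action.RESYNC;            // node not known yet
            }
            if (responseId == expected) {
                expectedId.put(node, nextId(expected));
                return Action.NORMAL;            // in-sequence heartbeat
            }
            if (nextId(responseId) == expected) {
                return Action.RESEND_LAST;       // duplicate of the previous heartbeat
            }
            return Action.RESYNC;                // any other id, bigger or smaller
        }
    }

    public static void main(String[] args) {
        ToyTracker rm = new ToyTracker();
        rm.registerNode("nm1");
        rm.drain();
        for (int id = 0; id < 3; id++) {
            System.out.println("heartbeat " + id + " -> " + rm.heartbeat("nm1", id));
        }
        // Reconnect: the node restarts its ids at 0, but the RM has not drained yet.
        rm.registerNode("nm1");
        System.out.println("before drain -> " + rm.heartbeat("nm1", 0));
        rm.drain();
        System.out.println("after drain  -> " + rm.heartbeat("nm1", 0));
        System.out.println("nextId(MAX_VALUE) = " + nextId(Integer.MAX_VALUE));
    }
}
```

In this model, a heartbeat sent after a re-registration but before the drain is checked against the stale expected id and comes back as a resync, while the same heartbeat after the drain is accepted. That is the kind of flakiness a test can hit when it skips the drain.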
> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch,
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch,
> YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch,
> YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch,
> YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat. This is same as AM-RM
> heartbeat in YARN-6640, please refer to YARN-6640 for details.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)