[ https://issues.apache.org/jira/browse/YARN-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234235#comment-16234235 ]
Jason Lowe commented on YARN-7102:
----------------------------------

bq. it is indeed a race condition between node heartbeat vs node remove and add. The correct fix is for TestResourceTrackerService.testReconnect to create MockNM by calling MockRM.registerNode, in which a RM drain is called before return.

I do not follow the logic here. This looks like a race condition that could happen outside the unit tests as well, so we need more than a unit test update to address it.

The problem is that both heartbeat processing and node reconnect processing can modify the response ID. One of them is processed synchronously and the other isn't, so heartbeats can race ahead of the reconnect. That needs to be fixed.

One way to address it is to move at least part of the reconnect logic to be processed synchronously in ResourceTrackerService. It seems that, at a minimum, we need to know which RMNodeImpl we're going with so we can get the right response ID tracked for the next heartbeat from the node. That way, even if the heartbeat arrives before the reconnect event asynchronously arrives at RMNodeImpl, we have the proper response ID in place to handle the heartbeat correctly.

> NM heartbeat stuck when responseId overflows MAX_INT
> ----------------------------------------------------
>
>                 Key: YARN-7102
>                 URL: https://issues.apache.org/jira/browse/YARN-7102
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Critical
>         Attachments: YARN-7102-branch-2.8.v10.patch,
> YARN-7102-branch-2.8.v11.patch, YARN-7102-branch-2.8.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102-branch-2.v9.patch,
> YARN-7102-branch-2.v9.patch, YARN-7102.v1.patch, YARN-7102.v12.patch,
> YARN-7102.v2.patch, YARN-7102.v3.patch, YARN-7102.v4.patch,
> YARN-7102.v5.patch, YARN-7102.v6.patch, YARN-7102.v7.patch,
> YARN-7102.v8.patch, YARN-7102.v9.patch
>
>
> ResponseId overflow problem in NM-RM heartbeat.
> This is the same as the AM-RM heartbeat problem in YARN-6640; please refer to YARN-6640 for details.
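
To make the suggestion in the comment concrete, here is a minimal sketch of tracking the response ID synchronously, so that a heartbeat arriving ahead of the asynchronous reconnect event still sees the right value. Everything in it is an assumption for illustration: ResponseIdTracker, onRegisterOrReconnect, onHeartbeat and nextResponseId are hypothetical names, not code from ResourceTrackerService or from any of the attached patches, and the duplicate-heartbeat handling of the real RM is left out. The wrap-around in nextResponseId is one possible way to avoid the MAX_INT overflow named in the issue title.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper for illustration only; not part of the YARN code base. */
final class ResponseIdTracker {
  // Last response ID issued to each node, keyed by a node identifier string.
  private final ConcurrentMap<String, Integer> lastResponseId =
      new ConcurrentHashMap<>();

  /** Next ID, wrapping from Integer.MAX_VALUE back to 0 instead of going negative. */
  static int nextResponseId(int id) {
    return (id + 1) & Integer.MAX_VALUE;
  }

  /**
   * Assumed to be called directly in the register/reconnect RPC path,
   * before any asynchronous reconnect event is dispatched.
   */
  void onRegisterOrReconnect(String nodeId) {
    lastResponseId.put(nodeId, 0);
  }

  /**
   * Heartbeat handling. Returns the new response ID, or -1 when the node is
   * unknown or its ID does not match (the caller would then ask for a resync).
   */
  int onHeartbeat(String nodeId, int idFromNode) {
    int[] result = {-1};
    // computeIfPresent performs the compare-and-advance atomically per key.
    lastResponseId.computeIfPresent(nodeId, (key, last) -> {
      if (idFromNode == last) {
        int next = nextResponseId(last);
        result[0] = next;
        return next;
      }
      return last; // mismatch: leave the tracked ID unchanged
    });
    return result[0];
  }
}
```

Because the reset happens in the same synchronous path that handles heartbeats, a heartbeat that follows a reconnect is checked against the reset ID regardless of when the reconnect event reaches RMNodeImpl.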