[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
[ https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated YARN-5197:
------------------------------------------
    Priority: Critical  (was: Major)

> RM leaks containers if running container disappears from node update
> ---------------------------------------------------------------------
>
>                 Key: YARN-5197
>                 URL: https://issues.apache.org/jira/browse/YARN-5197
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2, 2.6.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 2.8.0, 2.6.5, 2.7.4
>
>         Attachments: YARN-5197.001.patch, YARN-5197.002.patch,
> YARN-5197.003.patch, YARN-5197-branch-2.7.003.patch,
> YARN-5197-branch-2.8.003.patch
>
>
> Once a node reports a container running in a status update, the corresponding
> RMNodeImpl will track the container in its launchedContainers map. If the
> node somehow misses sending the completed container status to the RM and the
> container simply disappears from subsequent heartbeats, the container will
> leak in launchedContainers forever and the container completion event will
> not be sent to the scheduler.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
Junping Du updated YARN-5197:
-----------------------------
    Fix Version/s: 2.8.0
[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197-branch-2.7.003.patch
                YARN-5197-branch-2.8.003.patch

Thanks for the review and commit, Rohith! Here are patches for branch-2.8 and branch-2.7. I believe the 2.7 patch will work on 2.6 as well.
[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197.003.patch

Thanks for the review, Rohith! I updated the patch to add the GUARANTEED check in findLostContainers.
[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197.002.patch

Updated the patch for the checkstyle issue. The test failures are tracked by HADOOP-12687.
[jira] [Updated] (YARN-5197) RM leaks containers if running container disappears from node update
Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197.001.patch

RMNodeImpl checks the list of running containers on the node against launchedContainers but not vice versa, so containers that disappear on the node are not detected. Here's a patch that detects when the RM thinks there are more containers running on the node than were reported and finds the containers that are lost. Each lost container generates a corresponding aborted completion event for the scheduler. The search for lost containers runs only when that mismatch indicates one should be found, so it's low cost for the normal case.

I updated MockNM as part of this patch since lots of tests were getting away with lazy mocking of a real NM. They were only specifying container state deltas in the heartbeat and sending empty heartbeats in between those state changes. With this patch, the RM interprets those empty heartbeats as a loss of all actively running containers, which broke those tests. The patch therefore also updates MockNM to track containers and continue reporting them until they have been marked completed, just like a real node should. That was simpler than updating all the users of MockNM to maintain their list of active container statuses explicitly.
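
For readers following the thread, the detection described in the 001 patch comment can be illustrated with a small sketch. This is not the code from the attached patches: NodeTrackerSketch, handleHeartbeat and trackedCount are hypothetical names, container IDs are plain strings, and the GUARANTEED check mentioned for the 003 patch is left out. It only shows the idea of comparing what the RM tracks against what the node reports, and scanning for lost containers only when the counts disagree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical, simplified stand-in for the RM's per-node bookkeeping.
 * Container IDs are plain strings; this is not the RMNodeImpl API.
 */
final class NodeTrackerSketch {
  // Containers the RM believes are running on the node
  // (plays the role of launchedContainers in the issue description).
  private final Set<String> launchedContainers = new HashSet<>();

  /**
   * Processes one heartbeat and returns the IDs of containers that were
   * tracked as running but are absent from the report. A real RM would send
   * an aborted completion event to the scheduler for each returned ID.
   * Assumes each container ID appears at most once per heartbeat.
   */
  List<String> handleHeartbeat(List<String> running, List<String> completed) {
    // Explicitly completed containers are no longer tracked.
    launchedContainers.removeAll(completed);
    // Node-to-RM direction: track every container reported as running.
    launchedContainers.addAll(running);

    List<String> lost = new ArrayList<>();
    // Cheap size comparison first; the full scan runs only when the RM
    // tracks more containers than the node reported.
    if (launchedContainers.size() > running.size()) {
      Set<String> reported = new HashSet<>(running);
      Iterator<String> it = launchedContainers.iterator();
      while (it.hasNext()) {
        String id = it.next();
        if (!reported.contains(id)) {
          it.remove(); // stop tracking the leaked entry
          lost.add(id);
        }
      }
    }
    Collections.sort(lost); // deterministic order for callers
    return lost;
  }

  /** Number of containers still tracked as running. */
  int trackedCount() {
    return launchedContainers.size();
  }
}
```

In this sketch, a heartbeat that lists no running containers causes every tracked container to be returned as lost. That is the same effect the comment describes for tests that sent empty heartbeats between state changes, and it is why a mock node would need to keep reporting its containers until they are marked completed.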