[ 
https://issues.apache.org/jira/browse/YARN-5197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-5197:
-----------------------------
    Attachment: YARN-5197.001.patch

RMNodeImpl checks the list of running containers reported by the node against 
launchedContainers but not vice-versa, so containers that disappear from the 
node's reports are not detected.  Here's a patch that detects when the RM 
thinks there are more containers running on the node than were reported and 
finds the containers that are lost.  Each lost container generates a 
corresponding aborted completion event for the scheduler.  The search for lost 
containers is only performed when that count mismatch indicates one should be 
found, so it's low cost for the normal case.
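
To illustrate the idea, here is a minimal sketch of the reverse check.  It is 
not code from the patch: LostContainerSketch, onHeartbeat and the use of plain 
String container IDs are all hypothetical stand-ins, and the real RMNodeImpl 
keeps more state than a simple set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the RM-side view of one node; not the real RMNodeImpl.
final class LostContainerSketch {
    // Container IDs the RM believes are running (stand-in for launchedContainers).
    private final Set<String> launchedContainers = new HashSet<>();

    /**
     * Handles one heartbeat and returns the IDs of containers considered lost,
     * i.e. tracked as launched but neither reported running nor completed.
     */
    List<String> onHeartbeat(List<String> running, List<String> completed) {
        Set<String> reportedRunning = new HashSet<>(running);

        // Forward direction: track newly reported containers, drop completed ones.
        launchedContainers.addAll(reportedRunning);
        launchedContainers.removeAll(completed);

        List<String> lost = new ArrayList<>();
        // Cheap size comparison first; only scan when something must be missing.
        if (launchedContainers.size() > reportedRunning.size()) {
            for (String id : launchedContainers) {
                if (!reportedRunning.contains(id)) {
                    lost.add(id);
                }
            }
            launchedContainers.removeAll(lost);
            Collections.sort(lost); // deterministic order for the caller
        }
        // A real RM would emit an aborted completion event per lost container here.
        return lost;
    }

    public static void main(String[] args) {
        LostContainerSketch node = new LostContainerSketch();
        node.onHeartbeat(List.of("c1", "c2"), List.of());
        // c2 silently vanishes from the next heartbeat and is reported as lost.
        System.out.println(node.onHeartbeat(List.of("c1"), List.of()));
    }
}
```

In the common case the two sizes match and the scan is skipped entirely, which 
is what keeps the check cheap.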

I updated MockNM as part of this patch since lots of tests were getting away 
with lazy mocking of a real NM.  They were only specifying container state 
deltas in the heartbeat and sending empty heartbeats in between those state 
changes.  With this patch, the RM interprets those empty heartbeats as a loss 
of all actively running containers, which breaks those tests.  The patch therefore 
also updates MockNM to track containers and continue reporting them until they 
have been marked completed just like a real node should.  That was simpler to 
do than update all the users of MockNM to maintain their list of active 
container statuses explicitly.
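
A rough sketch of that tracking behavior, under the assumption that a 
heartbeat can be modeled as a map of container ID to state.  
TrackingMockNodeSketch and its heartbeat method are hypothetical names for 
illustration and do not reflect the actual MockNM API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mock node that remembers containers between heartbeats; not the real MockNM.
final class TrackingMockNodeSketch {
    enum State { RUNNING, COMPLETE }

    // Containers still owed to the RM: running ones plus not-yet-reported completions.
    private final Map<String, State> tracked = new LinkedHashMap<>();

    /**
     * Merges the caller's state deltas into the tracked set and returns the full
     * status report for this heartbeat. Completed containers are reported once
     * and then forgotten; running containers keep being reported.
     */
    Map<String, State> heartbeat(Map<String, State> deltas) {
        tracked.putAll(deltas);
        Map<String, State> report = new LinkedHashMap<>(tracked);
        // Drop completions after they have been included in a report.
        tracked.values().removeIf(state -> state == State.COMPLETE);
        return report;
    }

    public static void main(String[] args) {
        TrackingMockNodeSketch nm = new TrackingMockNodeSketch();
        nm.heartbeat(Map.of("c1", State.RUNNING));
        // An "empty" heartbeat from the test still reports c1 as running.
        System.out.println(nm.heartbeat(Map.of()));
    }
}
```

With this shape, existing tests can keep passing only deltas, and the mock 
fills in the still-running containers on their behalf.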

> RM leaks containers if running container disappears from node update
> --------------------------------------------------------------------
>
>                 Key: YARN-5197
>                 URL: https://issues.apache.org/jira/browse/YARN-5197
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2, 2.6.4
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-5197.001.patch
>
>
> Once a node reports a container running in a status update, the corresponding 
> RMNodeImpl will track the container in its launchedContainers map.  If the 
> node somehow misses sending the completed container status to the RM and the 
> container simply disappears from subsequent heartbeats, the container will 
> leak in launchedContainers forever and the container completion event will 
> not be sent to the scheduler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
