[ 
https://issues.apache.org/jira/browse/YARN-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated YARN-4148:
-----------------------------
    Attachment: free_in_scheduler_but_not_node_prototype-branch-2.7.patch

Sorry for joining the discussion late, as I missed this originally.  As I 
mentioned in YARN-5290, having the RM wait until the NM confirms container 
release can unnecessarily slow down subsequent allocations on other nodes due 
to scheduler limits (user limit, queue limit, etc.).  We could leverage some 
form of NM container queuing, but I agree it could be confusing when the AM 
launches a container and it doesn't appear to be active afterwards when querying the 
a container and it doesn't appear to be active afterwards when querying the 
node.

We could have the RM wait until it receives hard confirmation from the NM 
before it releases the resources associated with a container, but that would 
needlessly slow down scheduling in some cases. For example, if a user is at the 
scheduler user limit but releases a container on node A, I don't see why we 
have to wait until that container is confirmed dead over two subsequent NM 
heartbeats (one to tell the NM to shoot it and another to confirm it's dead) 
before allowing the user to allocate another container of the same size on node 
B. However, I do think it's bad for us to allocate the new container on the same 
node as the released one since we can accidentally overwhelm the node if the 
old container isn't cleaned up fast enough.
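
To put a rough number on that delay, here's a toy illustration (the 1-second NM 
heartbeat interval below is just an assumed example value, not a measured one):

{code:java}
// Toy illustration only: the 1-second NM heartbeat interval is an assumed
// example value, and this is not real RM/scheduler code.
public class ReleaseDelayExample {
  public static void main(String[] args) {
    final long heartbeatIntervalMs = 1000;
    // One heartbeat to tell the NM to kill the container, a second one to
    // learn that it's dead.
    long confirmLatencyMs = 2 * heartbeatIntervalMs;
    System.out.println("Headroom freed immediately on release:     0 ms");
    System.out.println("Headroom freed after NM confirmation: ~" + confirmLatencyMs
        + " ms per released container");
  }
}
{code}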

Therefore I propose that we go ahead and let the scheduler queues and user 
limit computations update immediately so other nodes can be scheduled, but we 
don't release the resources in the SchedulerNode itself until the node confirms 
a previously running container is dead. IMHO if the RM ever sees a container in 
the RUNNING state on a node, it should never think that node has freed the 
resources for that container until the node itself says that container has 
completed.

Here's a prototype patch against branch-2.7 that is similar to what we're using 
internally to work around this issue.  It goes ahead and releases the resources 
for running containers in the scheduler bookkeeping (i.e.: cluster resource, 
queues, user limits, etc.) but _not_ in the SchedulerNode.  So the RM could 
allocate those resources elsewhere but not on the current node until the node 
reports the container as completed.
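
To make that split concrete, here's a minimal toy sketch of the bookkeeping. The 
class and method names below are made up for illustration and are not the real 
CapacityScheduler/SchedulerNode APIs: queue/user-limit accounting is decremented 
as soon as the container is released, while per-node accounting is decremented 
only when the NM reports the container complete.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy model of "free in the scheduler but not in the node".  All names here
// are hypothetical and simplified; this is not the actual YARN scheduler code.
public class DeferredNodeRelease {

  static class Resource {
    int memoryMB, vcores;
    Resource(int m, int v) { memoryMB = m; vcores = v; }
    void add(Resource r)      { memoryMB += r.memoryMB; vcores += r.vcores; }
    void subtract(Resource r) { memoryMB -= r.memoryMB; vcores -= r.vcores; }
  }

  // Queue / user-limit side of the bookkeeping: updated immediately.
  static class QueueAccounting {
    Resource used = new Resource(0, 0);
    void allocate(Resource r) { used.add(r); }
    void release(Resource r)  { used.subtract(r); }  // headroom frees right away
  }

  // Per-node side of the bookkeeping: freed only on NM confirmation.
  static class NodeAccounting {
    Resource available;
    Map<String, Resource> running = new HashMap<>();  // containerId -> resource
    NodeAccounting(Resource total) { available = total; }
    void allocate(String containerId, Resource r) {
      available.subtract(r);
      running.put(containerId, r);
    }
    void containerCompleted(String containerId) {
      Resource r = running.remove(containerId);
      if (r != null) {
        available.add(r);  // the node frees the space only now
      }
    }
  }

  // RM kills a RUNNING container: queue/user accounting is released
  // immediately so other nodes can be scheduled, but the node keeps the
  // container charged until the NM heartbeat reports it completed, at which
  // point node.containerCompleted(containerId) would be invoked.
  static void killRunningContainer(String containerId, Resource r,
      QueueAccounting queue, NodeAccounting node) {
    queue.release(r);
    // deliberately NOT calling node.containerCompleted(containerId) here
  }
}
{code}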

NOTE: with any of these "wait until the node says the container is done" 
approaches it's important to get the fix for YARN-5197, otherwise if the NM 
ever skips sending a container completion event the RM will leak those 
resources on the node.

There is an interesting corner case where the RM has handed out a container to 
an AM (i.e.: container is in the ACQUIRED state) but it hasn't seen it running 
on a node yet. If the container is killed by the RM or AM, there's still a 
chance that the container could appear on the node after the RM has considered 
those resources freed. We'll have to decide how to handle that race. One way to 
solve it is to treat the container's resources as still "used" until the RM has 
had a chance to tell the NM that the container token for that container is no 
longer valid and has confirmed, in a subsequent NM heartbeat, that the 
container has not appeared since. Maybe there's a simpler/faster way to safely 
free the container's resources for that race condition?
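
For example, one possible shape for that handshake (again just a hypothetical 
sketch, not existing RM code) would be to keep the ACQUIRED container charged 
until the invalidated token has been delivered to the NM and a later heartbeat 
confirms the container never appeared:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch (not existing RM classes): a container handed to the AM
// (ACQUIRED) but never seen RUNNING stays charged until the NM has received the
// invalidated container token and a later heartbeat confirms it never appeared.
public class AcquiredContainerTracker {

  private final Set<String> pendingConfirmation = new HashSet<>();

  // RM/AM kills a container that is still only ACQUIRED: keep its resources
  // charged on the node and start the confirmation handshake.  The next NM
  // heartbeat response would carry the invalidated container token.
  public void killAcquiredContainer(String containerId) {
    pendingConfirmation.add(containerId);
  }

  // Called on each NM heartbeat after the token invalidation was delivered.
  // Returns true only when it is finally safe to free the node-side resources.
  public boolean onHeartbeat(String containerId, boolean containerSeenOnNode) {
    if (!pendingConfirmation.contains(containerId)) {
      return false;
    }
    if (containerSeenOnNode) {
      // The container raced onto the node after all; treat it as RUNNING and
      // let the normal completion path free the resources later.
      pendingConfirmation.remove(containerId);
      return false;
    }
    pendingConfirmation.remove(containerId);
    return true;  // confirmed absent: safe to release in the SchedulerNode
  }
}
{code}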

> When killing app, RM releases app's resource before they are released by NM
> ---------------------------------------------------------------------------
>
>                 Key: YARN-4148
>                 URL: https://issues.apache.org/jira/browse/YARN-4148
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4148.001.patch, YARN-4148.wip.patch, 
> free_in_scheduler_but_not_node_prototype-branch-2.7.patch
>
>
> When killing an app, the RM scheduler releases the app's resources as soon as 
> possible, then it might allocate those resources for new requests. But the NM 
> has not released them at that time.
> The problem was found when we supported GPU as a resource (YARN-4122).  Test 
> environment: an NM had 6 GPUs, app A used all 6 GPUs, and app B was requesting 
> 3 GPUs. When app A was killed, the RM released A's 6 GPUs and allocated 3 GPUs 
> to B. But when B tried to start a container on the NM, the NM found it didn't 
> have 3 GPUs to allocate because it had not yet released A's GPUs.
> I think the problem also exists for CPU/memory. It might cause OOM when memory 
> is overused.


