[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803122#comment-14803122
 ] 

MENG DING commented on YARN-4138:
---------------------------------

There is an issue with the current logic:

{code:title=RMContainerImpl.java}

+      if (!changeEvent.isIncrease()) {
+        // if this is a decrease request, if container was increased but not
+        // told to NM, we can consider previous increase is cancelled,
+        // unregister from the containerAllocationExpirer
+        container.containerAllocationExpirer.unregister(container
+            .getContainerId());
+      }  
{code}

Right now, when the RM processes a decrease request on a container, it intends 
to cancel any ongoing increase on the same container by unregistering the 
container from the allocation expirer. This is correct only if the target 
resource is less than or equal to the last confirmed resource; otherwise it 
leaves the RM and NM with inconsistent views of the container. For example:

1. A container is using 2G
2. AM requests to increase it from 2G --> 8G, and scheduler allocates it and 
issues token to AM
3. AM never uses the token, but requests to decrease the container from 8G --> 
6G. The scheduler goes ahead and decreases the resource to 6G, and also removes 
the container from the allocation expirer
4. RM notifies NM to decrease resource to 6G, but since NM is still using 2G, 
the decrease message is ignored by NM
5. Now the container has 6G allocation in RM, but 2G allocation in NM.
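
The five steps above can be replayed with a small toy model. This is only a 
sketch: the class, the method names and the string container id are 
hypothetical and are not the actual YARN classes. It assumes the NM ignores a 
"decrease" whose target is not smaller than what the NM currently enforces, as 
described in step 4.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the scenario; NOT the real RMContainerImpl / NM code.
class InconsistencyDemo {

    // Assumed NM behavior: a decrease is applied only if the target is
    // smaller than what the NM currently enforces; otherwise it is ignored.
    static int nmApplyDecrease(int nmCurrentGb, int targetGb) {
        return targetGb < nmCurrentGb ? targetGb : nmCurrentGb;
    }

    // Returns {RM allocation in GB, NM allocation in GB, 1 if still in expirer else 0}.
    static int[] runScenario() {
        String id = "container_1";                // hypothetical container id
        Set<String> expirer = new HashSet<>();    // stands in for the allocation expirer
        int rmGb = 2;                             // step 1: container is using 2G
        int nmGb = 2;

        rmGb = 8;                                 // step 2: increase 2G --> 8G is allocated
        expirer.add(id);                          //         token issued, expiry is tracked

        rmGb = 6;                                 // step 3: decrease 8G --> 6G
        expirer.remove(id);                       //         current logic always unregisters

        nmGb = nmApplyDecrease(nmGb, 6);          // step 4: NM still at 2G, ignores it

        return new int[] {rmGb, nmGb, expirer.contains(id) ? 1 : 0};
    }

    public static void main(String[] args) {
        int[] r = runScenario();
        // step 5: RM and NM now disagree, and nothing is left to roll the RM back
        System.out.println("RM=" + r[0] + "G NM=" + r[1] + "G tracked=" + (r[2] == 1));
    }
}
```

Because the container is no longer tracked by the expirer, nothing ever 
corrects the 6G value on the RM side.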

In this ticket, we will add a last confirmed resource to RMContainer, and I 
propose to unregister the container from the expirer only when the target 
resource is less than or equal to the last confirmed resource. Using the above 
example, after the fix the behavior should be:

1. A container is using 2G
2. AM requests to increase it from 2G --> 8G, and scheduler allocates it and 
issues token to AM
3. AM requests to decrease the container from 8G --> 6G. Scheduler decreases it 
to 6G, but does *not* remove the container from the allocation expirer
4. The increase token expires, and the scheduler reverts the container 
resource from 6G back to 2G.
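
A minimal sketch of the proposed guard, to make the intended behavior 
concrete. All names here (LastConfirmedSketch, lastConfirmedGb, 
onTokenExpired) are hypothetical and stand in for whatever the patch ends up 
using; the update of the last confirmed value on a "safe" decrease is an 
assumption of this sketch, not something decided above.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the proposal; NOT the real RMContainerImpl.
class LastConfirmedSketch {
    final String id;                               // hypothetical container id
    final Set<String> expirer = new HashSet<>();   // stands in for the allocation expirer
    int allocatedGb;                               // what the scheduler has allocated
    int lastConfirmedGb;                           // last size known to be in effect on the NM

    LastConfirmedSketch(String id, int initialGb) {
        this.id = id;
        this.allocatedGb = initialGb;
        this.lastConfirmedGb = initialGb;
    }

    // Scheduler grants an increase and starts tracking the token expiry.
    void increase(int targetGb) {
        allocatedGb = targetGb;
        expirer.add(id);
    }

    // Proposed rule: unregister only if target <= last confirmed resource.
    void decrease(int targetGb) {
        allocatedGb = targetGb;
        if (targetGb <= lastConfirmedGb) {
            expirer.remove(id);
            // Assumption: the NM will honor this decrease, so it becomes
            // the new confirmed size.
            lastConfirmedGb = targetGb;
        }
        // Otherwise the pending increase stays tracked and can still expire.
    }

    // Increase token expired without being used: roll back the allocation.
    void onTokenExpired() {
        if (expirer.remove(id)) {
            allocatedGb = lastConfirmedGb;
        }
    }

    public static void main(String[] args) {
        LastConfirmedSketch c = new LastConfirmedSketch("container_1", 2);
        c.increase(8);       // step 2: 2G --> 8G
        c.decrease(6);       // step 3: 8G --> 6G, still tracked because 6G > 2G
        c.onTokenExpired();  // step 4: rolled back to the last confirmed 2G
        System.out.println("allocated=" + c.allocatedGb + "G");
    }
}
```

With this rule a decrease that lands above the last confirmed resource keeps 
the container in the expirer, so an unused increase token still leads to a 
rollback.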

Let me know if this makes sense or not. If yes, I will come up with a patch 
shortly.

> Roll back container resource allocation after resource increase token expires
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4138
>                 URL: https://issues.apache.org/jira/browse/YARN-4138
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, nodemanager, resourcemanager
>            Reporter: MENG DING
>            Assignee: MENG DING
>         Attachments: YARN-4138-YARN-1197.1.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
