[ 
https://issues.apache.org/jira/browse/YARN-4138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MENG DING updated YARN-4138:
----------------------------
    Attachment: YARN-4138.3.patch

Attached the latest patch, which addresses [~jianhe]'s and [~sandflee]'s comments.

I think the issue brought up by [~jianhe] is about race conditions between a 
normal resource decrease and a resource rollback. The proposed fix is to guard 
resource rollback with the same sequence of locks as the normal resource 
decrease, i.e., lock on application first, then on scheduler.

So with the proposed fix, we can walk through the original example:
1. AM asks to increase 2G -> 8G, and the request is approved by RM
2. AM does not increase the container. Instead, AM asks to decrease to 1G, and 
at the same time the increase expiration logic is triggered:
* If the normal decrease is processed first: RM decreases 8G -> 1G (allocated 
and lastConfirmed are now set to 1G), and then the rollback is processed: RM 
rolls back 1G -> 1G (skip)
* If the rollback is processed first: RM rolls back 8G -> 2G (allocated and 
lastConfirmed are now set to 2G), and then the normal decrease is processed: RM 
decreases 2G -> 1G
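
To make the two orderings concrete, here is a rough sketch of the idea. It is 
not code from the patch: the class, field, and method names are hypothetical, 
and two plain monitor objects stand in for the application and the scheduler. 
It only shows that when both paths take the locks in the same order 
(application first, then scheduler), either ordering ends at 1G.

```java
// Illustrative sketch only: these names are hypothetical, not the actual YARN classes.
class RollbackSketch {
    private final Object application = new Object(); // stands in for the application lock
    private final Object scheduler = new Object();   // stands in for the scheduler lock
    private int allocatedGb;     // size the RM currently accounts for
    private int lastConfirmedGb; // size to fall back to if the increase token expires

    RollbackSketch(int lastConfirmedGb, int allocatedGb) {
        this.lastConfirmedGb = lastConfirmedGb;
        this.allocatedGb = allocatedGb;
    }

    // Normal decrease requested by the AM: application lock first, then scheduler lock.
    String decrease(int targetGb) {
        synchronized (application) {
            synchronized (scheduler) {
                String step = "decrease " + allocatedGb + "G -> " + targetGb + "G";
                allocatedGb = targetGb;
                lastConfirmedGb = targetGb;
                return step;
            }
        }
    }

    // Rollback on increase token expiry: takes the same locks in the same order.
    String rollback() {
        synchronized (application) {
            synchronized (scheduler) {
                String step = "rollback " + allocatedGb + "G -> " + lastConfirmedGb + "G";
                if (allocatedGb == lastConfirmedGb) {
                    return step + " (skip)"; // nothing left to roll back
                }
                allocatedGb = lastConfirmedGb;
                return step;
            }
        }
    }

    int allocatedGb() {
        synchronized (application) {
            synchronized (scheduler) {
                return allocatedGb;
            }
        }
    }

    public static void main(String[] args) {
        // State after step 1: lastConfirmed = 2G, allocated = 8G.
        RollbackSketch a = new RollbackSketch(2, 8);
        System.out.println(a.decrease(1) + ", then " + a.rollback());

        RollbackSketch b = new RollbackSketch(2, 8);
        System.out.println(b.rollback() + ", then " + b.decrease(1));
    }
}
```

Because each operation holds both locks for its whole read-and-update, the two 
operations can only run one after the other, in one of the two orders above.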


> Roll back container resource allocation after resource increase token expires
> -----------------------------------------------------------------------------
>
>                 Key: YARN-4138
>                 URL: https://issues.apache.org/jira/browse/YARN-4138
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: api, nodemanager, resourcemanager
>            Reporter: MENG DING
>            Assignee: MENG DING
>         Attachments: YARN-4138-YARN-1197.1.patch, 
> YARN-4138-YARN-1197.2.patch, YARN-4138.3.patch
>
>
> In YARN-1651, after container resource increase token expires, the running 
> container is killed.
> This ticket will change the behavior such that when a container resource 
> increase token expires, the resource allocation of the container will be 
> reverted back to the value before the increase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
