[ https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619835#comment-16619835 ]

Wangda Tan commented on YARN-8771:
----------------------------------

Nice catch! Thanks [~Tao Yang]. 

Patch LGTM as well. For the test, you can check 
TestCapacitySchedulerWithMultiResourceTypes as an example of how to write unit 
tests for multiple resource types without adding resource-types.xml. 

And I think we should backport this to branch-3.1 as well. 

> CapacityScheduler fails to unreserve when cluster resource contains empty 
> resource type
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-8771
>                 URL: https://issues.apache.org/jira/browse/YARN-8771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when the cluster was almost, but not fully, exhausted 
> (93% used): the scheduler kept allocating for an app but always failed to 
> commit. This can block requests from other apps, and parts of the cluster 
> resource can't be used.
> To reproduce this problem:
> (1) use DominantResourceCalculator;
> (2) the cluster resource has an empty resource type, for example gpu=0;
> (3) the scheduler allocates a container for app1, which has reserved 
> containers and whose queue limit or user limit is reached (used + required > limit). 
> Relevant code in RegularContainerAllocator#assignContainer:
> {code:java}
>     // How much need to unreserve equals to:
>     // max(required - headroom, amountNeedUnreserve)
>     Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
>     Resource resourceNeedToUnReserve =
>         Resources.max(rc, clusterResource,
>             Resources.subtract(capability, headRoom),
>             currentResoureLimits.getAmountNeededUnreserve());
>     boolean needToUnreserve =
>         Resources.greaterThan(rc, clusterResource,
>             resourceNeedToUnReserve, Resources.none());
> {code}
> For example, resourceNeedToUnReserve can be {{<8GB, -6 vcores, 0 gpu>}} when 
> {{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}}; 
> needToUnreserve, the result of {{Resources#greaterThan}}, will then be 
> {{false}}. This is not reasonable, because the required resource did exceed 
> the headroom and unreserving is needed.
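To see why a single dominant-share comparison can misjudge this case, here is a hypothetical standalone sketch (not the actual Hadoop DominantResourceCalculator code; class and method names are made up for illustration). With float arithmetic, the empty gpu dimension produces 0/0 = NaN, NaN poisons the max, and every comparison against NaN is false:

```java
// Hypothetical sketch of a dominant-share comparison over <memory MB, vcores, gpu>.
// Assumes shares are computed as value/clusterTotal in float arithmetic.
public class DominantShareSketch {

    // Simplified dominant share: max over all resource types of value/clusterTotal.
    // With clusterTotal = 0 for gpu, 0f/0f evaluates to NaN, and Math.max
    // returns NaN whenever either argument is NaN.
    static float dominantShare(long[] resource, long[] clusterResource) {
        float share = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < resource.length; i++) {
            share = Math.max(share, (float) resource[i] / clusterResource[i]);
        }
        return share;
    }

    public static void main(String[] args) {
        long[] cluster = {100L * 1024, 32, 0};       // cluster resource, gpu=0 (empty type)
        long[] needToUnreserve = {8L * 1024, -6, 0}; // capability - headRoom from the example

        float share = dominantShare(needToUnreserve, cluster);
        // The gpu dimension yields NaN, so share is NaN, and NaN > 0f is false:
        // "greaterThan none" fails even though 8GB of memory exceeds the headroom.
        System.out.println("dominant share = " + share);
        System.out.println("greater than zero? " + (share > 0f));
    }
}
```

The actual code paths in Hadoop differ, but the float arithmetic above illustrates how an empty resource type in the cluster resource can silently flip the comparison result.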
> After that, when reaching the unreserve process in 
> RegularContainerAllocator#assignContainer, the unreserve process is skipped 
> when shouldAllocOrReserveNewContainer is true (required containers > 
> reserved containers) and needToUnreserve is wrongly calculated to be false:
> {code:java}
>     if (availableContainers > 0) {
>       if (rmContainer == null && reservationsContinueLooking
>           && node.getLabels().isEmpty()) {
>         // unreserve process can be wrongly skipped when
>         // shouldAllocOrReserveNewContainer=true and needToUnreserve=false
>         // but required resource did exceed the headroom
>         if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
>           ...
>         }
>       }
>     }
> {code}
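One way to classify the example correctly is to test each resource dimension separately rather than relying on a single dominant-share comparison. The sketch below is illustrative only (it is not the attached patch, and the class and method names are hypothetical): unreservation is needed if any dimension of (required - headroom) is positive.

```java
// Hypothetical per-dimension check over <memory MB, vcores, gpu>:
// need to unreserve if required exceeds headroom in ANY resource dimension,
// so empty (all-zero) resource types cannot mask a real shortfall.
public class PerResourceCheckSketch {

    static boolean needToUnreserve(long[] required, long[] headroom) {
        for (int i = 0; i < required.length; i++) {
            if (required[i] - headroom[i] > 0) {
                return true; // this dimension exceeds headroom
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] required = {8L * 1024, 2, 0}; // capability from the example
        long[] headroom = {0, 8, 0};         // headRoom from the example
        // Memory (8GB > 0GB) exceeds headroom, so unreserving is needed even
        // though the vcores and gpu dimensions do not.
        System.out.println("need to unreserve? " + needToUnreserve(required, headroom));
    }
}
```

A per-dimension check like this is immune to the gpu=0 case because the zero dimension contributes nothing to the decision instead of dominating it.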



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
