[ https://issues.apache.org/jira/browse/YARN-8771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619835#comment-16619835 ]

Wangda Tan commented on YARN-8771:
----------------------------------

Nice catch! Thanks [~Tao Yang]. The patch LGTM as well. For the test, you can check TestCapacitySchedulerWithMultiResourceTypes as an example of how to write unit tests for multiple resource types without adding resource-types.xml (a rough setup sketch is appended at the end of this message).

And I think we should put this into branch-3.1 as well.

> CapacityScheduler fails to unreserve when cluster resource contains empty
> resource type
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-8771
>                 URL: https://issues.apache.org/jira/browse/YARN-8771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-8771.001.patch, YARN-8771.002.patch
>
>
> We found this problem when the cluster was almost, but not completely,
> exhausted (93% used): the scheduler kept allocating containers for an app
> but always failed to commit them, which can block requests from other apps
> and leave part of the cluster resource unusable.
>
> Steps to reproduce:
> (1) use DominantResourceCalculator
> (2) the cluster resource contains an empty resource type, for example gpu=0
> (3) the scheduler allocates a container for app1, which has reserved
> containers and whose queue limit or user limit has been reached
> (used + required > limit).
>
> Relevant code in RegularContainerAllocator#assignContainer:
> {code:java}
> // How much need to unreserve equals to:
> // max(required - headroom, amountNeedUnreserve)
> Resource headRoom = Resources.clone(currentResoureLimits.getHeadroom());
> Resource resourceNeedToUnReserve =
>     Resources.max(rc, clusterResource,
>         Resources.subtract(capability, headRoom),
>         currentResoureLimits.getAmountNeededUnreserve());
> boolean needToUnreserve =
>     Resources.greaterThan(rc, clusterResource,
>         resourceNeedToUnReserve, Resources.none());
> {code}
> For example, resourceNeedToUnReserve can be <8GB, -6 vcores, 0 gpu> when
> {{headRoom=<0GB, 8 vcores, 0 gpu>}} and {{capability=<8GB, 2 vcores, 0 gpu>}};
> needToUnreserve, the result of {{Resources#greaterThan}}, will then be
> {{false}}. This is not reasonable, because the required resource does exceed
> the headroom and unreserving is needed.
>
> After that, when reaching the unreserve process in
> RegularContainerAllocator#assignContainer, the unreserve step is skipped
> because shouldAllocOrReserveNewContainer is true (required containers > reserved containers)
> and needToUnreserve is wrongly calculated to be false:
> {code:java}
> if (availableContainers > 0) {
>   if (rmContainer == null && reservationsContinueLooking
>       && node.getLabels().isEmpty()) {
>     // the unreserve process can be wrongly skipped when
>     // shouldAllocOrReserveNewContainer=true and needToUnreserve=false,
>     // even though the required resource exceeds the headroom
>     if (!shouldAllocOrReserveNewContainer || needToUnreserve) {
>       ...
>     }
>   }
> }
> {code}
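
To make the {{Resources#greaterThan}} example in the description easier to see outside of Hadoop, here is a small standalone sketch of a dominant-share style comparison. It is not the actual DominantResourceCalculator code (the real class handles divisors differently), and the NaN behaviour shown here is only one plausible way a zero cluster total for a resource type can make such a comparison report false; the class and method names below are made up for illustration.

{code:java}
// Standalone illustration only -- NOT the real DominantResourceCalculator.
// Shows how a resource type whose cluster total is 0 (e.g. gpu=0) can poison
// a dominant-share comparison: 0/0 is NaN, NaN wins in Math.max, and every
// ordered comparison against NaN is false.
public class DominantShareSketch {

  // Hypothetical helper: dominant (maximum) share of res relative to cluster.
  static double dominantShare(long[] res, long[] cluster) {
    double max = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < res.length; i++) {
      double share = (double) res[i] / cluster[i]; // 0/0 -> NaN for gpu
      max = Math.max(max, share);                  // Math.max(x, NaN) == NaN
    }
    return max;
  }

  public static void main(String[] args) {
    long[] cluster = {102400, 100, 0};              // <100GB, 100 vcores, 0 gpu>
    long[] resourceNeedToUnReserve = {8192, -6, 0}; // <8GB, -6 vcores, 0 gpu>
    long[] none = {0, 0, 0};

    double lhs = dominantShare(resourceNeedToUnReserve, cluster); // NaN
    double rhs = dominantShare(none, cluster);                    // NaN
    // NaN > NaN is false, so needToUnreserve would come out as false even
    // though 8GB of required memory clearly exceeds the 0GB memory headroom.
    System.out.println("needToUnreserve=" + (lhs > rhs));
  }
}
{code}

However the real comparison is implemented, the point is the same as in the description: a shortfall in one resource type can be masked when another type is empty or negative.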
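
Regarding the test suggestion above, here is a rough sketch of how a unit test could register an extra resource type programmatically, without shipping a resource-types.xml on the test classpath. The {{yarn.resource-types}} key and the {{ResourceUtils.resetResourceTypes(conf)}} call are assumptions based on how other YARN tests do this; the actual setup should follow TestCapacitySchedulerWithMultiResourceTypes.

{code:java}
// Sketch only -- the config key and ResourceUtils call are assumptions;
// follow TestCapacitySchedulerWithMultiResourceTypes for the real pattern.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.resource.ResourceUtils;

public class MultiResourceTypeTestSetup {

  /** Build a configuration that declares a GPU resource type in code. */
  static YarnConfiguration newConfWithGpu() {
    YarnConfiguration conf = new YarnConfiguration();
    // Declare the extra resource type instead of adding resource-types.xml.
    conf.set("yarn.resource-types", "yarn.io/gpu");
    // Re-initialize the known resource types from this configuration.
    ResourceUtils.resetResourceTypes(conf);
    return conf;
  }
}
{code}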