[ https://issues.apache.org/jira/browse/YARN-11082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bo Li updated YARN-11082:
-------------------------
Description:
We use the cluster resource as the denominator when deciding which resource is dominant in AbstractCSQueue#canAssignToThisQueue. However, the nodes in our cluster are configured differently.
{quote}2021-12-09 10:24:37,069 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1637412555366_1588993_000001 container=null queue=root.a.a1.a2 clusterResource=<memory:175117312, vCores:40222> type=RACK_LOCAL requestedPartition=xx
2021-12-09 10:24:37,069 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue: Used resource=<memory:3381248, vCores:687> exceeded maxResourceLimit of the queue =<memory:3420315, vCores:687>
2021-12-09 10:24:37,069 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Failed to accept allocation proposal{quote}
Even though root.a.a1.a2 had already used 687/687 vcores, the following check in AbstractCSQueue#canAssignToThisQueue still returns false (i.e. the queue-limit check passes, the container is proposed, and the proposal is only rejected later at commit time, as the log shows):
{quote}Resources.greaterThanOrEqual(resourceCalculator, clusterResource, usedExceptKillable, currentLimitResource){quote}
clusterResource = <memory:175117312, vCores:40222>
usedExceptKillable = <memory:3381248, vCores:687>
currentLimitResource = <memory:3420315, vCores:687>

Shares relative to clusterResource, before and after adding the proposed container:
before: memory 3381248/175117312 = 0.01930847362, vCores 687/40222 = 0.01708020486
after: memory 3384320/175117312 = 0.01932601615, vCores 688/40222 = 0.01710506687
In both cases the memory share is the larger one, so DRF treats memory as the dominant resource and the comparison returns false in this scenario, even though the queue's vcore limit is already exhausted.
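The arithmetic above can be reproduced with a short sketch. This is a minimal model of the dominant-share comparison, not the actual YARN DominantResourceCalculator code; the class name, the helper, and the partition totals in the second half are hypothetical (the real capacity of partition "xx" is not in the log).

```java
// Sketch of the dominant-share comparison behind Resources.greaterThanOrEqual
// with a DRF-style calculator, using the values from the log above.
// Class, helper, and partition totals are hypothetical, not YARN API.
public class DominantShareDemo {

    // Dominant share = the larger of the per-resource utilisation ratios.
    static double dominantShare(long mem, long vcores, long totalMem, long totalVcores) {
        return Math.max((double) mem / totalMem, (double) vcores / totalVcores);
    }

    public static void main(String[] args) {
        long usedMem = 3384320L, usedVcores = 688L; // usedExceptKillable incl. proposed container
        long limMem = 3420315L,  limVcores = 687L;  // currentLimitResource

        // Denominator 1: the whole cluster (current behaviour).
        long clMem = 175117312L, clVcores = 40222L;
        boolean blockedByCluster =
            dominantShare(usedMem, usedVcores, clMem, clVcores)
                >= dominantShare(limMem, limVcores, clMem, clVcores);
        // Memory dominates both shares, so the check does NOT block the proposal.
        System.out.println("cluster denominator blocks? " + blockedByCluster);   // false

        // Denominator 2: a hypothetical node-label (partition) total that is
        // much closer to the queue's real capacity. Now vcores dominate and
        // the vcore over-allocation is caught.
        long plMem = 5000000L, plVcores = 700L;
        boolean blockedByPartition =
            dominantShare(usedMem, usedVcores, plMem, plVcores)
                >= dominantShare(limMem, limVcores, plMem, plVcores);
        System.out.println("partition denominator blocks? " + blockedByPartition); // true
    }
}
```

With the cluster as denominator the dominant resource is memory (0.01933 used vs. 0.01953 limit, so the check passes); with a partition-sized denominator the dominant resource flips to vcores (688/700 vs. 687/700) and the limit check trips as the reporter expects.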
> Use node label resource as denominator to decide which resource is dominant
> -----------------------------------------------------------------------------
>
>                 Key: YARN-11082
>                 URL: https://issues.apache.org/jira/browse/YARN-11082
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.1
>            Reporter: Bo Li
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org