Wangda Tan created YARN-10530:
---------------------------------

             Summary: CapacityScheduler ResourceLimits doesn't handle node partition well
                 Key: YARN-10530
                 URL: https://issues.apache.org/jira/browse/YARN-10530
             Project: Hadoop YARN
          Issue Type: Bug
          Components: capacity scheduler, capacityscheduler
            Reporter: Wangda Tan
This is a serious bug that may impact all releases. I need to do further checking, but I want to log the JIRA so we will not forget.

ResourceLimits objects are used for two purposes:

1) When cluster resources change (for example, a new node is added, or the scheduler config is reinitialized), we pass ResourceLimits down to the queues via updateClusterResource.

2) When allocating a container, we pass the parent's available resource down to the child to make sure the child's allocation won't violate the parent's max resource. For example:

{code}
queue        used    max
--------------------------------------
root          10      20
root.a         8      10
root.a.a1      2      10
root.a.a2      6      10
{code}

Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can allocate at most 2 resources to a1, because root.a's limit will be hit first. This information is passed down from parent queue to child queue during the assignContainers call via ResourceLimits.

However, we only pass one ResourceLimits from the top. For queue initialization, we pass in:

{code}
root.updateClusterResource(clusterResource, new ResourceLimits(
    clusterResource));
{code}

And when we update cluster resources, we only consider the default partition:

{code}
// Update all children
for (CSQueue childQueue : childQueues) {
  // Get ResourceLimits of child queue before assign containers
  ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
      clusterResource, resourceLimits, RMNodeLabelsManager.NO_LABEL, false);
  childQueue.updateClusterResource(clusterResource, childLimits);
}
{code}

The allocation logic is the same; we pass in (actually, I found I added a TODO item here 5 years ago):

{code}
// Try to use NON_EXCLUSIVE
assignment = getRootQueue().assignContainers(getClusterResource(), candidates,
    // TODO, now we only consider limits for parent for non-labeled
    // resources, should consider labeled resources as well.
    new ResourceLimits(labelManager
        .getResourceByLabel(RMNodeLabelsManager.NO_LABEL, getClusterResource())),
    SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
{code}

The good thing is that, in the assignContainers call, we calculate the child limit based on the partition:

{code}
ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
    cluster, limits, candidates.getPartition(), true);
{code}

So I think the problem is: when a named partition has more resources than the default partition, the effective min/max resource of each queue could be wrong.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
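The headroom interaction described in the queue table above can be sketched as follows. This is a minimal illustration only, not YARN's actual implementation; the class `HeadroomSketch` and method `effectiveHeadroom` are hypothetical names invented for this example.

```java
// Minimal sketch (hypothetical names): a child's usable headroom is capped
// by both its own max and the parent's remaining room, matching the example:
// root.a (used=8, max=10), root.a.a1 (used=2, max=10).
public class HeadroomSketch {

    /**
     * Headroom the child can actually use: min of the child's own
     * headroom and the parent's remaining resource.
     */
    static int effectiveHeadroom(int childUsed, int childMax,
                                 int parentUsed, int parentMax) {
        int childHeadroom = childMax - childUsed;    // a1: 10 - 2 = 8
        int parentHeadroom = parentMax - parentUsed; // root.a: 10 - 8 = 2
        return Math.min(childHeadroom, parentHeadroom);
    }

    public static void main(String[] args) {
        // a1 nominally has 8 headroom, but root.a only has 2 left,
        // so at most 2 can be allocated to a1.
        System.out.println(effectiveHeadroom(2, 10, 8, 10)); // prints 2
    }
}
```

This is the per-partition calculation that getResourceLimitsOfChild performs during assignContainers; the bug is that the corresponding limit passed from the top is computed only for the default partition.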