[ https://issues.apache.org/jira/browse/YARN-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248087#comment-17248087 ]

Wangda Tan commented on YARN-10530:
-----------------------------------

I haven't written any unit tests yet, but I wanted to file the ticket to make 
sure we take a closer look, because the logic looks confusing. I will be 
delighted if this is a false alarm :) 

> CapacityScheduler ResourceLimits doesn't handle node partition well
> -------------------------------------------------------------------
>
>                 Key: YARN-10530
>                 URL: https://issues.apache.org/jira/browse/YARN-10530
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>            Reporter: Wangda Tan
>            Priority: Blocker
>
> This is a serious bug that may impact all releases. I need to do further 
> checking, but I want to log the JIRA so we will not forget:  
> ResourceLimits objects serve two purposes: 
> 1) When the cluster resource changes, for example when a new node is added or 
> the scheduler config is reinitialized, we pass ResourceLimits to the queues 
> via updateClusterResource. 
> 2) When allocating a container, we pass the parent's available resource down 
> to the child to make sure the child's allocation won't violate the parent's 
> max resource. For example: 
> {code}
> queue         used  max
> --------------------------------------
> root          10    20
> root.a        8     10
> root.a.a1     2     10
> root.a.a2     6     10
> {code}
> Even though a.a1 has 8 resources of headroom (a1.max - a1.used), we can 
> allocate at most 2 resources to a1, because root.a's limit is hit first. This 
> information is passed down from parent queue to child queue via ResourceLimits 
> during the assignContainers call. 
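> To make the arithmetic concrete, here is a minimal sketch of that limit 
> computation (variable names are illustrative, not the actual fields; the real 
> code works on Resource objects rather than ints): 
> {code}
>     // root.a can still grow by max - used = 10 - 8 = 2.
>     int parentHeadroom = 10 - 8;
>     // a1 on its own could grow by max - used = 10 - 2 = 8.
>     int a1Headroom = 10 - 2;
>     // The limit passed down to a1 is the smaller of the two: 2.
>     int effectiveLimit = Math.min(parentHeadroom, a1Headroom);
> {code}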
> However, we only pass one ResourceLimits from the top. For queue 
> initialization, we pass in: 
> {code}
>     root.updateClusterResource(clusterResource, new ResourceLimits(
>         clusterResource));
> {code}
> And when we update the cluster resource, we only consider the default 
> partition: 
> {code}
>       // Update all children
>       for (CSQueue childQueue : childQueues) {
>         // Get ResourceLimits of child queue before assign containers
>         ResourceLimits childLimits = getResourceLimitsOfChild(childQueue,
>             clusterResource, resourceLimits,
>             RMNodeLabelsManager.NO_LABEL, false);
>         childQueue.updateClusterResource(clusterResource, childLimits);
>       }
> {code}
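> A hedged sketch of what a partition-aware version of that loop might look 
> like (getClusterNodeLabelNames() as the source of named partitions and the 
> Map-taking updateClusterResource overload are assumptions for illustration, 
> not existing code): 
> {code}
>       // Illustrative only: compute child limits for the default partition
>       // and every named partition, instead of NO_LABEL alone.
>       for (CSQueue childQueue : childQueues) {
>         Map<String, ResourceLimits> childLimitsByPartition = new HashMap<>();
>         childLimitsByPartition.put(RMNodeLabelsManager.NO_LABEL,
>             getResourceLimitsOfChild(childQueue, clusterResource,
>                 resourceLimits, RMNodeLabelsManager.NO_LABEL, false));
>         for (String partition : labelManager.getClusterNodeLabelNames()) {
>           childLimitsByPartition.put(partition,
>               getResourceLimitsOfChild(childQueue, clusterResource,
>                   resourceLimits, partition, false));
>         }
>         // Hypothetical overload that accepts per-partition limits.
>         childQueue.updateClusterResource(clusterResource,
>             childLimitsByPartition);
>       }
> {code}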
> The same goes for the allocation logic, where we pass in the following 
> (I actually found a TODO item I added 5 years ago): 
> {code}
>     // Try to use NON_EXCLUSIVE
>     assignment = getRootQueue().assignContainers(getClusterResource(),
>         candidates,
>         // TODO, now we only consider limits for parent for non-labeled
>         // resources, should consider labeled resources as well.
>         new ResourceLimits(labelManager
>             .getResourceByLabel(RMNodeLabelsManager.NO_LABEL,
>                 getClusterResource())),
>         SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code} 
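> A hedged sketch of what that TODO seems to ask for, i.e. seeding the root 
> limits from the partition actually being scheduled instead of NO_LABEL (this 
> is an assumption about the intended direction, not a patch): 
> {code}
>     // Try to use NON_EXCLUSIVE
>     assignment = getRootQueue().assignContainers(getClusterResource(),
>         candidates,
>         // Illustrative: resolve the limit against the partition of the
>         // candidate nodes rather than the default partition.
>         new ResourceLimits(labelManager
>             .getResourceByLabel(candidates.getPartition(),
>                 getClusterResource())),
>         SchedulingMode.IGNORE_PARTITION_EXCLUSIVITY);
> {code}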
> The good thing is that in the assignContainers call, we do calculate the 
> child limit based on the partition: 
> {code} 
>           ResourceLimits childLimits =
>               getResourceLimitsOfChild(childQueue, cluster, limits,
>                   candidates.getPartition(), true);
> {code} 
> So I think the problem is: when a named partition has more resources than the 
> default partition, the effective min/max resource of each queue could be 
> wrong.
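> To illustrate the failure mode with made-up numbers: suppose the default 
> partition has 20 resources, a named partition "x" has 100, and a queue is 
> entitled to 50% of each partition. Seeding the top-level ResourceLimits from 
> the default partition bases the queue's effective limits on 20 instead of 
> 100: 
> {code}
>     // Illustrative numbers only.
>     int defaultPartitionResource = 20;
>     int partitionXResource = 100;
>     double queueCapacity = 0.5;
>     // Limit derived from the default partition: 0.5 * 20 = 10.
>     int limitFromDefault = (int) (queueCapacity * defaultPartitionResource);
>     // Limit the queue should see on partition "x": 0.5 * 100 = 50.
>     int limitOnPartitionX = (int) (queueCapacity * partitionXResource);
> {code}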


