[ 
https://issues.apache.org/jira/browse/YARN-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16258977#comment-16258977
 ] 

Wilfred Spiegelenburg commented on YARN-7534:
---------------------------------------------

I would like to work on this one if you don't mind.

I think two things are getting mixed up: the queue's used resources are not 
linked to the node. They are the sum of the resources of all containers from 
applications that run in the queue. A node heartbeat with a changed usage does 
not mean that an application in this queue caused the change. The usage could 
have changed because a different queue/application added a container.
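
To illustrate the difference, here is a minimal sketch with made-up types (the 
class, field and method names are hypothetical and memory-only, this is not the 
scheduler code). Queue usage is summed over the containers the queue owns, on 
any node, while node usage is summed over the containers a node hosts, from any 
queue:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model for illustration; not the Hadoop classes.
final class QueueUsageSketch {

  /** A running container: the queue that owns it, the node that hosts it, its memory. */
  static final class Container {
    final String queue;
    final String node;
    final long memoryMb;

    Container(String queue, String node, long memoryMb) {
      this.queue = queue;
      this.node = node;
      this.memoryMb = memoryMb;
    }
  }

  /** Queue usage: sum over all containers owned by the queue, whatever node they run on. */
  static long queueUsageMb(List<Container> running, String queue) {
    long total = 0;
    for (Container c : running) {
      if (c.queue.equals(queue)) {
        total += c.memoryMb;
      }
    }
    return total;
  }

  /** Node usage: sum over all containers hosted on the node, whatever queue owns them. */
  static long nodeUsageMb(List<Container> running, String node) {
    long total = 0;
    for (Container c : running) {
      if (c.node.equals(node)) {
        total += c.memoryMb;
      }
    }
    return total;
  }

  public static void main(String[] args) {
    List<Container> running = new ArrayList<>(Arrays.asList(
        new Container("a", "n1", 1024),
        new Container("a", "n2", 2048)));
    long queueBefore = queueUsageMb(running, "a");
    long nodeBefore = nodeUsageMb(running, "n1");

    // A different queue adds a container on n1: the node usage changes,
    // the usage of queue "a" does not.
    running.add(new Container("b", "n1", 4096));
    System.out.println("queue a: " + queueBefore + " -> " + queueUsageMb(running, "a"));
    System.out.println("node n1: " + nodeBefore + " -> " + nodeUsageMb(running, "n1"));
  }
}
```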

We're also not allocating anything at this point and have thus not gone over. 
That check is done later, when the application is updated. Here we just have a 
preliminary check to see if we can offer this node to the queue. Another point 
to take into account: we are not checking what the application asked for here. 
That is the next step, which follows just below, when we walk over all the 
applications that have a demand:

{code}
    for (FSAppAttempt sched : fetchAppsWithDemand(true)) {
      if (SchedulerAppUtils.isPlaceBlacklisted(sched, node, LOG)) {
        continue;
      }
      assigned = sched.assignContainer(node);
{code}

This is the earliest point at which we can find out what the ask is. If more 
than one application in the queue has a demand, we walk over the list. We call 
[assignContainer|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L830]
and that is where the checks happen.
One of the checks we perform is in hasContainerForNode for the FSAppAttempt:
{code}
    } else if (!getQueue().fitsInMaxShare(resource)) {
      // The requested container must fit in queue maximum share
      updateAMDiagnosticMsg(resource,
          " exceeds current queue or its parents maximum resource allowed).");

      ret = false;
{code}

This makes the allocation fail, so we drop out and check the next request for 
the application. If all of those fail, we check the next application in the 
list of apps with demand.
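
The behaviour described above can be sketched roughly as follows. This is a 
simplified, memory-only model with hypothetical names (Queue, 
firstRequestThatFits); it assumes the check adds the requested amount to the 
current usage and compares the result with the maximum of the queue and of each 
parent. It is not the actual FSQueue or FSAppAttempt code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the check order; not the Hadoop classes.
final class MaxShareSketch {

  static final class Queue {
    final Queue parent;     // null for the root queue
    final long maxShareMb;  // configured maximum for this queue
    final long usedMb;      // sum of the queue's running containers

    Queue(Queue parent, long maxShareMb, long usedMb) {
      this.parent = parent;
      this.maxShareMb = maxShareMb;
      this.usedMb = usedMb;
    }

    /**
     * True if current usage plus the requested amount stays within the
     * maximum of this queue and of every ancestor queue.
     */
    boolean fitsInMaxShare(long requestMb) {
      if (usedMb + requestMb > maxShareMb) {
        return false;
      }
      return parent == null || parent.fitsInMaxShare(requestMb);
    }
  }

  /**
   * Walks the requests in order and returns the first one that fits,
   * or -1 if none fits. A request that does not fit is skipped and the
   * next one is tried.
   */
  static long firstRequestThatFits(Queue queue, List<Long> requestsMb) {
    for (long requestMb : requestsMb) {
      if (queue.fitsInMaxShare(requestMb)) {
        return requestMb;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    Queue root = new Queue(null, 8192, 6144);
    Queue leaf = new Queue(root, 4096, 3072);
    // The 2048 MB request would push the leaf over its maximum and is
    // skipped; the 1024 MB request still fits.
    System.out.println(firstRequestThatFits(leaf, Arrays.asList(2048L, 1024L)));
  }
}
```

Because the requested size is part of the comparison in this sketch, a request 
that would take the queue past its maximum is rejected before anything is 
assigned.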

Do you have any logs that show that this is not working as it should?


> Fair scheduler assign resources may exceed maxResources
> -------------------------------------------------------
>
>                 Key: YARN-7534
>                 URL: https://issues.apache.org/jira/browse/YARN-7534
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: YunFan Zhou
>
> The current scheduling logic checks whether the resources used by the 
> queue have exceeded *maxResources* before assigning the container. As a 
> result, after assigning this container the queue may use more resources 
> than *maxResources*.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

