[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112051#comment-17112051 ]
Peter Bacsko edited comment on YARN-10283 at 5/20/20, 10:59 AM:
----------------------------------------------------------------

Quick workaround:
{noformat}
if (null == unreservedContainer) {
        // Skip the locality request
        ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
            activitiesManager, node, application, schedulerKey,
            ActivityDiagnosticConstant.
                NODE_CAN_NOT_FIND_CONTAINER_TO_BE_UNRESERVED_WHEN_NEEDED,
            ActivityLevel.NODE);
        return ContainerAllocation.LOCALITY_SKIPPED;
      }
    }
  }

  // ************************************
  // Defends against container allocation
  // ************************************
  if (!node.getLabels().isEmpty() && needToUnreserve) {
    LOG.debug("Using label: {} - needed to unreserve container",
        node.getPartition());
    return ContainerAllocation.LOCALITY_SKIPPED;
  }

  ContainerAllocation result = new ContainerAllocation(unreservedContainer,
      pendingAsk.getPerAllocationResource(), AllocationState.ALLOCATED);
  result.containerNodeType = type;
  result.setToKillContainers(toKillContainers);
  return result;
{noformat}

A better solution is probably to extend {{FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerNode, SchedulerRequestKey, Resource)}} with the partition or create an entirely new method.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10283
>                 URL: https://issues.apache.org/jira/browse/YARN-10283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15 GB, 24 vcores
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were
> added to the partition
> * Both queues have a limit of <memory:5120, vCores:8>
> * Using DominantResourceCalculator
>
> Setup:
> Submit a distributed shell application to highprio with the switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512 MB per
> container.
>
> Chain of events:
> 1. The queue is filled with containers until it reaches usage <memory:2560,
> vCores:5>.
> 2. A node update event is pushed to CS from a node which is part of the
> partition.
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because the usage is
> smaller than the current limit resource <memory:5120, vCores:8>.
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an
> allocated container for <memory:512, vCores:4>.
> 5. But we can't commit the resource request because we would have 9 vcores in
> total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
>
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we succeed with the allocation if there's room
> for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
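The commit-time rejection in the chain of events comes down to checking that usage plus the ask stays within the limit on *every* resource, not just memory. A minimal standalone sketch of that arithmetic with the numbers from the example (plain Java, not scheduler code; the {{fitsWithin}} helper is invented for this illustration):

```java
public class LimitCheck {
    // Returns true only if usage + ask stays within the limit on every
    // resource dimension (the all-dimensions rule the commit step enforces).
    static boolean fitsWithin(int usedMb, int usedVcores,
                              int askMb, int askVcores,
                              int limitMb, int limitVcores) {
        return usedMb + askMb <= limitMb
            && usedVcores + askVcores <= limitVcores;
    }

    public static void main(String[] args) {
        // Queue usage <memory:2560, vCores:5>, ask <memory:512, vCores:4>,
        // limit <memory:5120, vCores:8>: memory fits (3072 <= 5120) but
        // vcores do not (5 + 4 = 9 > 8), so the commit must be rejected.
        System.out.println(fitsWithin(2560, 5, 512, 4, 5120, 8)); // prints "false"
    }
}
```

This is exactly why the tentative allocation in step 4 cannot be committed in step 5: the vcore dimension alone exceeds the queue limit.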
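To illustrate the suggested fix of passing the partition into the unreserve lookup, here is a toy model (the {{Reservation}} record and {{findReservationToUnreserve}} method below are simplified stand-ins invented for this sketch, not YARN's actual classes): only a reservation on the requested partition that is large enough to cover the pending ask is eligible to be released.

```java
import java.util.List;
import java.util.Optional;

// Toy model of a partition-aware unreserve lookup; the types are
// simplified stand-ins, not the real YARN scheduler classes.
public class PartitionAwareUnreserve {

    // A reserved container: which node holds it and which partition it is on.
    record Reservation(String nodeId, String partition, int memoryMb, int vcores) {}

    // Hypothetical partition-aware variant of findNodeToUnreserve():
    // a reservation is a candidate only if it sits on the requested
    // partition and covers the pending ask.
    static Optional<Reservation> findReservationToUnreserve(
            List<Reservation> reservations, String requestedPartition,
            int neededMemoryMb, int neededVcores) {
        return reservations.stream()
                .filter(r -> r.partition().equals(requestedPartition))
                .filter(r -> r.memoryMb() >= neededMemoryMb
                        && r.vcores() >= neededVcores)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Reservation> reserved = List.of(
                new Reservation("node1", "shared", 512, 4),
                new Reservation("node2", "", 1024, 2));

        // A request on the "shared" partition finds the matching reservation...
        System.out.println(findReservationToUnreserve(
                reserved, "shared", 512, 4).orElseThrow().nodeId()); // prints "node1"
        // ...but a request on the default partition cannot steal it.
        System.out.println(findReservationToUnreserve(
                reserved, "", 512, 4).isPresent()); // prints "false"
    }
}
```

With the partition threaded through, a reservation on a different partition never looks like a candidate to unreserve, which is the gap the workaround above papers over with the explicit label check.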