[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112051#comment-17112051 ]
Peter Bacsko edited comment on YARN-10283 at 5/20/20, 10:59 AM:
----------------------------------------------------------------

Quick workaround:
{noformat}
if (null == unreservedContainer) {
        // Skip the locality request
        ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
            activitiesManager, node, application, schedulerKey,
            ActivityDiagnosticConstant.
                NODE_CAN_NOT_FIND_CONTAINER_TO_BE_UNRESERVED_WHEN_NEEDED,
            ActivityLevel.NODE);
        return ContainerAllocation.LOCALITY_SKIPPED;
      }
    }
  }

  // ************************************
  // Defends against container allocation
  // ************************************
  if (!node.getLabels().isEmpty() && needToUnreserve) {
    LOG.debug("Using label: {} - needed to unreserve container",
        node.getPartition());
    return ContainerAllocation.LOCALITY_SKIPPED;
  }

  ContainerAllocation result = new ContainerAllocation(unreservedContainer,
      pendingAsk.getPerAllocationResource(), AllocationState.ALLOCATED);
  result.containerNodeType = type;
  result.setToKillContainers(toKillContainers);
  return result;
{noformat}

A better solution is probably to extend {{FiCaSchedulerApp.findNodeToUnreserve(FiCaSchedulerNode, SchedulerRequestKey, Resource)}} with the partition or create an entirely new method.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
> -------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10283
>                 URL: https://issues.apache.org/jira/browse/YARN-10283
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>
> Recently we've been investigating a scenario where applications submitted to
> a lower priority queue could not get scheduled because a higher priority
> queue in the same hierarchy could not satisfy the allocation request. Both
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15 GB, 24 vcores
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were
> added to the partition
> * Both queues have a limit of <memory:5120, vCores:8>
> * Using DominantResourceCalculator
>
> Setup:
> Submit a distributed shell application to highprio with the switches
> "-num_containers 3 -container_vcores 4". The memory allocation is 512 MB per
> container.
>
> Chain of events:
> 1. The queue is filled with containers until it reaches usage <memory:2560,
> vCores:5>.
> 2. A node update event is pushed to CS from a node which is part of the
> partition.
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because the usage is
> smaller than the current limit resource <memory:5120, vCores:8>.
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an
> allocated container for <memory:512, vCores:4>.
> 5. But we can't commit the resource request because we would have 9 vcores in
> total, violating the limit.
> The problem is that we always try to assign a container for the same
> application in each heartbeat from "highprio". Applications in "lowprio"
> cannot make progress.
>
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case
> well. We only reject the allocation if this condition is satisfied:
> {noformat}
> if (rmContainer == null && reservationsContinueLooking
>     && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we succeed with the allocation if there's room
> for a container.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
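The commit-time rejection in the chain of events comes down to checking that usage plus the ask stays within the limit on *every* resource, not just memory. A minimal standalone sketch of that arithmetic with the numbers from the example (plain Java, not scheduler code; the {{fitsWithin}} helper is invented for this illustration):

```java
public class LimitCheck {
    // Returns true only if usage + ask stays within the limit on every
    // resource dimension (the all-dimensions rule the commit step enforces).
    static boolean fitsWithin(int usedMb, int usedVcores,
                              int askMb, int askVcores,
                              int limitMb, int limitVcores) {
        return usedMb + askMb <= limitMb
            && usedVcores + askVcores <= limitVcores;
    }

    public static void main(String[] args) {
        // Queue usage <memory:2560, vCores:5>, ask <memory:512, vCores:4>,
        // limit <memory:5120, vCores:8>: memory fits (3072 <= 5120) but
        // vcores do not (5 + 4 = 9 > 8), so the commit must be rejected.
        System.out.println(fitsWithin(2560, 5, 512, 4, 5120, 8)); // prints "false"
    }
}
```

This is exactly why the tentative allocation in step 4 cannot be committed in step 5: the vcore dimension alone exceeds the queue limit.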
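To illustrate the suggested fix of passing the partition into the unreserve lookup, here is a toy model (the {{Reservation}} record and {{findReservationToUnreserve}} method below are simplified stand-ins invented for this sketch, not YARN's actual classes): only a reservation on the requested partition that is large enough to cover the pending ask is eligible to be released.

```java
import java.util.List;
import java.util.Optional;

// Toy model of a partition-aware unreserve lookup; the types are
// simplified stand-ins, not the real YARN scheduler classes.
public class PartitionAwareUnreserve {

    // A reserved container: which node holds it and which partition it is on.
    record Reservation(String nodeId, String partition, int memoryMb, int vcores) {}

    // Hypothetical partition-aware variant of findNodeToUnreserve():
    // a reservation is a candidate only if it sits on the requested
    // partition and covers the pending ask.
    static Optional<Reservation> findReservationToUnreserve(
            List<Reservation> reservations, String requestedPartition,
            int neededMemoryMb, int neededVcores) {
        return reservations.stream()
                .filter(r -> r.partition().equals(requestedPartition))
                .filter(r -> r.memoryMb() >= neededMemoryMb
                        && r.vcores() >= neededVcores)
                .findFirst();
    }

    public static void main(String[] args) {
        List<Reservation> reserved = List.of(
                new Reservation("node1", "shared", 512, 4),
                new Reservation("node2", "", 1024, 2));

        // A request on the "shared" partition finds the matching reservation...
        System.out.println(findReservationToUnreserve(
                reserved, "shared", 512, 4).orElseThrow().nodeId()); // prints "node1"
        // ...but a request on the default partition cannot steal it.
        System.out.println(findReservationToUnreserve(
                reserved, "", 512, 4).isPresent()); // prints "false"
    }
}
```

With the partition threaded through, a reservation on a different partition never looks like a candidate to unreserve, which is the gap the workaround above papers over with the explicit label check.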