[
https://issues.apache.org/jira/browse/YARN-11573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895200#comment-17895200
]
Junfan Zhang edited comment on YARN-11573 at 11/4/24 6:45 AM:
--------------------------------------------------------------
This patch is important for multi-node placement; I ran into this bug in
https://issues.apache.org/jira/browse/YARN-11728 as well.
I also think this patch should be enabled by default, otherwise this bug can
make the scheduler hang for some applications.
[~snemeth]
> Add config option to make container allocation prefer nodes without reserved containers
> -----------------------------------------------------------------------------------------
>
> Key: YARN-11573
> URL: https://issues.apache.org/jira/browse/YARN-11573
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.4.0
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Minor
> Labels: pull-request-available
> Fix For: 3.4.0
>
>
> Applications can get stuck when the container allocation logic does not
> consider any more nodes, but only nodes that have reserved containers.
> This behavior can even block new AMs from being allocated on nodes, so those
> applications never reach the RUNNING state.
> A jira that mentions the same thing is YARN-9598:
> {quote}Nodes which have been reserved should be skipped when iterating
> candidates in RegularContainerAllocator#allocate, otherwise scheduler may
> generate allocation or reservation proposal on these node which will always
> be rejected in FiCaScheduler#commonCheckContainerAllocation.
> {quote}
> Since that jira implements the 2 other points, I decided to create this one
> and implement the 3rd point separately.
> h2. Notes:
> 1. FiCaSchedulerApp#commonCheckContainerAllocation will log this:
> {code:java}
> Trying to allocate from reserved container in async scheduling mode
> {code}
> when RegularContainerAllocator creates an allocation or reservation proposal
> for a node that already has a reserved container.
> 2. A better way is to avoid generating an AM container (or even a normal
> container) allocation proposal for a node that already has a reservation on
> it while there are still more nodes to check in the preferred node set.
> Completely preventing task containers from being allocated to worker nodes
> could limit the downscaling ability that we currently have.
> h2. 3. CALL HIERARCHY
> 1.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#nodeUpdate
> 2.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.api.records.NodeId,
> boolean)
> 3.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersToNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.CandidateNodeSet<org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode>,
> boolean)
> 3.1. This is the place where it is decided whether to call
> allocateContainerOnSingleNode or allocateContainersOnMultiNodes (see the
> sketch after this call hierarchy)
> 4.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateContainersOnMultiNodes
> 5.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#allocateOrReserveNewContainers
> 6.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue#assignContainers
> 7.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractParentQueue#assignContainersToChildQueues
> 8.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractLeafQueue#assignContainers
> 9.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#assignContainers
> 10.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainers
> 11.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#allocate
> 12.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#tryAllocateOnNode
> 13.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainersOnNode
> 14.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignNodeLocalContainers
> 15.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator#assignContainer
> Logs these lines as an example:
> {code:java}
> 2023-08-23 17:44:08,129 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator:
> assignContainers: node=<host> application=application_1692304118418_3151
> priority=0 pendingAsk=<per-allocation-resource=<memory:5632,
> vCores:1>,repeat=1> type=OFF_SWITCH
> {code}
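> For step 3.1, the decision between single-node and multi-node allocation looks
> roughly like the sketch below (a simplified paraphrase of
> CapacityScheduler#allocateContainersToNode, not the exact upstream code):
> {code:java}
> // Simplified paraphrase: if multi-node placement is not enabled, allocation
> // happens against the single heartbeating node; otherwise the whole candidate
> // node set is handed to the multi-node path.
> if (!multiNodePlacementEnabled) {
>   FiCaSchedulerNode node = CandidateNodeSetUtils.getSingleNode(candidates);
>   allocateContainerOnSingleNode(candidates, node, withNodeHeartbeat);
> } else {
>   allocateContainersOnMultiNodes(candidates);
> }
> {code}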
> h2. 4. DETAILS OF RegularContainerAllocator#allocate
> [Method
> definition|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L826-L896]
> 4.1. Defining ordered list of nodes to allocate containers on:
> [LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L851-L852]
> {code:java}
> Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
> candidates);
> {code}
> 4.2.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.AppPlacementAllocator#getPreferredNodeIterator
> 4.3.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSortingManager#getMultiNodeSortIterator
>
> ([LINK|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L114-L180])
> In this method, the MultiNodeLookupPolicy is resolved
> [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/placement/MultiNodeSortingManager.java#L142-L143]
> 4.4.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.MultiNodeSorter#getMultiNodeLookupPolicy
> 4.5. This is where the MultiNodeLookupPolicy implementation of
> getPreferredNodeIterator is invoked
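> Putting 4.1.-4.5. together, the candidate-node loop in
> RegularContainerAllocator#allocate looks roughly like the sketch below
> (simplified and paraphrased; see the method definition linked above for the
> exact code):
> {code:java}
> // The preferred-node iterator comes from the resolved MultiNodeLookupPolicy
> // (via AppPlacementAllocator#getPreferredNodeIterator); the allocator then
> // tries the candidates one by one until a proposal is produced.
> Iterator<FiCaSchedulerNode> iter = schedulingPS.getPreferredNodeIterator(
>     candidates);
> while (iter.hasNext()) {
>   FiCaSchedulerNode node = iter.next();
>   // tryAllocateOnNode -> assignContainersOnNode -> ... -> assignContainer
>   ContainerAllocation result = tryAllocateOnNode(clusterResource, node,
>       schedulingMode, resourceLimits, schedulerKey, reservedContainer);
>   if (AllocationState.ALLOCATED == result.getAllocationState()
>       || AllocationState.RESERVED == result.getAllocationState()) {
>     // A proposal was produced for this node; stop iterating.
>     break;
>   }
> }
> {code}
> Note that nothing in this loop currently skips a node just because it already
> has a reservation, which is what the proposed fix below adds.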
> h2. 5. GOING UP THE CALL HIERARCHY UNTIL
> CapacityScheduler#allocateOrReserveNewContainers
> 1. CSAssignment is created
> [here|https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java#L1797-L1801]
> in method: CapacityScheduler#allocateOrReserveNewContainers
> 2.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#submitResourceCommitRequest
> 3.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#tryCommit
> 4.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#accept
> 5.
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp#commonCheckContainerAllocation
> --> This returns false and logs this line:
> {code:java}
> 2023-08-23 17:44:08,130 DEBUG
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
> Trying to allocate from reserved container in async scheduling mode
> {code}
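> For illustration, a simplified paraphrase of the check that produces this
> rejection (the exact field and method names may differ slightly from the
> upstream source):
> {code:java}
> // Inside FiCaSchedulerApp#commonCheckContainerAllocation (paraphrased): if the
> // node already holds a reservation and this proposal does not allocate from
> // that reserved container, the proposal cannot be committed.
> RMContainer reservedOnNode =
>     schedulerContainer.getSchedulerNode().getReservedContainer();
> if (reservedOnNode != null
>     && allocation.getAllocateFromReservedContainer() == null) {
>   LOG.debug("Trying to allocate from reserved container"
>       + " in async scheduling mode");
>   return false;
> }
> {code}
> So a proposal generated for a node that already has a reservation is always
> rejected at commit time, which is why the allocator should not generate such
> proposals in the first place.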
> h2. PROPOSED FIX
> In method: RegularContainerAllocator#allocate
> There's a loop that iterates over candidate nodes:
> [https://github.com/apache/hadoop/blob/9342ecf6ccd5c7ef443a0eb722852d2addc1d5db/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/allocator/RegularContainerAllocator.java#L853-L895]
> We need to skip nodes that already have a reservation; example code:
> {code:java}
> if (reservedContainer == null) {
>   // Do not schedule if there are any reservations to fulfill on the node
>   if (node.getReservedContainer() != null) {
>     LOG.debug("Skipping scheduling on node {} since it has already been"
>         + " reserved by {}", node.getNodeID(),
>         node.getReservedContainer().getContainerId());
>     ActivitiesLogger.APP.recordSkippedAppActivityWithoutAllocation(
>         activitiesManager, node, application, schedulerKey,
>         ActivityDiagnosticConstant.NODE_HAS_BEEN_RESERVED);
>     continue;
>   }
> {code}
> NOTE: This code block is copied from [^YARN-9598.001.patch#file-5]
> h2. More notes for the implementation
> 1. This new behavior needs to be hidden behind a feature flag (CS config).
> In my understanding, [^YARN-9598.001.patch#file-5] skips all nodes with
> reservations, regardless of whether the container is an AM container or a
> task container.
> 2. Only skip a node with an existing reservation if there are more nodes left
> to process in the iterator (see the sketch below).
> 3. Add a testcase to cover this scenario.
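> A rough sketch of how notes 1. and 2. could fit together inside the
> candidate-node loop of RegularContainerAllocator#allocate. Both the config key
> name and the {{csConf}} handle below are hypothetical placeholders, not the
> final names introduced by this jira:
> {code:java}
> // Hypothetical feature flag (CS config); disabled by default so existing
> // behavior is unchanged unless the flag is turned on.
> boolean skipNodesWithReservations = csConf.getBoolean(
>     "yarn.scheduler.capacity.skip-nodes-with-reserved-containers", false);
>
> while (iter.hasNext()) {
>   FiCaSchedulerNode node = iter.next();
>   if (skipNodesWithReservations
>       && reservedContainer == null
>       && node.getReservedContainer() != null
>       && iter.hasNext()) {
>     // Skip only when the flag is enabled AND at least one more candidate node
>     // remains in the iterator (note 2.), so we never skip every node.
>     continue;
>   }
>   // ... existing allocation logic on this node ...
> }
> {code}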
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]