[ https://issues.apache.org/jira/browse/YARN-9598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859689#comment-16859689 ]
Juanjuan Tian edited comment on YARN-9598 at 6/10/19 3:47 AM:
---------------------------------------------------------------

Hi Tao,

# As discussed in YARN-9576, a re-reservation proposal may always be generated on the same node and break scheduling for this app and for later apps. I think re-reservation is unnecessary and we can replace it with LOCALITY_SKIPPED so the scheduler has a chance to look at the following candidates for this app when multi-node scheduling is enabled (a toy sketch of this skip is included after the code below). On the other hand, if re-reservation is disabled, shouldAllocOrReserveNewContainer may return false in most cases, so even if the scheduler gets a chance to look at other candidates, it may not assign containers.
# After this patch, the assignment returned by FiCaSchedulerApp#assignContainers can never be null even if it is just skipped. Thus, even when only one of the candidates has a container reserved on it, the assignment returned from allocateFromReservedContainer will still never be null, and the normal scheduling process is still broken. So I'm wondering why we don't just handle this case like single-node and change the logic in CapacityScheduler#allocateContainersOnMultiNodes as below:

!image-2019-06-10-11-37-44-975.png!

/*
 * New behavior, allocate containers considering multiple nodes
 */
private CSAssignment allocateContainersOnMultiNodes(
    {color:#d04437}FiCaSchedulerNode schedulerNode{color}) {
  // Backward compatible way to make sure previous behavior which allocation
  // driven by node heartbeat works.
  if (getNode(schedulerNode.getNodeID()) != schedulerNode) {
    LOG.error("Trying to schedule on a removed node, please double check.");
    return null;
  }

  // Assign new containers...
  // 1. Check for reserved applications
  // 2. Schedule if there are no reservations
  RMContainer reservedRMContainer = schedulerNode.getReservedContainer();
  {color:#d04437}if (reservedRMContainer != null) {{color}
    allocateFromReservedContainer(schedulerNode, false, reservedRMContainer);
  }

  // Do not schedule if there are any reservations to fulfill on the node
  if (schedulerNode.getReservedContainer() != null) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Skipping scheduling since node " + schedulerNode.getNodeID()
          + " is reserved by application " + schedulerNode.getReservedContainer()
          .getContainerId().getApplicationAttemptId());
    }
    return null;
  }

  {color:#d04437}PlacementSet<FiCaSchedulerNode> ps = getCandidateNodeSet(schedulerNode);{color}
  // When this time look at multiple nodes, try schedule if the
  // partition has any available resource or killable resource
  if (getRootQueue().getQueueCapacities().getUsedCapacity(
      ps.getPartition()) >= 1.0f
      && preemptionManager.getKillableResource(
          CapacitySchedulerConfiguration.ROOT, ps.getPartition())
      == Resources.none()) {
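To illustrate the skip-instead-of-re-reserve idea from item 1, here is a minimal, self-contained sketch. It is not YARN code: the Node class, tryNode and schedule are hypothetical stand-ins, and only the LOCALITY_SKIPPED outcome mirrors the result discussed in the comment. It only models how answering an already-reserved node with a skip lets the multi-node iterator move on to the next candidate instead of re-reserving the same node every scheduling cycle.

{code:java}
import java.util.Arrays;
import java.util.List;

/**
 * Toy model (not YARN code): when multi-node lookup is enabled, a node that
 * already holds a reservation for the request is answered with
 * LOCALITY_SKIPPED instead of a re-reservation, so the iterator can move on
 * to the remaining candidates. All class and method names here are
 * hypothetical stand-ins for the real scheduler types.
 */
public class MultiNodeSkipSketch {

  enum Result { ALLOCATED, RESERVED, LOCALITY_SKIPPED }

  static class Node {
    final String id;
    final int availableMb;
    final boolean reservedForThisRequest;   // already holds our reservation?
    Node(String id, int availableMb, boolean reserved) {
      this.id = id;
      this.availableMb = availableMb;
      this.reservedForThisRequest = reserved;
    }
  }

  /** Try one node for a request of the given size. */
  static Result tryNode(Node node, int requestMb) {
    if (node.reservedForThisRequest) {
      // Old behavior would re-reserve here; the proposal is to skip instead
      // so later candidates still get a chance in the same scheduling cycle.
      return Result.LOCALITY_SKIPPED;
    }
    if (node.availableMb >= requestMb) {
      return Result.ALLOCATED;
    }
    return Result.RESERVED;                  // reserve and wait for resources
  }

  /** Iterate candidates the way a multi-node policy would. */
  static String schedule(List<Node> candidates, int requestMb) {
    for (Node node : candidates) {
      Result r = tryNode(node, requestMb);
      if (r == Result.ALLOCATED || r == Result.RESERVED) {
        return r + " on " + node.id;
      }
      // LOCALITY_SKIPPED: fall through to the next candidate.
    }
    return "request skipped in this cycle";
  }

  public static void main(String[] args) {
    List<Node> candidates = Arrays.asList(
        new Node("n1", 1024, true),    // already reserved for this request
        new Node("n2", 8192, false));  // has room for the container
    // With re-reservation replaced by a skip, n2 gets the container instead
    // of n1 being re-reserved again and again.
    System.out.println(schedule(candidates, 4096));
  }
}
{code}

Running it prints "ALLOCATED on n2": the request lands on the second candidate instead of piling re-reservations onto n1.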
> Make reservation work well when multi-node enabled
> --------------------------------------------------
>
>                 Key: YARN-9598
>                 URL: https://issues.apache.org/jira/browse/YARN-9598
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9598.001.patch, image-2019-06-10-11-37-43-283.png, image-2019-06-10-11-37-44-975.png
>
>
> This issue is to solve problems about reservation when multi-node scheduling is enabled:
> # As discussed in YARN-9576, a re-reservation proposal may always be generated on the same node and break the scheduling for this app and later apps. I think re-reservation is unnecessary and we can replace it with LOCALITY_SKIPPED to let the scheduler have a chance to look at the following candidates for this app when multi-node scheduling is enabled.
> # The scheduler iterates all nodes and tries to allocate for the reserved container in LeafQueue#allocateFromReservedContainer. Here there are two problems:
> ** The node of the reserved container should be taken as the candidate instead of all nodes when calling FiCaSchedulerApp#assignContainers, otherwise the scheduler may later generate a reservation-fulfilled proposal on another node, which will always be rejected in FiCaScheduler#commonCheckContainerAllocation.
> ** The assignment returned by FiCaSchedulerApp#assignContainers can never be null even if it's just skipped, which will break the normal scheduling process for this leaf queue because of the if clause in LeafQueue#assignContainers: "if (null != assignment) { return assignment; }"
> # Nodes which have already been reserved should be skipped when iterating candidates in RegularContainerAllocator#allocate, otherwise the scheduler may generate allocation or reservation proposals on these nodes which will always be rejected in FiCaScheduler#commonCheckContainerAllocation (a toy sketch of this skip follows below).
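To make the third item above concrete, here is a minimal, self-contained sketch of skipping already-reserved nodes while iterating candidates. It is not YARN code: the Node class, the allocate method and the class name are hypothetical stand-ins for the real RegularContainerAllocator types; it only models the idea that a node already holding a reservation is passed over so no doomed proposal is generated on it.

{code:java}
import java.util.Arrays;
import java.util.List;

/**
 * Toy model (not YARN code) of the third item above: while iterating
 * candidates for a new allocation or reservation, nodes that already hold a
 * reservation are skipped up front, because any proposal on them would be
 * rejected by the common allocation check anyway. Names are hypothetical.
 */
public class SkipReservedCandidatesSketch {

  static class Node {
    final String id;
    final int availableMb;
    final boolean hasReservedContainer;   // reserved by some application
    Node(String id, int availableMb, boolean reserved) {
      this.id = id;
      this.availableMb = availableMb;
      this.hasReservedContainer = reserved;
    }
  }

  /** Try the candidates in order, skipping nodes that are already reserved. */
  static String allocate(List<Node> candidates, int requestMb) {
    for (Node node : candidates) {
      if (node.hasReservedContainer) {
        // Skip instead of producing a proposal that would be rejected later.
        continue;
      }
      if (node.availableMb >= requestMb) {
        return "allocated on " + node.id;
      }
      return "reserved on " + node.id;
    }
    return "skipped in this cycle";
  }

  public static void main(String[] args) {
    List<Node> candidates = Arrays.asList(
        new Node("n1", 8192, true),    // already reserved: passed over
        new Node("n2", 8192, false));  // clean node: gets the allocation
    System.out.println(allocate(candidates, 4096));
  }
}
{code}

Running it prints "allocated on n2": the reserved node n1 is passed over rather than producing a proposal that the common check would reject.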