[ https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784388#comment-15784388 ]
Tao Yang commented on YARN-6029:
--------------------------------

Thanks [~wangda]. Updated the priority to Critical and attached a new patch for review.

The patch still needs indentation added inside the new synchronized blocks. The diff below leaves the existing whitespace unchanged so that only the real changes show:
{code}
   @Override
-  public synchronized CSAssignment assignContainers(Resource clusterResource,
+  public CSAssignment assignContainers(Resource clusterResource,
       FiCaSchedulerNode node, ResourceLimits currentResourceLimits,
       SchedulingMode schedulingMode) {
+    synchronized (this) {
     updateCurrentResourceLimits(currentResourceLimits, clusterResource);
     if (LOG.isDebugEnabled()) {
@@ -906,6 +907,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
     }
     setPreemptionAllowed(currentResourceLimits, node.getPartition());
+    }
     // Check for reserved resources
     RMContainer reservedContainer = node.getReservedContainer();
@@ -923,6 +925,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
       }
     }
+    synchronized (this) {
     // if our queue cannot access this node, just return
     if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY
         && !accessibleToPartition(node.getPartition())) {
@@ -1019,6 +1022,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
     return CSAssignment.NULL_ASSIGNMENT;
   }
+    }
{code}

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to
> release a reserved container
> ----------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-6029.001.patch, YARN-6029.002.patch, deadlock.jstack
>
>
> The ResourceManager can deadlock when ParentQueue#getQueueUserAclInfo is
> called (e.g. a client calls YarnClient#getQueueAclsInfo) while
> LeafQueue#assignContainers is running and has not yet notified the parent
> queue to release resources (here, to release a reserved container).
> I found this problem in our testing environment for Hadoop 2.8.
> Steps that reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls the synchronized
> LeafQueue#assignContainers and acquires the LeafQueue instance lock of queue
> root.a.
> * 2. Thread B (IPC Server handler) calls the synchronized
> ParentQueue#getQueueUserAclInfo and acquires the ParentQueue instance lock of
> queue root. It iterates over the child queues' ACLs and blocks when calling
> the synchronized LeafQueue#getQueueUserAclInfo, because the LeafQueue
> instance lock of queue root.a is held by Thread A.
> * 3. Thread A wants to inform the parent queue that a container is being
> completed and blocks when invoking the synchronized
> ParentQueue#internalReleaseResource, because the ParentQueue instance lock of
> queue root is held by Thread B.
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be
> removed to solve this problem, since this method does not appear to modify
> any fields of the LeafQueue instance.
> Patch with unit test attached for review.
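
The lock ordering in steps 1 to 3 above can be reproduced in isolation. The following is a minimal sketch, not YARN code. The class LockOrderSketch, the method deadlocks and the two ReentrantLock objects are hypothetical stand-ins for the LeafQueue (root.a) and ParentQueue (root) instance monitors. ReentrantLock is used instead of synchronized only so that the stuck threads can be interrupted and cleaned up afterwards. The flag models the two variants: holding the leaf lock while calling into the parent (the original synchronized method), or releasing it first (the narrowed synchronized blocks in the diff).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    // Acquire and immediately release a lock, standing in for a call
    // to a synchronized method on the other queue.
    private static void lockThenUnlock(ReentrantLock lock) throws InterruptedException {
        lock.lockInterruptibly();
        lock.unlock();
    }

    /** Returns true if both threads are still blocked after the wait, i.e. they deadlocked. */
    public static boolean deadlocks(boolean holdLeafLockWhileCallingParent)
            throws InterruptedException {
        ReentrantLock leafLock = new ReentrantLock();    // stand-in for LeafQueue root.a
        ReentrantLock parentLock = new ReentrantLock();  // stand-in for ParentQueue root
        // Ensures each thread holds its first lock before either requests its second.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Thread A: models assignContainers, which then notifies the parent queue.
        Thread a = new Thread(() -> {
            try {
                leafLock.lockInterruptibly();
                try {
                    bothHoldFirstLock.countDown();
                    bothHoldFirstLock.await();
                    if (holdLeafLockWhileCallingParent) {
                        lockThenUnlock(parentLock);      // leaf -> parent while holding leaf
                    }
                } finally {
                    leafLock.unlock();
                }
                if (!holdLeafLockWhileCallingParent) {
                    lockThenUnlock(parentLock);          // parent called with leaf lock released
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller to break the deadlock; just exit.
            }
        }, "Thread-A");

        // Thread B: models ParentQueue#getQueueUserAclInfo iterating over child queues.
        Thread b = new Thread(() -> {
            try {
                parentLock.lockInterruptibly();
                try {
                    bothHoldFirstLock.countDown();
                    bothHoldFirstLock.await();
                    lockThenUnlock(leafLock);            // parent -> leaf while holding parent
                } finally {
                    parentLock.unlock();
                }
            } catch (InterruptedException e) {
                // Interrupted by the caller to break the deadlock; just exit.
            }
        }, "Thread-B");

        a.start();
        b.start();
        a.join(500);
        b.join(500);
        boolean stuck = a.isAlive() && b.isAlive();
        // Clean up: interrupting is a no-op for threads that already finished.
        a.interrupt();
        b.interrupt();
        a.join();
        b.join();
        return stuck;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("leaf lock held while calling parent: deadlock=" + deadlocks(true));
        System.out.println("leaf lock released before calling parent: deadlock=" + deadlocks(false));
    }
}
```

In this sketch the first variant always ends with both threads blocked, because each holds the lock the other needs. The second variant completes, because Thread A no longer holds the leaf lock when it asks for the parent lock. Removing synchronized from LeafQueue#getQueueUserAclInfo, as proposed in the description, breaks the same cycle from Thread B's side.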