[ 
https://issues.apache.org/jira/browse/YARN-6029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15784388#comment-15784388
 ] 

Tao Yang commented on YARN-6029:
--------------------------------

Thanks [~wangda].
Updated the priority to Critical and attached a new patch for review.
The patch re-indents the code that now sits inside the synchronized blocks.
With whitespace-only changes left out, the diff looks like this:
{code}
   @Override
-  public synchronized CSAssignment assignContainers(Resource clusterResource,
+  public CSAssignment assignContainers(Resource clusterResource,
       FiCaSchedulerNode node, ResourceLimits currentResourceLimits,
       SchedulingMode schedulingMode) {
+    synchronized (this) {
       updateCurrentResourceLimits(currentResourceLimits, clusterResource);

       if (LOG.isDebugEnabled()) {
@@ -906,6 +907,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
       }

       setPreemptionAllowed(currentResourceLimits, node.getPartition());
+    }

     // Check for reserved resources
     RMContainer reservedContainer = node.getReservedContainer();
@@ -923,6 +925,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,
       }
     }

+    synchronized (this) {
       // if our queue cannot access this node, just return
       if (schedulingMode == SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY
           && !accessibleToPartition(node.getPartition())) {
@@ -1019,6 +1022,7 @@ public synchronized CSAssignment assignContainers(Resource clusterResource,

       return CSAssignment.NULL_ASSIGNMENT;
     }
+  }
{code}
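
The idea behind the patch can be sketched in isolation. This is only a minimal illustration, not the LeafQueue code: the classes Parent and Leaf, the fields limit and reserved, and the method names are all hypothetical. It shows a method that is no longer synchronized as a whole, so the call into the parent happens while no leaf monitor is held.

```java
import java.util.ArrayList;
import java.util.List;

public class NarrowedLockSketch {

    /** Hypothetical stand-in for a parent queue. */
    static class Parent {
        final List<Boolean> leafLockHeldOnRelease = new ArrayList<>();

        synchronized void releaseResource(Leaf leaf) {
            // Record whether the calling thread still holds the leaf's monitor.
            leafLockHeldOnRelease.add(Thread.holdsLock(leaf));
        }
    }

    /** Hypothetical stand-in for a leaf queue. */
    static class Leaf {
        private final Parent parent;
        private int limit;
        private volatile boolean reserved = true;

        Leaf(Parent parent) {
            this.parent = parent;
        }

        // Not synchronized as a whole; only sections touching leaf state are guarded.
        boolean assign(int newLimit) {
            synchronized (this) {
                limit = newLimit;
            }

            // Outside any synchronized block: the parent is entered without the leaf lock.
            if (reserved) {
                reserved = false;
                parent.releaseResource(this);
                return true;
            }

            synchronized (this) {
                return limit > 0;
            }
        }
    }

    public static void main(String[] args) {
        Parent parent = new Parent();
        Leaf leaf = new Leaf(parent);
        leaf.assign(4);
        // false means the leaf monitor was not held during the parent call.
        System.out.println(parent.leafLockHeldOnRelease);
    }
}
```

Because the parent is only ever entered with the leaf monitor released, a thread that holds the parent lock and waits for the leaf lock can no longer be part of a cycle with this method.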

> CapacityScheduler deadlock when ParentQueue#getQueueUserAclInfo is called by 
> Thread_A at the moment that Thread_B calls LeafQueue#assignContainers to 
> release a reserved container
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6029
>                 URL: https://issues.apache.org/jira/browse/YARN-6029
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Critical
>         Attachments: YARN-6029.001.patch, YARN-6029.002.patch, deadlock.jstack
>
>
> The ResourceManager can deadlock if ParentQueue#getQueueUserAclInfo is 
> called (e.g. when a client calls YarnClient#getQueueAclsInfo) after 
> LeafQueue#assignContainers has started but before it notifies the parent 
> queue to release resources (i.e. while it is about to release a reserved 
> container). I found this problem in our testing environment for Hadoop 2.8.
> Steps to reproduce the deadlock, in chronological order:
> * 1. Thread A (ResourceManager Event Processor) calls synchronized 
> LeafQueue#assignContainers (acquiring the LeafQueue instance lock of queue 
> root.a)
> * 2. Thread B (IPC Server handler) calls synchronized 
> ParentQueue#getQueueUserAclInfo (acquiring the ParentQueue instance lock of 
> queue root), iterates over the ACLs of its child queues and blocks when 
> calling synchronized LeafQueue#getQueueUserAclInfo (the LeafQueue instance 
> lock of queue root.a is held by Thread A)
> * 3. Thread A wants to inform the parent queue that a container is being 
> completed and blocks when invoking the synchronized 
> ParentQueue#internalReleaseResource method (the ParentQueue instance lock of 
> queue root is held by Thread B)
> I think the synchronized modifier of LeafQueue#getQueueUserAclInfo can be 
> removed to solve this problem, since this method does not appear to modify 
> any fields of the LeafQueue instance.
> Attached a patch with a unit test for review.
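
The lock-order inversion in the three steps above can be reproduced in miniature. This is a hedged sketch, not scheduler code: the class and variable names are hypothetical, and it swaps the synchronized methods for ReentrantLock with a timed tryLock so that the demonstration terminates instead of hanging. Latches force both threads to hold their first lock before either tries the second.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionSketch {

    /** Returns {A got the parent lock, B got the leaf lock}. */
    static boolean[] run() {
        ReentrantLock leaf = new ReentrantLock();    // stands in for the root.a monitor
        ReentrantLock parent = new ReentrantLock();  // stands in for the root monitor
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // Thread A: leaf first, then parent (like assignContainers releasing a reservation).
        Thread a = new Thread(() ->
            attempt(leaf, parent, bothHoldFirst, bothTried, gotSecond, 0));
        // Thread B: parent first, then leaf (like getQueueUserAclInfo walking children).
        Thread b = new Thread(() ->
            attempt(parent, leaf, bothHoldFirst, bothTried, gotSecond, 1));

        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gotSecond;
    }

    private static void attempt(ReentrantLock first, ReentrantLock second,
                                CountDownLatch bothHoldFirst, CountDownLatch bothTried,
                                boolean[] gotSecond, int index) {
        first.lock();
        try {
            // Wait until the other thread also holds its first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();

            // With plain synchronized this would block forever; here it times out.
            boolean got = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            gotSecond[index] = got;

            // Keep the first lock until both attempts have finished.
            bothTried.countDown();
            bothTried.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        boolean[] result = run();
        System.out.println("A acquired parent: " + result[0]);
        System.out.println("B acquired leaf: " + result[1]);
    }
}
```

Neither thread obtains its second lock, which is the cycle described in the steps: each one holds what the other needs.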



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

