[jira] [Created] (YARN-11843) Fix potential deadlock when auto-correction of container allocation is enabled

Tao Yang (Jira) Mon, 04 Aug 2025 19:23:07 -0700

Tao Yang created YARN-11843:
-------------------------------

             Summary: Fix potential deadlock when auto-correction of container 
allocation is enabled
                 Key: YARN-11843
                 URL: https://issues.apache.org/jira/browse/YARN-11843
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler
    Affects Versions: 3.5.0
            Reporter: Tao Yang
            Assignee: Tao Yang



The container-allocation auto-correction feature introduced in YARN-11702 can 
deadlock. When the feature is enabled, the scheduler holds an application-level 
write lock while it tries to acquire a queue-level write lock.
 
Root Cause:
 - autoCorrectContainerAllocation is called while the application-level write 
lock is held
 - It calls completedContainer() directly, which acquires the queue-level write 
lock
{code:java}
CapacityScheduler#allocate
   --> ...
       application.getWriteLock().lock();   // 1. acquires the app write lock
       try {
           ...
           AbstractYarnScheduler#autoCorrectContainerAllocation
              --> AbstractYarnScheduler#completedContainer
                  --> AbstractYarnScheduler#completedContainerInternal
                      --> AbstractLeafQueue#completedContainer
                          writeLock.lock();  // 2. acquires the queue write lock
                          try {
                              ...
                              FiCaSchedulerApp#containerCompleted
                                  // 3. acquires the app write lock
                          } finally {
                              writeLock.unlock();
                          }
       } finally {
           application.getWriteLock().unlock();
       }
{code}

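 - This inverts the lock order and can deadlock. The normal container 
completion path, which can run on another thread through 
AbstractYarnScheduler#completedContainer, takes the queue write lock first and 
then the application write lock, while the path above takes them in the 
opposite order.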
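
The inversion described above can be reproduced outside YARN with two plain 
write locks. The sketch below is an illustrative model only: 
LockOrderInversionDemo, tryInOrder and wouldDeadlock are hypothetical names, 
not YARN code. It uses tryLock with a timeout so that the demo reports the 
stuck state instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative model only; none of these names are YARN classes.
class LockOrderInversionDemo {

    // Takes `first`, waits until the peer thread holds its own first lock,
    // then tries `second` with a timeout instead of blocking forever.
    static boolean tryInOrder(Lock first, Lock second,
                              CountDownLatch bothHoldFirst,
                              CountDownLatch bothTried) throws InterruptedException {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            // Keep `first` held until the peer has finished its own attempt.
            bothTried.countDown();
            bothTried.await();
            return acquired;
        } finally {
            first.unlock();
        }
    }

    // Returns true when neither thread could get its second lock, which is
    // the state that a plain lock() call would turn into a deadlock.
    static boolean wouldDeadlock() {
        Lock appLock = new ReentrantReadWriteLock().writeLock();
        Lock queueLock = new ReentrantReadWriteLock().writeLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Models the allocate path: app lock first, then queue lock.
        Thread allocatePath = new Thread(() -> {
            try {
                acquired[0] = tryInOrder(appLock, queueLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Models the completion path: queue lock first, then app lock.
        Thread completionPath = new Thread(() -> {
            try {
                acquired[1] = tryInOrder(queueLock, appLock, bothHoldFirst, bothTried);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        allocatePath.start();
        completionPath.start();
        try {
            allocatePath.join();
            completionPath.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !acquired[0] && !acquired[1];
    }
}
```

Each thread keeps its first lock until both have attempted the second one, so 
neither attempt can succeed. With lock() in place of tryLock, both threads 
would wait on each other indefinitely.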
 

Solution:
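Replace the direct completedContainer() calls in the 
autoCorrectContainerAllocation method with asyncContainerRelease(), so that the 
queue-level write lock is no longer acquired while the application-level write 
lock is held.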
Before:
{code:java}
completedContainer(rmContainer, ...); // Direct call causes deadlock {code}
After:
{code:java}
asyncContainerRelease(rmContainer); // Async call avoids deadlock {code}
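
The issue does not show how asyncContainerRelease() works internally. As a 
rough sketch of the general idea only, the model below records the release 
while the application lock is held and performs it later, in queue-then-app 
order. DeferredReleaseSketch and all of its members are hypothetical names and 
do not reflect the actual YARN implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative model only; not the YARN implementation.
class DeferredReleaseSketch {
    private final ReentrantReadWriteLock.WriteLock appLock =
            new ReentrantReadWriteLock().writeLock();
    private final ReentrantReadWriteLock.WriteLock queueLock =
            new ReentrantReadWriteLock().writeLock();
    private final BlockingQueue<String> pendingReleases = new LinkedBlockingQueue<>();
    private final List<String> released = new ArrayList<>();

    // Models the allocate path: under the app lock, surplus containers are
    // only recorded for later release. The queue lock is never taken here.
    void allocate(List<String> surplusContainerIds) {
        appLock.lock();
        try {
            pendingReleases.addAll(surplusContainerIds);
        } finally {
            appLock.unlock();
        }
    }

    // Models the deferred handler: it must run without the app lock held,
    // and it takes the locks in the same order as the normal completion
    // path (queue first, then app).
    int drainReleases() {
        if (appLock.isHeldByCurrentThread()) {
            throw new IllegalStateException("must not be called under the app lock");
        }
        List<String> batch = new ArrayList<>();
        pendingReleases.drainTo(batch);
        for (String containerId : batch) {
            queueLock.lock();
            try {
                appLock.lock();
                try {
                    released.add(containerId);
                } finally {
                    appLock.unlock();
                }
            } finally {
                queueLock.unlock();
            }
        }
        return batch.size();
    }

    // Returns a snapshot of the containers released so far.
    List<String> released() {
        appLock.lock();
        try {
            return new ArrayList<>(released);
        } finally {
            appLock.unlock();
        }
    }
}
```

Because every path in this model takes the queue lock before the app lock, the 
inverted order from the trace above no longer occurs.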



