Tao Yang created YARN-11843:
-------------------------------
Summary: Fix potential deadlock when auto-correction of container
allocation is enabled
Key: YARN-11843
URL: https://issues.apache.org/jira/browse/YARN-11843
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 3.5.0
Reporter: Tao Yang
Assignee: Tao Yang
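The auto-correction of container allocation introduced in YARN-11702 has a
potential deadlock. When the feature is enabled, a thread that holds an
application-level write lock tries to acquire a queue-level write lock, while
the normal container completion path acquires the same two locks in the
opposite order.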
Root Cause:
- autoCorrectContainerAllocation is called while the application-level write
lock is held
- It directly calls completedContainer(), which acquires the queue-level write
lock
{code:java}
CapacityScheduler#allocate
--> ...
    application.getWriteLock().lock(); // 1. requires app writeLock!!!
    try {
      ...
      AbstractYarnScheduler#autoCorrectContainerAllocation
      --> AbstractYarnScheduler#completedContainer
      --> AbstractYarnScheduler#completedContainerInternal
      --> AbstractLeafQueue#completedContainer
          writeLock.lock(); // 2. requires queue writeLock!!!
          try {
            ...
            FiCaSchedulerApp#containerCompleted
            // 3. requires app writeLock!!!
          } finally {
            writeLock.unlock();
          }
    } finally {
      application.getWriteLock().unlock();
    }
{code}
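- This violates the lock hierarchy (queue-level lock first, then
application-level lock) and can deadlock, because another thread may call
AbstractYarnScheduler#completedContainer during normal container completion
and take the queue-level write lock before it requests the application-level
write lock.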
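The lock-order inversion described above can be reproduced in isolation. The following is only a minimal sketch, not YARN code: appLock, queueLock, lockInOrder and the class name are hypothetical stand-ins, and a timed tryLock replaces the blocking lock() so that the demo reports the conflict instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderInversionDemo {
    /**
     * Holds the write lock of `first`, waits until the peer thread holds its
     * own first lock, then makes a timed attempt on the write lock of `second`.
     * Returns true only if the second lock was obtained.
     */
    static boolean lockInOrder(ReentrantReadWriteLock first,
                               ReentrantReadWriteLock second,
                               CountDownLatch bothHoldFirst,
                               CountDownLatch bothAttempted) {
        first.writeLock().lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();          // both threads now hold one lock each
            boolean gotSecond = second.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.writeLock().unlock();
            }
            bothAttempted.countDown();
            bothAttempted.await();          // keep `first` until the peer has tried too
            return gotSecond;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.writeLock().unlock();
        }
    }

    /** Index 0: app -> queue order (allocate path); index 1: queue -> app order. */
    static boolean[] run() {
        ReentrantReadWriteLock appLock = new ReentrantReadWriteLock();    // hypothetical
        ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();  // hypothetical
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        Thread allocate = new Thread(() ->
            gotSecond[0] = lockInOrder(appLock, queueLock, bothHoldFirst, bothAttempted));
        Thread complete = new Thread(() ->
            gotSecond[1] = lockInOrder(queueLock, appLock, bothHoldFirst, bothAttempted));
        allocate.start();
        complete.start();
        try {
            allocate.join();
            complete.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gotSecond;
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("allocate path got queue lock: " + r[0]);
        System.out.println("completion path got app lock: " + r[1]);
    }
}
```

Both lines print false: each thread holds the lock the other one needs. With a plain lock() call in place of the timed tryLock, neither thread would ever return.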
Solution:
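Replace the direct completedContainer() calls in the
autoCorrectContainerAllocation method with asyncContainerRelease(), so that
the release no longer acquires the queue-level write lock while the
application-level write lock is held.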
Before:
{code:java}
completedContainer(rmContainer, ...); // Direct call causes deadlock
{code}
After:
{code:java}
asyncContainerRelease(rmContainer); // Async call avoids deadlock
{code}
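Why deferring the release removes the inversion can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation of asyncContainerRelease(): it assumes the asynchronous release amounts to handing the work to another thread that takes the locks in queue-then-application order. The class, its fields and its methods are hypothetical, and a single-thread executor stands in for whatever mechanism the scheduler really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class AsyncReleaseSketch {
    // Hypothetical stand-ins for the application-level and queue-level locks.
    private final ReentrantReadWriteLock appLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock queueLock = new ReentrantReadWriteLock();
    // Guarded by appLock.
    private final List<String> released = new ArrayList<>();

    /** Completion path: queue lock first, then application lock. */
    void completeContainer(String containerId) {
        queueLock.writeLock().lock();
        try {
            appLock.writeLock().lock();
            try {
                released.add(containerId);
            } finally {
                appLock.writeLock().unlock();
            }
        } finally {
            queueLock.writeLock().unlock();
        }
    }

    /**
     * Allocate path: while holding the application lock it only enqueues the
     * releases; it never requests the queue lock itself.
     */
    List<String> allocateAndRelease(List<String> extraContainers) {
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        appLock.writeLock().lock();
        try {
            for (String id : extraContainers) {
                dispatcher.execute(() -> completeContainer(id)); // deferred, not inline
            }
        } finally {
            appLock.writeLock().unlock();
        }
        dispatcher.shutdown();
        try {
            if (!dispatcher.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("release tasks did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        appLock.readLock().lock();
        try {
            return new ArrayList<>(released);
        } finally {
            appLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(new AsyncReleaseSketch().allocateAndRelease(List.of("c1", "c2")));
    }
}
```

A deferred task may start while the allocate path still holds the application lock. It then takes the queue lock and waits for the application lock, which the allocate path releases without ever needing the queue lock, so the wait always ends.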