[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Ziqian updated YARN-9163:
----------------------------
    Description: 
We have a cluster with 4000+ nodes and 100,000+ apps per day in our production
environment. When we run the CLI command yarn rmadmin -refreshQueues, the active
RM's process gets stuck and HA failover does not happen, which means the whole
cluster stops serving and we can only recover by rebooting the active RM. We can
reproduce this on our production cluster every time, but cannot reproduce it in
our test environment, which has only 100+ nodes and few apps. Both our production
and test environments use the CapacityScheduler with async scheduling and
preemption enabled.

Analyzing a jstack dump of the active RM, we found a deadlock:

Thread one (refreshQueues thread):
 * holds the write lock of the CapacityScheduler
 * holds the write lock of the PreemptionManager
 * waits for the read lock of the root queue

Thread two (asyncScheduleThread):
 * holds the read lock of the root queue
 * waits for the write lock of the PreemptionManager

Thread three (IPC handler on port 8030, which handles allocate calls):
 * waits for the write lock of the root queue

These three threads form a deadlock cycle.
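
For illustration, here is a minimal standalone Java sketch of the cycle. The lock
and thread names are stand-ins for the scheduler internals rather than the actual
CapacityScheduler fields, and the sleeps only make the fatal interleaving likely,
not guaranteed:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeadlockSketch {
    // Stand-ins for the scheduler's internal locks (illustrative names).
    static final ReentrantReadWriteLock preemptionManagerLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) throws Exception {
        Thread refresh = new Thread(() -> {
            preemptionManagerLock.writeLock().lock(); // holds PreemptionManager write lock
            sleep(200);
            rootQueueLock.readLock().lock();          // blocks: a writer (ipc) is already queued
        }, "refreshQueues");

        Thread async = new Thread(() -> {
            rootQueueLock.readLock().lock();          // holds root queue read lock
            sleep(200);
            preemptionManagerLock.writeLock().lock(); // blocks: refresh holds it
        }, "asyncSchedule");

        Thread ipc = new Thread(() -> {
            sleep(100);
            rootQueueLock.writeLock().lock();         // blocks behind async's read lock, and its
                                                      // queued write request blocks refresh's read
        }, "ipc-8030");

        async.start(); ipc.start(); refresh.start();
        refresh.join(); // never returns: three-way deadlock
    }
}
{code}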

 

The deadlock happens because of a "bug" in ReentrantReadWriteLock: a queued
writeLock request blocks future readLock requests even under the non-fair policy
([https://bugs.openjdk.java.net/browse/JDK-6893626]).
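
This behavior is easy to demonstrate in isolation: once a writer is queued, a new
read-lock attempt from another thread blocks even though the lock is only
read-held and was created with the default non-fair policy. A self-contained
sketch:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PendingWriterBlocksReader {
    public static void main(String[] args) throws Exception {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(); // non-fair by default

        // Thread A holds the read lock for two seconds.
        new Thread(() -> {
            lock.readLock().lock();
            try { Thread.sleep(2000); } catch (InterruptedException ignored) { }
            lock.readLock().unlock();
        }).start();
        Thread.sleep(100);

        // Thread B queues a write request; it blocks behind the held read lock.
        new Thread(() -> lock.writeLock().lock()).start();
        Thread.sleep(100);

        // A timed read-lock attempt from a third thread now waits behind the
        // queued writer, even though the lock is only read-held (JDK-6893626).
        boolean acquired = lock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
        System.out.println("new read lock acquired: " + acquired); // prints false
    }
}
{code}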
To solve this problem, we changed the logic of the refreshQueues thread: it takes
a copy of the queue info first, so that the thread never holds the write lock of
the PreemptionManager while waiting for the read lock of the root queue.
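
A minimal sketch of that idea; the class and method names here are illustrative
only and do not match the actual YARN-9163 patch code:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RefreshSketch {
    static final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock preemptionManagerLock = new ReentrantReadWriteLock();

    // Stand-in for the copied queue hierarchy.
    static final class QueueSnapshot { }

    static QueueSnapshot copyQueueInfo() {
        rootQueueLock.readLock().lock(); // only the root queue read lock is held here
        try {
            return new QueueSnapshot();  // deep-copy the queue info in reality
        } finally {
            rootQueueLock.readLock().unlock();
        }
    }

    static void refreshQueues() {
        // Step 1: snapshot the queues before taking any other lock.
        QueueSnapshot snapshot = copyQueueInfo();

        // Step 2: update the PreemptionManager from the snapshot. The thread
        // no longer waits for the root queue read lock while holding the
        // PreemptionManager write lock, which breaks the deadlock cycle.
        preemptionManagerLock.writeLock().lock();
        try {
            // apply `snapshot` to the preemption manager ...
        } finally {
            preemptionManagerLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        refreshQueues();
        System.out.println("refresh completed without nested lock acquisition");
    }
}
{code}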

 

We tested the new code in our production environment, and the refresh queue
command now works well.

 

> Deadlock when using yarn rmadmin -refreshQueues
> ---------------------------------------------
>
>                 Key: YARN-9163
>                 URL: https://issues.apache.org/jira/browse/YARN-9163
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Major
>         Attachments: YARN-9163.001.patch
>


