[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Ziqian updated YARN-9163:
----------------------------
    Attachment: rm.jstack.ziqian.log

> Deadlock when use yarn rmadmin -refreshQueues
> ---------------------------------------------
>
>                 Key: YARN-9163
>                 URL: https://issues.apache.org/jira/browse/YARN-9163
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Blocker
>         Attachments: YARN-9163.001.patch, rm.jstack.ziqian.log
>
>
> We have a production cluster with 4000+ nodes and 100,000+ apps per day.
> When we run the CLI command yarn rmadmin -refreshQueues, the active RM
> process gets stuck and HA failover does not happen, so the whole cluster
> stops serving and we can only recover by restarting the active RM. We can
> reproduce this on our production cluster every time, but not in our test
> environment, which has only 100+ nodes and few apps. Both our production
> and test environments use CapacityScheduler with asynchronous scheduling
> and preemption enabled.
> Analyzing the jstack of the active RM, we found a deadlock:
> thread one (refreshQueues thread):
>  * holds the write lock of the capacity scheduler
>  * holds the write lock of PreemptionManager
>  * waits for the read lock of the root queue
> thread two (asyncScheduleThread):
>  * holds the read lock of the root queue
>  * waits for the write lock of PreemptionManager
> thread three (IPC handler on port 8030, which handles allocate calls):
>  * waits for the write lock of the root queue
> Together, these three threads form a deadlock.
>  
> The deadlock happens because of a "bug" in ReadWriteLock: a pending
> writeLock request blocks later readLock requests even under the unfair
> policy (https://bugs.openjdk.java.net/browse/JDK-6893626). To solve this
> problem, we changed the logic of the refreshQueues thread: it gets a copy
> of the queue info first, so that it never takes the write lock of
> PreemptionManager and the read lock of the root queue at the same time.
>  
> We tested our new code in our production environment, and the refresh
> queue command works well.
>  
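The ReadWriteLock behavior cited in the description can be reproduced outside YARN. The following is a minimal sketch, not code from the ResourceManager: the class and method names are hypothetical, and the three threads only stand in for the roles named in the jstack analysis. It assumes a non-fair java.util.concurrent.locks.ReentrantReadWriteLock and relies on short sleeps for ordering, so it is timing-dependent.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Hadoop.
class QueuedWriterBlocksReaders {

    /** Returns true if a new reader got the read lock while a writer was queued. */
    static boolean newReaderAcquires() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(false); // non-fair
        CountDownLatch readerHolds = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Stands in for asyncScheduleThread: holds the read lock for a while.
            Thread firstReader = new Thread(() -> {
                lock.readLock().lock();
                try {
                    readerHolds.countDown();
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    lock.readLock().unlock();
                }
            });
            firstReader.start();
            readerHolds.await();

            // Stands in for the IPC handler: queues up for the write lock.
            Thread writer = new Thread(() -> {
                lock.writeLock().lock();
                lock.writeLock().unlock();
            });
            writer.start();
            while (!lock.hasQueuedThread(writer)) {
                Thread.sleep(10);
            }
            Thread.sleep(100); // let the writer settle at the head of the queue

            // Stands in for the refreshQueues thread: a new read-lock request.
            // The timed tryLock honors the queue, unlike the untimed tryLock().
            boolean acquired = lock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
            if (acquired) {
                lock.readLock().unlock();
            }

            release.countDown();
            firstReader.join();
            writer.join();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("new reader acquired: " + newReaderAcquires());
    }
}
```

If the lock behaves as the description says, the new reader times out even though only another reader holds the lock, because the queued writer is ahead of it. In the RM, that reader also holds the PreemptionManager write lock, which the first reader is waiting for, so nothing can make progress.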

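The fix described above (copy the queue info first, then update the preemption state) can be sketched as follows. This is an illustration of the lock-ordering idea only, not the contents of YARN-9163.001.patch: the class, fields, and methods are all hypothetical, and plain maps stand in for the real queue and PreemptionManager state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; names do not correspond to real YARN classes.
class SnapshotThenUpdate {
    private final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock preemptionLock = new ReentrantReadWriteLock();

    // Guarded by rootQueueLock.
    private final Map<String, Float> queueCapacities = new HashMap<>();
    // Guarded by preemptionLock.
    private final Map<String, Float> preemptionView = new HashMap<>();

    void setCapacity(String queue, float capacity) {
        rootQueueLock.writeLock().lock();
        try {
            queueCapacities.put(queue, capacity);
        } finally {
            rootQueueLock.writeLock().unlock();
        }
    }

    /** Refreshes the preemption view without ever holding both locks at once. */
    void refresh() {
        Map<String, Float> copy;
        // Step 1: copy the queue info under the root queue read lock, then release it.
        rootQueueLock.readLock().lock();
        try {
            copy = new HashMap<>(queueCapacities);
        } finally {
            rootQueueLock.readLock().unlock();
        }
        // Step 2: only now take the preemption write lock and apply the copy.
        preemptionLock.writeLock().lock();
        try {
            preemptionView.clear();
            preemptionView.putAll(copy);
        } finally {
            preemptionLock.writeLock().unlock();
        }
    }

    Map<String, Float> view() {
        preemptionLock.readLock().lock();
        try {
            return new HashMap<>(preemptionView);
        } finally {
            preemptionLock.readLock().unlock();
        }
    }
}
```

Because refresh() releases the root queue read lock before requesting the preemption write lock, it can no longer be one edge of the cycle from the jstack analysis. The trade-off is that the copy may be slightly stale by the time it is applied.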


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
