[ https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735423#comment-16735423 ]
Hu Ziqian commented on YARN-9163:
---------------------------------

[~leftnoteasy], here's the jstack. [^rm.jstack.ziqian.log]

> Deadlock when use yarn rmadmin -refreshQueues
> ---------------------------------------------
>
>                 Key: YARN-9163
>                 URL: https://issues.apache.org/jira/browse/YARN-9163
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Blocker
>         Attachments: YARN-9163.001.patch, rm.jstack.ziqian.log
>
>
> We have a cluster with 4000+ nodes and 100,000+ apps per day in our
> production environment. When we run the CLI command
> yarn rmadmin -refreshQueues, the active RM's process gets stuck and HA
> failover does not happen, which means the whole cluster stops serving
> and we can only fix it by restarting the active RM. We can reproduce this
> on our production cluster every time, but not in our test environment,
> which only has 100+ nodes and few apps. Both our production and test
> environments use CapacityScheduler with asynchronous scheduling and
> preemption enabled.
>
> Analyzing the jstack of the active RM, we found a deadlock in it:
>
> thread one (refreshQueues thread):
>  * holds the write lock of the capacity scheduler
>  * holds the write lock of PreemptionManager
>  * waits for the read lock of the root queue
>
> thread two (asyncScheduleThread):
>  * holds the read lock of the root queue
>  * waits for the write lock of PreemptionManager
>
> thread three (IPC handler on port 8030, which handles allocate requests):
>  * waits for the write lock of the root queue
>
> These three threads form a deadlock.
>
> The deadlock happens because of a "bug" in ReadWriteLock: a pending
> writeLock request blocks future readLock requests even under the non-fair
> policy (https://bugs.openjdk.java.net/browse/JDK-6893626). Thread one
> therefore cannot get the root queue read lock while thread three is
> queued for the write lock, even though only a read lock is currently held.
> To solve this problem, we change the logic of the refreshQueues thread: it
> takes a copy of the queue info first, so it never holds the write lock of
> PreemptionManager and requests the read lock of the root queue at the
> same time.
>
> We tested the new code in our production environment and the refresh queue
> command works well.
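
The ReadWriteLock behavior described above can be reproduced outside YARN. The following is a minimal sketch, not the YARN code: the class, method, and variable names are hypothetical, and a plain non-fair ReentrantReadWriteLock stands in for the root queue lock. One thread holds the read lock (like asyncScheduleThread), a second thread queues for the write lock (like the IPC handler), and a third caller then asks for the read lock (like the refreshQueues thread). A timed tryLock is used for the last step so the demo reports the blocking instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; none of these names come from YARN.
class QueuedWriterDemo {

    /** Returns true if a new reader timed out behind a queued writer. */
    static boolean newReaderBlockedByQueuedWriter() {
        // Non-fair mode (the default); stands in for the root queue lock.
        ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock(false);
        CountDownLatch readHeld = new CountDownLatch(1);
        CountDownLatch releaseRead = new CountDownLatch(1);

        // Plays the async scheduling thread: holds the read lock.
        Thread reader = new Thread(() -> {
            rootQueueLock.readLock().lock();
            try {
                readHeld.countDown();
                releaseRead.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                rootQueueLock.readLock().unlock();
            }
        });

        // Plays the IPC handler: waits for the write lock.
        Thread writer = new Thread(() -> {
            rootQueueLock.writeLock().lock();
            rootQueueLock.writeLock().unlock();
        });

        try {
            reader.start();
            readHeld.await();
            writer.start();
            // Wait until the writer is parked in the lock's wait queue.
            while (!(rootQueueLock.hasQueuedThread(writer)
                    && writer.getState() == Thread.State.WAITING)) {
                Thread.sleep(10);
            }
            // Plays the refreshQueues thread: a second reader arrives while
            // only a read lock is held. The timed tryLock honors the queue,
            // unlike the untimed tryLock(), which barges.
            boolean acquired =
                    rootQueueLock.readLock().tryLock(500, TimeUnit.MILLISECONDS);
            if (acquired) {
                rootQueueLock.readLock().unlock();
            }
            return !acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Let the first reader go so the writer can finish.
            releaseRead.countDown();
        }
    }

    public static void main(String[] args) {
        System.out.println("new reader blocked by queued writer: "
                + newReaderBlockedByQueuedWriter());
    }
}
```

In the real deadlock the first reader never releases, because it is itself waiting for the PreemptionManager write lock held by the refreshQueues thread, so the cycle cannot break. The fix described above avoids the cycle by not requesting the root queue read lock while the PreemptionManager write lock is held.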

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org