[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735651#comment-16735651
 ] 

Weiwei Yang commented on YARN-9163:
---

Hi [~ziqian hu]

Thanks for the updates. I am afraid that unless this can be reproduced on a 
community version, we are unable to track it down. As this is listed as a release 
blocker, I am going to close this Jira for now. If you find a way to reproduce it, 
feel free to reopen. Let's unblock the 3.1.2 release first. Thanks.

> Deadlock when use yarn rmadmin -refreshQueues
> -
>
> Key: YARN-9163
> URL: https://issues.apache.org/jira/browse/YARN-9163
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Blocker
> Attachments: YARN-9163.001.patch, rm.jstack.ziqian.log
>
>
> We have a cluster with 4,000+ nodes and 100,000+ applications per day in our 
> production environment. When we run the CLI command yarn rmadmin -refreshQueues, 
> the active RM's process gets stuck and HA failover does not happen, which means 
> the whole cluster stops serving and we can only recover by restarting the active 
> RM. We can reproduce this on our production cluster every time, but cannot 
> reproduce it in our test environment, which has only 100+ nodes and few apps. 
> Both our production and test environments use CapacityScheduler with 
> asynchronous scheduling and preemption enabled.
> Analyzing the jstack of the active RM, we found a deadlock in it:
> thread one (refreshQueues thread):
>  * holds the write lock of the CapacityScheduler
>  * holds the write lock of the PreemptionManager
>  * waits for the read lock of the root queue
> thread two (asyncScheduleThread):
>  * holds the read lock of the root queue
>  * waits for the write lock of the PreemptionManager
> thread three (IPC handler on port 8030, which handles allocate calls):
>  * waits for the write lock of the root queue
> These three threads form a deadlock.
>  
> The deadlock happens because of a "bug" in ReadWriteLock: a pending writeLock 
> request blocks future readLock requests even under the non-fair policy 
> ([https://bugs.openjdk.java.net/browse/JDK-6893626]). To solve this problem, we 
> change the logic of the refreshQueues thread: it takes a copy of the queue info 
> first, so it does not need to hold the write lock of the PreemptionManager and 
> the read lock of the root queue at the same time.
>  
> We tested our new code in our production environment, and the refresh queues 
> command works well.
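For illustration only (this is not code from the attached patch), here is a minimal 
standalone Java sketch of the lock cycle described above. It uses two non-fair 
ReentrantReadWriteLocks as stand-ins for the PreemptionManager lock and the root 
queue lock; all class names, thread names, and timings are made up:

{code:java}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Stand-ins for the PreemptionManager lock and the root queue lock (non-fair).
public class RefreshQueuesDeadlockSketch {
  static final ReentrantReadWriteLock preemptionManagerLock = new ReentrantReadWriteLock();
  static final ReentrantReadWriteLock rootQueueLock = new ReentrantReadWriteLock();

  public static void main(String[] args) throws InterruptedException {
    // Thread two (asyncScheduleThread): holds the root queue read lock,
    // then waits for the PreemptionManager write lock.
    Thread asyncScheduleThread = new Thread(() -> {
      rootQueueLock.readLock().lock();
      sleep(300);                                  // let the other two threads queue up
      preemptionManagerLock.writeLock().lock();    // blocks: held by refreshQueues
      preemptionManagerLock.writeLock().unlock();
      rootQueueLock.readLock().unlock();
    }, "asyncScheduleThread");

    // Thread three (IPC handler on 8030): queues a write-lock request on the root queue.
    Thread ipcHandler = new Thread(() -> {
      sleep(100);
      rootQueueLock.writeLock().lock();            // blocks: read lock held by asyncScheduleThread
      rootQueueLock.writeLock().unlock();
    }, "ipcHandler");

    // Thread one (refreshQueues): holds the PreemptionManager write lock, then asks
    // for a new root queue read lock. Even with a non-fair lock, this read request
    // queues behind ipcHandler's pending writer (JDK-6893626), closing the cycle.
    Thread refreshQueues = new Thread(() -> {
      preemptionManagerLock.writeLock().lock();
      sleep(200);
      rootQueueLock.readLock().lock();             // parks behind the queued writer
      rootQueueLock.readLock().unlock();
      preemptionManagerLock.writeLock().unlock();
    }, "refreshQueues");

    asyncScheduleThread.start();
    ipcHandler.start();
    refreshQueues.start();

    asyncScheduleThread.join(5000);
    // Still alive after 5 seconds: all three threads are parked, mirroring the stuck RM.
    System.out.println("deadlocked: " + asyncScheduleThread.isAlive());
  }

  static void sleep(long millis) {
    try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
{code}

After roughly five seconds the program prints "deadlocked: true" and then hangs: the 
new readLock() request of "refreshQueues" parks behind the queued writer even though 
"asyncScheduleThread" already holds the read lock, matching the jstack described above.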






[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-06 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735431#comment-16735431
 ] 

Hadoop QA commented on YARN-9163:
-

| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 0s | Docker mode activated. |
| -1 | patch | 0m 6s | YARN-9163 does not apply to trunk. Rebase required? Wrong branch? See https://wiki.apache.org/hadoop/HowToContribute for help. |

|| Subsystem || Report/Notes ||
| JIRA Issue | YARN-9163 |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23002/console |
| Powered by | Apache Yetus 0.8.0   http://yetus.apache.org |


This message was automatically generated.









[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-06 Thread Hu Ziqian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735423#comment-16735423
 ] 

Hu Ziqian commented on YARN-9163:
-

[~leftnoteasy], here's the jstack: [^rm.jstack.ziqian.log]







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-06 Thread Hu Ziqian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735424#comment-16735424
 ] 

Hu Ziqian commented on YARN-9163:
-

[~cheersyang], actually we use an internal version of Hadoop based on 2.8, with the 
global scheduler backported into it. I'm not sure which community version matches it.







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-06 Thread Hu Ziqian (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16735426#comment-16735426
 ] 

Hu Ziqian commented on YARN-9163:
-

[~Tao Yang],

Sorry, we haven't found any new evidence yet.







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-02 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732634#comment-16732634
 ] 

Tao Yang commented on YARN-9163:


Hi [~ziqian hu]
We have discussed this problem offline before. It does not look like a normal 
deadlock in the jstack (these three stacks alone should not block one another, since 
a read lock can be held by multiple threads at the same time), and it is not clear 
whether thread one is actually waiting for the read lock. In our last discussion we 
concluded that the JDK bug is a suspect rather than the confirmed cause, and I 
suggested doing further investigation. Do you have any new evidence that this 
problem is indeed caused by the JDK bug?
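For reference, the writer-preference behaviour itself can be demonstrated in 
isolation. Below is a minimal standalone sketch (not RM code, purely illustrative) 
in which a second, non-reentrant reader fails to acquire a non-fair 
ReentrantReadWriteLock because a writer is already queued, which is the behaviour 
reported in JDK-6893626:

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class QueuedWriterBlocksNewReaders {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();   // non-fair by default

    lock.readLock().lock();                        // reader #1 holds the read lock

    Thread writer = new Thread(() -> {             // a writer queues up behind reader #1
      lock.writeLock().lock();
      lock.writeLock().unlock();
    });
    writer.start();
    Thread.sleep(200);                             // give the writer time to enqueue

    Thread reader2 = new Thread(() -> {            // a second, non-reentrant reader
      try {
        // Expected to print false on JDKs exhibiting the JDK-6893626 behaviour:
        // the new read request parks behind the queued writer even though the
        // lock is non-fair and another reader already holds it.
        boolean acquired = lock.readLock().tryLock(1, TimeUnit.SECONDS);
        System.out.println("second reader acquired read lock: " + acquired);
        if (acquired) {
          lock.readLock().unlock();
        }
      } catch (InterruptedException ignored) {
      }
    });
    reader2.start();
    reader2.join();

    lock.readLock().unlock();                      // let the queued writer finish
    writer.join();
  }
}
{code}

If the JDK in use exhibits this behaviour, thread one's readLock() request could 
legitimately park behind thread three's queued writeLock() while thread two still 
holds the read lock, which would complete the cycle described in the report.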







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-02 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732255#comment-16732255
 ] 

Wangda Tan commented on YARN-9163:
--

[~ziqian hu], could you upload the jstack output, or at least the full stack traces 
of the three threads you mentioned? I couldn't locate the issue you described.







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-02 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732150#comment-16732150
 ] 

Weiwei Yang commented on YARN-9163:
---

+[~Tao Yang] in the loop







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-02 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732149#comment-16732149
 ] 

Weiwei Yang commented on YARN-9163:
---

Hi [~ziqian hu]

This seems to be a blocker, or at least critical, so I have increased the severity. 
If possible, could you please attach the jstack file as well?

Also, what YARN version is this cluster running? It looks like this only happens 
when async scheduling is enabled. I've seen some large deployments with async 
scheduling enabled on a 3.1 base, but have not seen this issue before.

Thanks







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2019-01-02 Thread Sunil Govindan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732143#comment-16732143
 ] 

Sunil Govindan commented on YARN-9163:
--

cc [~cheersyang] [~leftnoteasy]

I can see that the JDK bug is marked as a duplicate. Given that it is fixed in the 
latest Java versions, is this change necessary?







[jira] [Commented] (YARN-9163) Deadlock when use yarn rmadmin -refreshQueues

2018-12-29 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730648#comment-16730648
 ] 

Hadoop QA commented on YARN-9163:
-

| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 16s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 23m 56s | trunk passed |
| +1 | compile | 1m 0s | trunk passed |
| +1 | checkstyle | 0m 43s | trunk passed |
| +1 | mvnsite | 1m 2s | trunk passed |
| +1 | shadedclient | 15m 17s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 1m 28s | trunk passed |
| +1 | javadoc | 0m 36s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 58s | the patch passed |
| +1 | compile | 0m 53s | the patch passed |
| +1 | javac | 0m 53s | the patch passed |
| -0 | checkstyle | 0m 41s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 9 new + 91 unchanged - 1 fixed = 100 total (was 92) |
| +1 | mvnsite | 0m 55s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 39s | patch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 37s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) |
| +1 | javadoc | 0m 35s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 105m 29s | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 28s | The patch does not generate ASF License warnings. |
| | | 171m 16s | |

|| Reason || Tests ||
| FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager |
| | Should org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$ClonedQueue be a _static_ inner class?  At CapacityScheduler.java:[lines 792-806] |
| Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacitySchedulerSurgicalPreemption |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9163 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12953299/YARN-9163.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 16ab5811e1b5 4.4.0-138-generic #164~14.04.1-