[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244050#comment-14244050 ]
Rohith commented on YARN-2946:
------------------------------

Another deadlock detected in a different flow after fixing the previous deadlock :-( By convention, all locks should be acquired in the order *StateMachine.doTransition() -> zkRMStateStore.class*, or zkRMStateStore.class directly. But in {{RMStateStore#isFencedState()}} the locking order is reversed, i.e. *zkRMStateStore.class -> StateMachine.doTransition()*.

{noformat}
Found one Java-level deadlock:
=============================
"org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread":
  waiting to lock monitor 0x0000000000e55698 (object 0x00000000c0272cb0, a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine),
  which is held by "AsyncDispatcher event handler"
"AsyncDispatcher event handler":
  waiting to lock monitor 0x00000000013adcf8 (object 0x00000000c0272b10, a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore),
  which is held by "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread"

Java stack information for the threads listed above:
===================================================
"org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread":
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.getCurrentState(StateMachineFactory.java:442)
	- waiting to lock <0x00000000c0272cb0> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.isFencedState(RMStateStore.java:693)
	- locked <0x00000000c0272b10> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:1020)
"AsyncDispatcher event handler":
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000c0272b10> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1043)
	- locked <0x00000000c0272b10> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1070)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:906)
	- locked <0x00000000c0272b10> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:920)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:929)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:608)
	- locked <0x00000000c0272b10> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:146)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:131)
	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
	- locked <0x00000000c0272cb0> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:699)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:754)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:749)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.
{noformat}

> Deadlock in ZKRMStateStore
> --------------------------
>
>                 Key: YARN-2946
>                 URL: https://issues.apache.org/jira/browse/YARN-2946
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-YARN-2946.patch, 0002-YARN-2946.patch, TestYARN2946.java
>
>
> Found one deadlock in ZKRMStateStore.
> # In the initial stage, zkClient is null because of a ZooKeeper disconnected event.
> # While {{ZKRMStateStore#runWithCheck()}} is in wait(zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection (via either a SyncConnected or an Expired event), it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} from state machine transition events. This causes the deadlock in ZKRMStateStore.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
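To make the lock-ordering convention above concrete, here is a minimal standalone sketch (hypothetical class and method names, not the actual YARN code): two monitors stand in for the InternalStateMachine and ZKRMStateStore objects, and both threads take them in the single allowed order, stateMachine first, then store. The original {{isFencedState()}} path took store first, which is the inversion the thread dump shows; with the order made consistent as below, the two threads cannot deadlock and the program terminates.

```java
// Hypothetical sketch of the lock-ordering convention: every path must
// acquire the state-machine monitor before the store monitor.
public class LockOrderSketch {
    private final Object stateMachine = new Object(); // stands in for InternalStateMachine
    private final Object store = new Object();        // stands in for ZKRMStateStore.this

    // Dispatcher path: doTransition() -> store methods (the allowed order).
    void handleStoreEvent() {
        synchronized (stateMachine) {     // taken by doTransition()
            synchronized (store) {        // taken by storeApplicationStateInternal()
                // persist application state here
            }
        }
    }

    // Verify-status path, fixed: also stateMachine -> store, instead of the
    // reversed store -> stateMachine order that caused the deadlock.
    boolean isFenced() {
        synchronized (stateMachine) {
            synchronized (store) {
                return false;             // read the fenced state here
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockOrderSketch s = new LockOrderSketch();
        Thread dispatcher = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) s.handleStoreEvent();
        });
        Thread verifier = new Thread(() -> {
            for (int i = 0; i < 10_000; i++) s.isFenced();
        });
        dispatcher.start();
        verifier.start();
        dispatcher.join();
        verifier.join();
        System.out.println("completed without deadlock");
    }
}
```

With the original inverted order (swap the two synchronized blocks in isFenced()), the same two threads can each grab their first monitor and block forever on the other's, exactly as the jstack output reports.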