[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244098#comment-14244098 ]
Rohith commented on YARN-2946:
------------------------------

The above 2 deadlocks can be fixed directly by removing the *synchronized* keyword, which was not really required. But I see many other potential deadlocks that can easily appear in the following methods. All of the methods below do reverse locking, i.e. *zkRMStateStore.class -> StateMachine.doTransition()*, through {{RMStateStore#isFencedState()}}:
# Method {{RMStateStore#storeRMDTMasterKey()}}
# Method {{RMStateStore#removeRMDTMasterKey()}}
# Method {{RMStateStore#storeRMDelegationTokenAndSequenceNumber()}}
# Method {{RMStateStore#removeRMDelegationToken()}}
# Method {{RMStateStore#updateRMDelegationTokenAndSequenceNumber()}}
# Method {{ZKRMStateStore#storeOrUpdateAMRMTokenSecretManagerState()}}

So fixing only 1 or 2 deadlock flows does not really fix the other potential deadlock issues.

*I propose the following solutions to handle all these deadlock flows*

Option-1 :
# For all the above-mentioned methods causing deadlock, introduce a StateMachine in RMStateStore, similar to how application store is handled, so that all execution flows go from StateMachine -> zkRMStateStore.class.
# Along with the 1st, the StateMachine should be guarded with a Read-Write lock.

Option-2 :
# Fix the visible deadlocks, i.e. the 2 found in this jira, and do Option-1 in a separate improvement task.

I would like to handle all the deadlock flows in one umbrella jira. This is to ensure we do not miss any of these deadlock flows.

Please let me know your suggestions/thoughts.

> Deadlock in ZKRMStateStore
> --------------------------
>
>                 Key: YARN-2946
>                 URL: https://issues.apache.org/jira/browse/YARN-2946
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: 0001-YARN-2946.patch, 0002-YARN-2946.patch,
> TestYARN2946.java
>
>
> Found one deadlock in ZKRMStateStore.
> # In the initial stage, zkClient is null because of a zk disconnected event.
> # When ZKRMStateStore#runWithCheck() calls wait(zkSessionTimeout) for zkClient to
> re-establish the ZooKeeper connection, either via a SyncConnected or an Expired event,
> it is highly possible that any other thread can obtain the lock on
> {{ZKRMStateStore.this}} from state machine transition events. This causes a
> deadlock in ZKRMStateStore.
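
To illustrate the reverse locking discussed in the comment, here is a minimal, self-contained sketch. It is not code from Hadoop or from the attached patches. The class {{LockOrderSketch}}, the two {{ReentrantLock}} fields and the method names are hypothetical stand-ins for the {{ZKRMStateStore.this}} monitor and the lock guarding the state machine. A timed {{tryLock}} is used so that the sketch reports the conflict instead of hanging the way real {{synchronized}} code would.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
class LockOrderSketch {

    /** Runs two threads that each need both locks; returns how many obtained both. */
    static int run(boolean consistentOrder) throws InterruptedException {
        ReentrantLock storeLock = new ReentrantLock();        // stand-in for ZKRMStateStore.this
        ReentrantLock stateMachineLock = new ReentrantLock(); // stand-in for the state machine lock
        // Latches are used only in the reversed case, to force the overlap deterministically.
        CountDownLatch holdingFirst = new CountDownLatch(2);
        CountDownLatch attempted = new CountDownLatch(2);
        AtomicInteger gotBoth = new AtomicInteger();

        // Transition thread: always state machine -> store.
        Thread transition = new Thread(() -> takeBoth(stateMachineLock, storeLock,
                !consistentOrder, holdingFirst, attempted, gotBoth));
        // Store thread: store -> state machine (reverse order) unless consistentOrder is set.
        Thread storeOp = consistentOrder
                ? new Thread(() -> takeBoth(stateMachineLock, storeLock,
                        false, holdingFirst, attempted, gotBoth))
                : new Thread(() -> takeBoth(storeLock, stateMachineLock,
                        true, holdingFirst, attempted, gotBoth));

        transition.start();
        storeOp.start();
        transition.join();
        storeOp.join();
        return gotBoth.get();
    }

    static void takeBoth(ReentrantLock first, ReentrantLock second, boolean rendezvous,
                         CountDownLatch holdingFirst, CountDownLatch attempted,
                         AtomicInteger gotBoth) {
        first.lock();
        try {
            if (rendezvous) {            // wait until both threads hold their first lock
                holdingFirst.countDown();
                holdingFirst.await();
            }
            // Timed attempt: a plain lock() or synchronized block would block forever here.
            if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                try {
                    gotBoth.incrementAndGet();
                } finally {
                    second.unlock();
                }
            }
            if (rendezvous) {            // keep the first lock until both attempts are over
                attempted.countDown();
                attempted.await();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("reverse order, threads that got both locks: " + run(false));
        System.out.println("consistent order, threads that got both locks: " + run(true));
    }
}
```

With reverse ordering neither thread obtains its second lock, which corresponds to the deadlock flows listed above. When both threads take the state machine lock first, as Option-1 suggests by routing every flow StateMachine -> store, both complete.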