[
https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244142#comment-14244142
]
Varun Saxena commented on YARN-2946:
[~rohithsharma], good catch. As you said, {{synchronized}} can be removed from
{{RMStateStore#isFencedState()}}.
Regarding the methods you listed, I think we can avoid reverse locking, i.e.
*ZKRMStateStore.class - StateMachine.doTransition()*, by making the following
changes.
# Remove the {{synchronized}} keyword from each of the methods listed above.
# Separate state machine synchronization (triggered by calls to
{{isFencedState}} and {{notifyStoreOperationFailed}}) from ZKRMStateStore
synchronization by putting only the store-specific code in a synchronized block.
For example, the code for {{RMStateStore#storeRMDTMasterKey()}} can be changed as
follows:
{code:title=RMStateStore.java|borderStyle=solid}
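public void storeRMDTMasterKey(DelegationKey delegationKey) {
  // The fenced-state check takes the state machine lock, so do it
  // before (and outside) the store lock.
  if (isFencedState()) {
    LOG.info("State store is in Fenced state. Can't store RM Delegation " +
        "Token Master key.");
    return;
  }
  try {
    // Hold the store lock only around the actual store operation.
    synchronized (this) {
      storeRMDTMasterKeyState(delegationKey);
    }
  } catch (Exception e) {
    // Called after the store lock is released, so the state machine
    // lock is never requested while holding the store lock.
    notifyStoreOperationFailed(e);
  }
}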
{code}
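To illustrate why narrowing the synchronized block helps, here is a minimal self-contained sketch. It is not the actual RMStateStore code: {{NarrowLockSketch}}, {{stateMachineLock}}, {{storeKey}} and {{transition}} are hypothetical stand-ins, with a plain monitor playing the role of the state machine lock. The store path never holds the store lock while asking for the state machine lock, so it cannot form a lock cycle with a transition that takes the two locks in the opposite order.

```java
public class NarrowLockSketch {
    // Hypothetical stand-in for the lock taken inside StateMachine.doTransition().
    private final Object stateMachineLock = new Object();
    private boolean fenced = false;
    private int storedKeys = 0;

    public boolean isFencedState() {
        synchronized (stateMachineLock) {
            return fenced;
        }
    }

    public void notifyStoreOperationFailed(Exception e) {
        synchronized (stateMachineLock) {
            fenced = true;
        }
    }

    // Store path: each lock is taken and released on its own, never nested.
    public void storeKey(boolean simulateFailure) {
        if (isFencedState()) {
            return;
        }
        try {
            synchronized (this) {
                if (simulateFailure) {
                    throw new IllegalStateException("simulated store failure");
                }
                storedKeys++;
            }
        } catch (Exception e) {
            notifyStoreOperationFailed(e);
        }
    }

    // Transition path: state machine lock first, then the store lock.
    public void transition() {
        synchronized (stateMachineLock) {
            synchronized (this) {
                // A transition that touches store state would run here.
            }
        }
    }

    public synchronized int storedKeys() {
        return storedKeys;
    }

    // Runs both paths concurrently; returns true if both threads finish in time.
    public static boolean runConcurrently(int iterations) {
        NarrowLockSketch store = new NarrowLockSketch();
        Thread storeThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                store.storeKey(false);
            }
        });
        Thread transitionThread = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                store.transition();
            }
        });
        storeThread.setDaemon(true);
        transitionThread.setDaemon(true);
        storeThread.start();
        transitionThread.start();
        try {
            storeThread.join(5000);
            transitionThread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !storeThread.isAlive() && !transitionThread.isAlive()
            && store.storedKeys() == iterations;
    }
}
```

If {{storeKey}} were declared {{synchronized}} as a whole, the failure path would request the state machine lock while still holding the store lock, which is the inverted order relative to {{transition()}}.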
DeadLock's in RMStateStore-ZKRMStateStore
---
Key: YARN-2946
URL: https://issues.apache.org/jira/browse/YARN-2946
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.7.0
Reporter: Rohith
Assignee: Rohith
Priority: Blocker
Attachments: 0001-YARN-2946.patch, 0002-YARN-2946.patch,
TestYARN2946.java
Found one deadlock in ZKRMStateStore.
# Initially, zkClient is null because of a ZooKeeper disconnected event.
# While {{ZKRMStateStore#runWithCheck()}} is in wait(zkSessionTimeout) for zkClient to
re-establish the ZooKeeper connection via either a SyncConnected or an Expired event,
it is quite possible for another thread to obtain the lock on
{{ZKRMStateStore.this}} from state machine transition events. This causes a
deadlock in ZKRMStateStore.
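The general shape of such a deadlock (two threads taking the same two locks in opposite order) can be reproduced in isolation. The sketch below is only an illustration under assumptions: {{LockInversionDemo}}, {{storeLock}} and {{stateMachineLock}} are hypothetical names and plain monitors, not the real ZKRMStateStore or state machine objects. It uses the JDK's {{ThreadMXBean#findMonitorDeadlockedThreads()}} to confirm that the threads are stuck.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockInversionDemo {

    // Takes 'first', waits until the other thread holds its first lock too,
    // then blocks trying to take 'second'.
    private static void holdThenTake(Object first, Object second, CountDownLatch bothHold) {
        synchronized (first) {
            bothHold.countDown();
            try {
                bothHold.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached once both threads hold their first lock.
            }
        }
    }

    // Returns true if the JVM reports a monitor deadlock within about 5 seconds.
    public static boolean reproduceDeadlock() {
        final Object storeLock = new Object();        // stand-in for ZKRMStateStore.this
        final Object stateMachineLock = new Object(); // stand-in for the state machine lock
        final CountDownLatch bothHold = new CountDownLatch(2);

        Thread storeThread = new Thread(
            () -> holdThenTake(storeLock, stateMachineLock, bothHold), "store-op");
        Thread transitionThread = new Thread(
            () -> holdThenTake(stateMachineLock, storeLock, bothHold), "transition");
        // Daemon threads so the deadlocked pair does not keep the JVM alive.
        storeThread.setDaemon(true);
        transitionThread.setDaemon(true);
        storeThread.start();
        transitionThread.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 50; attempt++) {
            long[] deadlocked = threads.findMonitorDeadlockedThreads();
            if (deadlocked != null && deadlocked.length >= 2) {
                return true;
            }
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceDeadlock());
    }
}
```

A thread dump (for example via jstack) of a hung ResourceManager would be the usual way to confirm which two locks are actually involved.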