Michael Han created ZOOKEEPER-2778:
--------------------------------------
Summary: Potential server deadlock between follower sync with
leader and follower receiving external connection requests.
Key: ZOOKEEPER-2778
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2778
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.5.3
Reporter: Michael Han
Assignee: Michael Han
Priority: Critical
It's possible to have a deadlock during recovery phase.
Found this issue by analyzing thread dumps of "flaky" ReconfigRecoveryTest. .
Here is a sample thread dump that illustrates the state of the execution:
{noformat}
[junit] java.lang.Thread.State: BLOCKED
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.getElectionAddress(QuorumPeer.java:686)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.initiateConnection(QuorumCnxManager.java:265)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:445)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.receiveConnection(QuorumCnxManager.java:369)
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:642)
[junit]
[junit] java.lang.Thread.State: BLOCKED
[junit] at
org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:472)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.connectNewPeers(QuorumPeer.java:1438)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.setLastSeenQuorumVerifier(QuorumPeer.java:1471)
[junit] at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:520)
[junit] at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:88)
[junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
{noformat}
The dead lock happens between the quorum peer thread which running the follower
that doing sync with leader work, and the listener of the qcm of the same
quorum peer that doing the receiving connection work. Basically to finish sync
with leader, the follower needs to synchronize on both QV_LOCK and the qmc
object it owns; while in the receiver thread to finish setup an incoming
connection the thread needs to synchronize on both the qcm object the quorum
peer owns, and the same QV_LOCK. It's easy to see the problem here is the order
of acquiring two locks are different, thus depends on timing / actual execution
order, two threads might end up acquiring one lock while holding another.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)