[
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519026#comment-15519026
]
Flavio Junqueira commented on ZOOKEEPER-2080:
---------------------------------------------
I suspect that {{connectOne}} needs to synchronize on self because it needs a
consistent view of {{self.getView()}} and {{self.getLastSeenQuorumVerifier()}}.
In fact, one interesting thing is that {{self.getView()}} is declared as:
{noformat}
public Map<Long,QuorumPeer.QuorumServer> getView() {
    return Collections.unmodifiableMap(getQuorumVerifier().getAllMembers());
}
{noformat}
So all we really need is {{self.getLastSeenQuorumVerifier()}}.
The root cause of all these deadlocks seems to be that we are trying to get a
consistent view of the ensemble and locking {{QuorumPeer}} to guarantee
consistency. The complex interdependencies across classes are making it
difficult to guarantee that we don't have deadlocks. My suggestion is that we
take a different approach. Each class that needs a consistent view of
{{self.getLastSeenQuorumVerifier()}} will implement a listener that caches the
new value locally, and {{QuorumPeer}} will broadcast changes to the quorum
verifier to all listeners. Broadcasting can be done under a lock to prevent
races with other operations inside {{QuorumPeer}}. I think that if we do
something like this, we will be avoiding the circular dependencies and fixing
the deadlocks. The change doesn't seem to be super complex, but I could be
wrong.
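To make the listener idea concrete, here is a rough sketch of what I have in
mind. None of these names exist in the code base: {{VerifierSnapshot}},
{{VerifierListener}}, {{PeerSketch}} and {{ConnectionManagerSketch}} are
hypothetical stand-ins for the quorum verifier, the new listener interface,
{{QuorumPeer}}, and the class that owns {{connectOne}}, respectively. It only
illustrates the shape of the change under those assumptions, not the actual
patch.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a quorum verifier; only a version is modeled.
final class VerifierSnapshot {
    final long version;
    VerifierSnapshot(long version) { this.version = version; }
}

// Hypothetical listener interface; not part of ZooKeeper.
interface VerifierListener {
    void onVerifierChanged(VerifierSnapshot newVerifier);
}

// Hypothetical stand-in for QuorumPeer.
final class PeerSketch {
    private final List<VerifierListener> listeners = new CopyOnWriteArrayList<>();
    private VerifierSnapshot lastSeen = new VerifierSnapshot(0);

    // Register a listener and hand it the current value right away,
    // so its cache is never empty after registration.
    synchronized void addListener(VerifierListener listener) {
        listeners.add(listener);
        listener.onVerifierChanged(lastSeen);
    }

    // Update and broadcast under the peer's lock so the broadcast does not
    // race with other operations on the peer.
    synchronized void setLastSeenQuorumVerifier(VerifierSnapshot verifier) {
        lastSeen = verifier;
        for (VerifierListener listener : listeners) {
            listener.onVerifierChanged(verifier);
        }
    }
}

// Hypothetical stand-in for the class that contains connectOne.
final class ConnectionManagerSketch implements VerifierListener {
    // Local cache of the last broadcast value.
    private volatile VerifierSnapshot cached;

    // Only stores the value; it must not call back into the peer,
    // otherwise the circular locking would be reintroduced.
    @Override
    public void onVerifierChanged(VerifierSnapshot newVerifier) {
        cached = newVerifier;
    }

    // Reads the cached value without synchronizing on the peer.
    long connectOne() {
        return cached.version;
    }
}
```

With something like this, {{connectOne}} reads its own cached copy and never
takes the lock on the peer, so the lock ordering between the two classes goes
in one direction only (peer to listener).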
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch,
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch,
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log,
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
> FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
> at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
> at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}