[
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740513#comment-14740513
]
Akihiro Suda commented on ZOOKEEPER-2080:
-----------------------------------------
Looking at JaCoCo reports, I also noticed that
[{{QCM.SendWorker#finish()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L352-L360]
(and hence {{QCM.RecvWorker#finish()}}) in {{QCM#receiveConnection()}} ({{sid
< self.getId()}}) is called only on failed experiments.
When I comment out this call, the bug becomes hard to reproduce.
So I believe that the bug is caused by *a race condition between TCP packet
arrivals and {{SendWorker}}/{{RecvWorker}} lifecycles*.
Especially, the socket handling in
[{{QCM.RecvWorker#run}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L893-L926]
is *very suspicious*, as it can be neither interrupted nor timed out.
(Should use {{java.nio.channels.SocketChannel}} rather than plain old
{{java.net.Socket}}.)
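
To illustrate the interruptibility point, here is a minimal standalone sketch (not ZooKeeper code; the class and method names are hypothetical). It assumes a loopback connection on which the peer never writes, blocks a thread in {{SocketChannel#read}}, and then interrupts that thread. With a channel, the interrupt closes the channel and the blocked read ends with {{ClosedByInterruptException}}.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class InterruptibleReadSketch {

    /** Returns the name of the exception that ended the blocked read, or "none". */
    public static String interruptBlockedRead() throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Bind to an ephemeral loopback port (assumption: loopback is available).
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress addr = (InetSocketAddress) server.getLocalAddress();

            try (SocketChannel client = SocketChannel.open(addr);
                 SocketChannel accepted = server.accept()) {
                AtomicReference<String> outcome = new AtomicReference<>("none");
                CountDownLatch started = new CountDownLatch(1);

                Thread reader = new Thread(() -> {
                    ByteBuffer buf = ByteBuffer.allocate(4);
                    started.countDown();
                    try {
                        accepted.read(buf); // blocks: the client never writes
                        outcome.set("returned");
                    } catch (ClosedByInterruptException e) {
                        outcome.set("ClosedByInterruptException");
                    } catch (IOException e) {
                        outcome.set(e.getClass().getSimpleName());
                    }
                });

                reader.start();
                started.await();
                Thread.sleep(200);   // give the reader time to enter read()
                reader.interrupt();  // closes the channel and wakes the reader
                reader.join(5000);
                return outcome.get();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(interruptBlockedRead());
    }
}
```

The sketch only shows the channel behaviour; it does not reproduce the race itself.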
Note that the bug also becomes hard to reproduce when I comment out
[{{Socket#setTcpNoDelay(true)}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L566]
in {{QCM#setSockOpts()}} (as I reported on Aug 14), or use
{{BufferedOutputStream}} instead of {{DataOutputStream}} in
[{{QCM.SendWorker()}}|https://github.com/apache/zookeeper/blob/df7d56d25d38f872b5793af365ef732c4478eb1d/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L719].
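
One possible reason the stream type matters is the number of separate writes that reach the socket per message. The sketch below is hypothetical and not ZooKeeper code: {{CountingStream}} stands in for the socket's output stream, and {{sendFrame}} assumes a length-prefixed message written as {{writeInt}} followed by {{write}} and {{flush}}. Without buffering, the length and the payload reach the underlying stream as separate writes, which with {{TCP_NODELAY}} enabled may leave the host as separate TCP segments. With {{BufferedOutputStream}} in between they are coalesced into one write at {{flush()}}.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class WriteCountSketch {

    /** Hypothetical stand-in for a socket's output stream; counts write calls. */
    static final class CountingStream extends OutputStream {
        int writes = 0;

        @Override
        public void write(int b) {
            writes++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writes++;
        }
    }

    /** Assumed framing: 4-byte length prefix, then the payload, then flush. */
    static void sendFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
    }

    /** Write calls reaching the sink when DataOutputStream wraps it directly. */
    public static int unbufferedWrites(byte[] payload) throws IOException {
        CountingStream sink = new CountingStream();
        sendFrame(new DataOutputStream(sink), payload);
        return sink.writes;
    }

    /** Write calls reaching the sink with a BufferedOutputStream in between. */
    public static int bufferedWrites(byte[] payload) throws IOException {
        CountingStream sink = new CountingStream();
        sendFrame(new DataOutputStream(new BufferedOutputStream(sink)), payload);
        return sink.writes;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[16];
        System.out.println("unbuffered writes: " + unbufferedWrites(payload));
        System.out.println("buffered writes:   " + bufferedWrites(payload));
    }
}
```

The unbuffered count depends on the JDK version, since {{DataOutputStream#writeInt}} may emit the four bytes individually or as one array. For a small payload the buffered count is 1.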
[~shralex], can I have your opinion on this?
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Raul Gutierrez Segales
> Priority: Minor
> Attachments: jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z,
> repro-20150816.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
> FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
> at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
> at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}