[
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396091#comment-15396091
]
Michael Han commented on ZOOKEEPER-2080:
----------------------------------------
Hi Alex, thanks for the review :)
bq. do you think that the creation of a new election object won't be interfered
if the old object shutdown/GC hasn't happened yet
The new leader election object and the old one do not share object state:
each has its own QuorumCnxManager that manages the underlying TCP connections
used for leader election. In theory they could share the same socket address
(the election address), because I believe this address is statically derived
from the connection string rather than dynamically generated (like the
uniquePort utility we had in test), and this address seems to be the only
thing that different QuorumCnxManager instances share. So we might have two
QuorumCnxManagers, one from the old election object waiting to be shut down
and one from the new election object, that both try to bind to the same
address. I haven't found any issues related to this during my stress runs of
the unit tests (in particular the reconfig tests), and I think we could
address it with retry logic with exponential backoff when binding the socket
in QuorumCnxManager.
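To make the retry idea concrete, here is a minimal sketch of binding with
exponential backoff. It is not taken from QuorumCnxManager: the class name
BindWithBackoff, the method bindWithBackoff, and its parameters are all
hypothetical, and it only uses plain java.net calls.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical helper for illustration only; not part of ZooKeeper.
final class BindWithBackoff {
    private BindWithBackoff() {}

    /**
     * Tries to bind a listening socket to addr. If the address is still held
     * (for example by an old listener that has not finished shutting down),
     * sleeps and retries, doubling the delay after each failed attempt.
     */
    static ServerSocket bindWithBackoff(InetSocketAddress addr,
                                        int maxAttempts,
                                        long initialDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            ServerSocket ss = new ServerSocket(); // created unbound
            try {
                ss.bind(addr);
                return ss;
            } catch (BindException e) {
                ss.close();
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            }
        }
    }
}
```

With maxAttempts = 6 and initialDelayMs = 50, the caller waits at most
50 + 100 + 200 + 400 + 800 ms before the final attempt, which bounds how long
a new listener would wait for the old one to release the election address.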
bq. any way to test this using a unit test
I don't have any concrete ideas for this. My thinking is that we could expose
some options from the related classes under test so we can artificially
inject faults, create race conditions, and control timing. For example, we
could delay the shutdown of the old leader election object and see what
happens. As a simple test, I removed the statement completely and 5 out of 6
ReconfigRecoveryTest tests failed, which is expected because that statement
is not supposed to be removed, so instead of removing it we could add a delay
and make sure everything still works.
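A rough sketch of what such a test-only knob could look like follows. This is
not QuorumPeer code: ElectionSwitcher, the Election interface, and
setShutdownDelayForTest are hypothetical names used only to show the shape of
the idea, namely that the new object is installed first and the old one is
shut down after an injected delay.

```java
// Hypothetical illustration only; none of these names exist in ZooKeeper.
final class ElectionSwitcher {
    /** Stand-in for a leader election object that can be shut down. */
    interface Election {
        void shutdown();
    }

    private volatile long shutdownDelayMs = 0; // test-only fault-injection knob
    private Election current;

    ElectionSwitcher(Election initial) {
        this.current = initial;
    }

    /** Tests call this to widen the window in which old and new coexist. */
    void setShutdownDelayForTest(long delayMs) {
        this.shutdownDelayMs = delayMs;
    }

    synchronized Election current() {
        return current;
    }

    /**
     * Installs next as the current election object, then shuts the old one
     * down on a separate thread after the configured delay. The thread is
     * returned so a test can join on it.
     */
    synchronized Thread replace(Election next) {
        final Election old = current;
        final long delayMs = shutdownDelayMs;
        current = next;
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            old.shutdown(); // still shut down the old object if interrupted
        }, "old-election-shutdown");
        t.start();
        return t;
    }
}
```

A test would set a delay of a few hundred milliseconds, call replace, run the
reconfig scenario while both objects are alive, and then join the returned
thread before asserting on the final state.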
> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
> Issue Type: Sub-task
> Reporter: Ted Yu
> Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch,
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log,
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
> FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
> at org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
> at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}