[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

Michael Han (JIRA) Wed, 27 Jul 2016 11:16:35 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15396091#comment-15396091
 ]


Michael Han commented on ZOOKEEPER-2080:
----------------------------------------

Hi Alex, thanks for the review :)
bq. do you think that the creation of a new election object won't be interfered 
if the old object shutdown/GC hasn't happened yet 

The new leader election object and the old leader election object do not 
share object state: each object has its own QuorumCnxManager that manages the 
underlying TCP connections used for leader election. They could in theory 
share the same socket address (election address), because I believe this 
address is statically generated from the connection string instead of 
dynamically generated (like the uniquePort utility used in tests), and this 
address seems to be the only thing that different QuorumCnxManager instances 
share. So in theory we might have two QuorumCnxManagers, one from the old 
election object waiting to be shut down and the other from the new election 
object, that both try to bind to the same address. I haven't found any issues 
related to this during my stress runs of the unit tests (in particular the 
reconfig tests), and I think we could address it by adding retry logic with 
exponential backoff when binding to the socket in QuorumCnxManager.
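
To make the retry idea concrete, here is a minimal sketch of binding with 
exponential backoff. It is not taken from QuorumCnxManager or from the patch: 
the class name BindWithBackoff and the parameters maxAttempts and 
initialDelayMs are hypothetical, and only the standard java.net API is used.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical helper (not ZooKeeper code): bind a listener, retrying with backoff. */
final class BindWithBackoff {
    private BindWithBackoff() {}

    /**
     * Tries to bind to addr up to maxAttempts times. The wait between attempts
     * starts at initialDelayMs and doubles after every failed attempt.
     */
    static ServerSocket bind(InetSocketAddress addr, int maxAttempts, long initialDelayMs)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            ServerSocket socket = new ServerSocket(); // created unbound
            try {
                socket.bind(addr);
                return socket;
            } catch (BindException e) {
                // Address still held, e.g. by a listener that has not shut down yet.
                socket.close();
                if (attempt == maxAttempts) {
                    throw e; // out of attempts: surface the original failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // exponential backoff
            } catch (IOException e) {
                // Any other I/O failure is not retried.
                socket.close();
                throw e;
            }
        }
    }
}
```

With, say, 5 attempts and a 100 ms initial delay, a caller would wait at most 
100 + 200 + 400 + 800 ms for the old listener to release the address before 
giving up. Whether those numbers suit leader election is an open question.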

bq. any way to test this using a unit test

I don't have any concrete ideas around this. My thinking is that we could 
expose some options from the related classes under test so that we can 
artificially inject faults, create race conditions, and control timing. For 
example, we could delay the shutdown of the old leader election object and see 
what happens. As a simple test, I removed the statement completely and 5 out 
of 6 ReconfigRecoveryTest tests failed, which is expected because that 
statement is not supposed to be removed. So maybe instead of removing it we 
can add a delay and make sure everything still works.
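
One rough way to picture the delayed shutdown is a test-only wrapper that 
sleeps before forwarding the call. This is only a sketch under assumptions: 
Stoppable and DelayedShutdown are hypothetical names and do not exist in 
ZooKeeper, and how such a wrapper would be wired into the quorum peer is left 
open.

```java
/** Hypothetical stand-in for whatever exposes shutdown() on the election object. */
interface Stoppable {
    void shutdown();
}

/** Test-only wrapper (not ZooKeeper code) that delays the real shutdown. */
final class DelayedShutdown implements Stoppable {
    private final Stoppable delegate;
    private final long delayMs;

    DelayedShutdown(Stoppable delegate, long delayMs) {
        this.delegate = delegate;
        this.delayMs = delayMs;
    }

    @Override
    public void shutdown() {
        try {
            // Injected fault: keep the old object alive a little longer.
            Thread.sleep(delayMs);
        } catch (InterruptedException e) {
            // Preserve the interrupt flag and fall through to the real shutdown.
            Thread.currentThread().interrupt();
        }
        delegate.shutdown();
    }
}
```

A test could wrap the old election object with a delay of a few hundred 
milliseconds and then assert that the servers still come up.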




> ReconfigRecoveryTest fails intermittently
> -----------------------------------------
>
>                 Key: ZOOKEEPER-2080
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>             Project: ZooKeeper
>          Issue Type: Sub-task
>            Reporter: Ted Yu
>            Assignee: Michael Han
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, repro-20150816.log, 
> threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
