[ https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311703#comment-15311703 ]
Michael Han commented on ZOOKEEPER-2152:
----------------------------------------

Hi Alex,

bq. 1) AFAIU the state of the servers shouldn't matter for this test, as this is the client-side library.

Correct, the state of the servers does not matter here. In fact, no server is running at all for the C client reconfiguration test: the server list used in the tests is artificial and refers to nonexistent servers. My understanding is that TestReconfig.cc was designed to cover the pure client-side reconfig logic implemented in the ZK C client, so having no real server seems fine.

bq. 2) I assumed each client has independent state from other clients,

Correct, each client (class Client, implemented in TestReconfig.cc) has independent state.

bq. but you're indicating that this isn't true. Why is that ? Are multiple clients sharing the same state ?

No, multiple clients do not share the same state, IIUC. What I was indicating is that, for a given client, its state (independent of other clients) is controlled both by the test logic and by the ZK client IO thread, which calls into the zoo_cycle_next_server function implemented in zookeeper.c. This is a problem because the test cases are built on top of the assumption that the state of the client (e.g. the currently connected server) is determined exclusively by the test logic itself. Interference from the IO thread breaks that assumption and makes the tests fail.

Let's walk through a concrete test case in testMigrateOrNot():

{code}
// Here we create a list of servers.
const string initial_hosts = createHostList(4); // 2004..2001

// Explicitly specify that the client should connect to server 10.10.10.3.
// All the following test cases are built on top of this assumption.
// Unfortunately, this is not always true, because the ZK C client IO thread
// could change the state of the client as well (through zoo_cycle_next_server).
Client &client = createClient(initial_hosts, "10.10.10.3");

// At this point, the 'currently connected server' of this client could be
// 10.10.10.3, or could be 10.10.10.4, or something else.
// If it is 10.10.10.4, the following test will fail, because changing the
// ensemble from 10.10.10.4 to {10.10.10.3, 10.10.10.2, 10.10.10.1} will
// trigger a reconfiguration (the second parameter should thus be true,
// instead of false).

// Ensemble size decreasing, my server is in the new list
client.setServersAndVerifyReconfig(createHostList(3), false);
{code}

bq. Does the Java client library have the same issue ? The C test should be more or less a copy of the Java one. Or why isn't the state sharing happening there ?

I haven't dug deep into the Java side of the reconfig client, so here is my current understanding. As described above, there is no state sharing between clients in the C client, so there is no difference from the Java client on that point. In terms of test logic, one difference is that the Java ReconfigTest sets up QuorumPeer, while the C client test does not have any real or simulated server entities. The other difference is that the ZK C client has a dedicated IO thread that interferes with the TestReconfig code through zoo_cycle_next_server, and I don't spot anything like that in the Java ReconfigTest or the Java client.

In summary, the problem here is a data race caused by two threads changing the state of the same object (the currently connected server of a specific ZK client).
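
To make the interference concrete, here is a minimal sketch of the scenario. Everything in it is hypothetical: ToyClient, makeHostList and shrinkTriggersMigration are illustrative names, not the real Client class from TestReconfig.cc or anything in zookeeper.c. It assumes a simplified rule (the client must migrate exactly when its current server is absent from the new list), and it models the IO thread as explicit calls to cycleNextServer() rather than a real thread, so the outcome is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one client's connection state.
// This is NOT the real zhandle_t or the Client class from TestReconfig.cc.
class ToyClient {
public:
    ToyClient(std::vector<std::string> hosts, const std::string& current)
        : hosts_(std::move(hosts)), index_(0) {
        auto it = std::find(hosts_.begin(), hosts_.end(), current);
        if (it != hosts_.end()) {
            index_ = static_cast<std::size_t>(it - hosts_.begin());
        }
    }

    // Plays the role that zoo_cycle_next_server plays in the comment above:
    // it moves the client to the next host, outside the test's control.
    void cycleNextServer() { index_ = (index_ + 1) % hosts_.size(); }

    const std::string& currentServer() const { return hosts_[index_]; }

    // Simplified assumption: migration is needed exactly when the current
    // server does not appear in the new ensemble.
    bool mustMigrate(const std::vector<std::string>& newHosts) const {
        return std::find(newHosts.begin(), newHosts.end(), currentServer())
               == newHosts.end();
    }

private:
    std::vector<std::string> hosts_;
    std::size_t index_;
};

// Hypothetical helper: builds 10.10.10.n:200n for n = count..1.
std::vector<std::string> makeHostList(int count) {
    std::vector<std::string> hosts;
    for (int i = count; i >= 1; --i) {
        hosts.push_back("10.10.10." + std::to_string(i) + ":" +
                        std::to_string(2000 + i));
    }
    return hosts;
}

// Runs the shrink-from-4-to-3 scenario after `ioCycles` simulated
// IO-thread cycles and reports whether the client would have to migrate.
bool shrinkTriggersMigration(int ioCycles) {
    ToyClient client(makeHostList(4), "10.10.10.3:2003");
    for (int i = 0; i < ioCycles; ++i) {
        client.cycleNextServer();  // interference the test did not ask for
    }
    return client.mustMigrate(makeHostList(3));
}
```

In this toy model, shrinkTriggersMigration(0) matches the test's expectation of no migration, while enough unrequested cycles to land on 10.10.10.4 flips the answer. That is the kind of flip that would make an assertion expecting false fail intermittently.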

> Intermittent failure in TestReconfig.cc
> ---------------------------------------
>
>                 Key: ZOOKEEPER-2152
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
>             Project: ZooKeeper
>          Issue Type: Sub-task
>          Components: c client
>            Reporter: Michi Mutsuzaki
>            Assignee: Michael Han
>              Labels: reconfiguration
>             Fix For: 3.6.0
>
>         Attachments: ZOOKEEPER-2152.patch
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474: Assertion: assertion failed [Expression: found != string::npos, 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)