[
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311703#comment-15311703
]
Michael Han commented on ZOOKEEPER-2152:
----------------------------------------
Hi Alex,
bq. 1) AFAIU the state of the servers shouldn't matter for this test, as this
is the client-side library.
Correct, the state of the servers does not matter here. In fact, no server is
running at all for the C client reconfiguration test: the server lists used in
the tests are nonexistent, artificially created. My understanding is that
TestReconfig.cc was designed to cover the pure client-side reconfig logic
implemented in the ZK C client, so having no real server seems fine.
bq. 2) I assumed each client has independent state from other clients,
Correct, each client (class Client, implemented in TestReconfig.cc) has
independent state.
bq. but you're indicating that this isn't true. Why is that ? Are multiple
clients sharing the same state ?
No, multiple clients do not share the same state, IIUC. What I was indicating
is that, for a given client, its state (independent of other clients) is
controlled both by the test logic and by the ZK client IO thread, through calls
into the zoo_cycle_next_server function implemented in zookeeper.c. This is a
problem because the test cases are built on top of the assumption that the
state of the client (e.g. the currently connected server) is controlled
exclusively by the test logic itself. Interference from the IO thread breaks
that assumption and makes the tests fail. Let's walk through a concrete test
case in testMigrateOrNot():
{code}
// Here we create a list of servers.
const string initial_hosts = createHostList(4); // 2004..2001
// Explicitly specify that client should connect to server 10.10.10.3.
// All the following test cases are built on top of this assumption.
// Unfortunately, this is not always true, because the ZK C client IO thread
// could change the state of the client as well (through zoo_cycle_next_server).
Client &client = createClient(initial_hosts, "10.10.10.3");
// At this point, the 'currently connected server' of this client could be
// 10.10.10.3, or could be 10.10.10.4, or something else.
// If it's 10.10.10.4, then the following test will fail, because changing the
// ensemble from 10.10.10.4 to {10.10.10.3, 10.10.10.2, 10.10.10.1} will trigger
// a reconfiguration (the second parameter should thus be true instead of false).
// Ensemble size decreasing, my server is in the new list
client.setServersAndVerifyReconfig(createHostList(3), false);
{code}
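To make the two outcomes in the walk-through explicit, here is a minimal sketch of the single rule it relies on: a client whose current server is missing from the new ensemble has to reconfigure. The names mustReconfig and shrunkEnsemble are hypothetical and do not exist in zookeeper.c or TestReconfig.cc, ports are omitted, and the sketch does not attempt to model any other case of the real client logic.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of zookeeper.c or TestReconfig.cc.
// Models only the rule used in the walk-through: if the currently connected
// server is absent from the new ensemble, the client must reconfigure.
// A present server is reported as "no reconfig", which matches the
// shrinking-ensemble case above and nothing more.
bool mustReconfig(const std::string &current,
                  const std::vector<std::string> &newEnsemble) {
    assert(!newEnsemble.empty());  // an empty ensemble is outside this sketch
    return std::find(newEnsemble.begin(), newEnsemble.end(), current) ==
           newEnsemble.end();
}

// The three-server ensemble from the walk-through (ports omitted).
std::vector<std::string> shrunkEnsemble() {
    return {"10.10.10.3", "10.10.10.2", "10.10.10.1"};
}
```

Calling mustReconfig("10.10.10.3", shrunkEnsemble()) gives false, which is what the test expects, while mustReconfig("10.10.10.4", shrunkEnsemble()) gives true, which is the case the test hits when the IO thread has already moved the client.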
bq. Does the Java client library have the same issue ? The C test should be
more or less a copy of the Java one. Or why isn't the state sharing happening
there ?
I haven't dug deep into the Java side of the reconfig client, so here is my
current understanding: there is no state sharing between clients in the C
client, as previously described, so we are good on this one with the Java
client. In terms of ReconfigTest logic, one difference is that the Java
ReconfigTest sets up QuorumPeer, while the C client test does not have any real
or simulated server entities. The other difference is that the ZK C client has
a dedicated IO thread that interferes with the TestReconfig code through
zoo_cycle_next_server, and I don't spot anything like that in the Java
ReconfigTest or the Java client.
In summary, the problem here is a data race caused by two threads changing the
state of the same object (the currently connected server of a specific ZK client).
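The interleaving behind this race can be replayed deterministically in a small single-threaded model. This is only a sketch under stated assumptions: ClientModel, pinServer, ioThreadStep and observedServer are hypothetical names, the host order is assumed, and the effect of zoo_cycle_next_server is simplified to "advance to the next entry in the list", which may not match the real function in every detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one client's connection state.
struct ClientModel {
    std::vector<std::string> servers;
    std::size_t current = 0;  // index of the "currently connected" server
};

// Test-side write: pin the client to a chosen server.
void pinServer(ClientModel &c, const std::string &host) {
    for (std::size_t i = 0; i < c.servers.size(); ++i) {
        if (c.servers[i] == host) {
            c.current = i;
            return;
        }
    }
    assert(false && "host must be in the server list");
}

// IO-thread-side write: simplified stand-in for zoo_cycle_next_server,
// assumed here to advance to the next entry in the list.
void ioThreadStep(ClientModel &c) {
    assert(!c.servers.empty());
    c.current = (c.current + 1) % c.servers.size();
}

// Replays one interleaving and returns the server the test then observes.
std::string observedServer(bool ioStepRunsAfterPin) {
    ClientModel c;
    // Assumed host order; ports omitted.
    c.servers = {"10.10.10.4", "10.10.10.3", "10.10.10.2", "10.10.10.1"};
    pinServer(c, "10.10.10.3");               // what the test intends
    if (ioStepRunsAfterPin) ioThreadStep(c);  // what the IO thread may do
    return c.servers[c.current];
}
```

With ioStepRunsAfterPin set to false the test sees 10.10.10.3 as it assumes. With it set to true the test sees a different server even though the test code itself is unchanged, which is consistent with the failure being intermittent.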
> Intermittent failure in TestReconfig.cc
> ---------------------------------------
>
> Key: ZOOKEEPER-2152
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
> Project: ZooKeeper
> Issue Type: Sub-task
> Components: c client
> Reporter: Michi Mutsuzaki
> Assignee: Michael Han
> Labels: reconfiguration
> Fix For: 3.6.0
>
> Attachments: ZOOKEEPER-2152.patch
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec]
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
> Assertion: assertion failed [Expression: found != string::npos,
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console