[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311703#comment-15311703
 ] 

Michael Han commented on ZOOKEEPER-2152:
----------------------------------------

Hi Alex,

bq. 1) AFAIU the state of the servers shouldn't matter for this test, as this 
is the client-side library.
Correct, the state of the servers does not matter here. In fact, no server is 
running at all for the C client reconfiguration test: the server list used in 
the tests is artificial, and the servers in it do not exist. My understanding is 
that TestReconfig.cc was designed to cover the pure client-side reconfig logic 
implemented in the ZK C client, so having no real server seems fine.

bq. 2) I assumed each client has independent state from other clients,
Correct, each client (class Client, implemented in TestReconfig.cc) has 
independent state.

bq. but you're indicating that this isn't true. Why is that ? Are multiple 
clients sharing the same state ?
No, multiple clients do not share the same state, IIUC. What I was indicating 
is that for a given client, its state (independent from other clients) is 
controlled both by the test logic and by the ZK client IO thread, which calls 
into the zoo_cycle_next_server function implemented in zookeeper.c. This is a 
problem because the test cases are built on top of the assumption that the 
state of the client (e.g. the currently connected server) is exclusively 
controlled / determined by the test logic itself. Interference from the IO 
thread breaks that assumption and makes the tests fail. Let's walk through a 
concrete test case in testMigrateOrNot():
{code}
// Here we create a list of servers. 
const string initial_hosts = createHostList(4); // 2004..2001

// Explicitly specify that client should connect to server 10.10.10.3. 
// All the following test cases are built on top of this assumption. 
// Unfortunately, this is not always true, because the ZK C client IO thread 
// could change the state of the client as well (through zoo_cycle_next_server).
Client &client = createClient(initial_hosts, "10.10.10.3");

// At this point, the 'currently connected server' of this client could be 
// 10.10.10.3, or could be 10.10.10.4, or something else.
// If it's 10.10.10.4, then the following test will fail, because changing the 
// ensemble from 10.10.10.4 to {10.10.10.3, 10.10.10.2, 10.10.10.1} will trigger 
// a reconfiguration (the second parameter should thus be true instead of false).

// Ensemble size decreasing, my server is in the new list
client.setServersAndVerifyReconfig(createHostList(3), false);
{code}
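
To make the dependence on the connected server concrete, here is a minimal, self-contained sketch. It is not the ZK C client code: mustMigrate and makeHostList are hypothetical helpers I made up for illustration, and the model covers only one rule, namely that a client has to move when its current server is absent from the new list. Any load-balancing logic in the real client is left out.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for createHostList(n) in TestReconfig.cc.
// Assumption: hosts are listed from 10.10.10.n:200n down to 10.10.10.1:2001.
static std::vector<std::string> makeHostList(int n) {
    std::vector<std::string> hosts;
    for (int i = n; i >= 1; --i) {
        hosts.push_back("10.10.10." + std::to_string(i) + ":" +
                        std::to_string(2000 + i));
    }
    return hosts;
}

// Hypothetical helper, not part of zookeeper.c.
// Simplified rule: the client must migrate if the server it is currently
// connected to does not appear in the new server list.
static bool mustMigrate(const std::string &current,
                        const std::vector<std::string> &newServers) {
    return std::find(newServers.begin(), newServers.end(), current) ==
           newServers.end();
}
```

Under this model, mustMigrate("10.10.10.3:2003", makeHostList(3)) is false, which is what the test expects, while mustMigrate("10.10.10.4:2004", makeHostList(3)) is true. The expected value in the test is therefore only correct if the client is still on the server the test selected.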

bq. Does the Java client library have the same issue ? The C test should be 
more or less a copy of the Java one. Or why isn't the state sharing happening 
there ?
I haven't dug deep into the Java side of the reconfig client, so here is my 
current understanding: there is no state sharing between clients in the C 
client, as previously described, so the Java client is fine on this point as 
well. In terms of ReconfigTest logic, one difference is that the Java 
ReconfigTest sets up QuorumPeer instances, while the C client test does not 
have any real or simulated server entities. The other difference is that the 
ZK C client has a dedicated IO thread that interferes with the TestReconfig 
code through zoo_cycle_next_server, and I don't see anything like that in the 
Java ReconfigTest or the Java client. 

In summary, the problem here is a data race caused by two threads changing the 
state of the same object (the currently connected server of a specific ZK 
client). 
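
The two-writer situation can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual client: ModelClient, pinTo, cycleNext and observeAfterIoThread are hypothetical names, the real state lives inside the handle in zookeeper.c, and "advance to the next entry in the list" is only my simplification of what zoo_cycle_next_server does. The thread is joined immediately so the outcome is reproducible; in the real test the timing is uncontrolled, which would explain why the failure is intermittent.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of one client's connection state.
struct ModelClient {
    std::vector<std::string> servers;
    std::atomic<std::size_t> current;  // index of the 'currently connected' server

    explicit ModelClient(std::vector<std::string> s)
        : servers(std::move(s)), current(0) {}

    // Test-side write: pin the client to a specific server.
    // Returns false if the host is not in the list.
    bool pinTo(const std::string &host) {
        for (std::size_t i = 0; i < servers.size(); ++i) {
            if (servers[i] == host) {
                current.store(i);
                return true;
            }
        }
        return false;
    }

    // IO-thread-side write: simplified stand-in for zoo_cycle_next_server.
    // Assumption: it simply advances to the next server, wrapping around.
    void cycleNext() {
        current.store((current.load() + 1) % servers.size());
    }

    std::string connected() const { return servers[current.load()]; }
};

// Pins the client from the "test" side, lets a second thread cycle once,
// and returns the server the test would observe afterwards.
static std::string observeAfterIoThread(ModelClient &client,
                                        const std::string &pinned) {
    client.pinTo(pinned);
    std::thread io([&client] { client.cycleNext(); });
    io.join();  // joined here only to make the sketch deterministic
    return client.connected();
}
```

With the four-host list from the test, pinning to 10.10.10.3:2003 and then letting the modeled IO thread run once leaves the client on a different server than the one the test selected, so any assertion that assumes 10.10.10.3 is evaluated against the wrong state.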



> Intermittent failure in TestReconfig.cc
> ---------------------------------------
>
>                 Key: ZOOKEEPER-2152
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2152
>             Project: ZooKeeper
>          Issue Type: Sub-task
>          Components: c client
>            Reporter: Michi Mutsuzaki
>            Assignee: Michael Han
>              Labels: reconfiguration
>             Fix For: 3.6.0
>
>         Attachments: ZOOKEEPER-2152.patch
>
>
> I'm seeing this failure in the c client test once in a while:
> {noformat}
> [exec] 
> /home/jenkins/jenkins-slave/workspace/ZooKeeper-trunk/trunk/src/c/tests/TestReconfig.cc:474:
>  Assertion: assertion failed [Expression: found != string::npos, 
> 10.10.10.4:2004 not in newComing list]
> {noformat}
> https://builds.apache.org/job/ZooKeeper-trunk/2640/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
