I don't often admit defeat, but I can't make heads or tails of the error handling (or lack thereof) in the reconfiguration code paths. If anybody wants to take a stab at explaining which parts of the processAck -> tryToCommit -> processReconfig -> reconfigure call chain should and shouldn't run if the bind() call fails, maybe I can try to write tests that verify that, and modify the code under test to behave accordingly. I've filed ZOOKEEPER-3198 as an umbrella for this work, and pushed what I've got to https://github.com/mkedwards/zookeeper/tree/broken-bind-3.5, in case somebody wants to try to take it forward from there.
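For what it's worth, the bind() failure itself is easy to reproduce in isolation, without any ZooKeeper classes involved: occupy a loopback port with one ServerSocket and attempt a second bind to the same address. As I understand it, this is essentially the condition testPortChangeToBlockedPort sets up on the quorum ports. A minimal standalone sketch (names here are mine, not from the test suite):

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class BindFailureDemo {
    /**
     * Binds a "blocker" socket to an ephemeral loopback port, then attempts a
     * second bind to the same port. Returns true if the second bind fails,
     * which is the condition the reconfiguration path needs to survive.
     */
    public static boolean secondBindFails() {
        try (ServerSocket blocker = new ServerSocket()) {
            blocker.setReuseAddress(false);
            blocker.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            int port = blocker.getLocalPort();
            try (ServerSocket victim = new ServerSocket()) {
                victim.setReuseAddress(false);
                victim.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
                return false; // second bind unexpectedly succeeded
            } catch (IOException expected) {
                return true; // "Address already in use" -- the case under test
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(secondBindFails()); // expected: true
    }
}
```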
In the meantime, I'm running tests in parallel inside a Docker container (with a code state that has patches applied for all three 3.5 blocker/critical Jiras). Nothing seems "flaky" yet. We'll deploy this in our QA environment next week, throw some load at it, and see what happens. (And run the test suite a few hundred times, too.)

Alex (or anyone else), do you consider any of the other outstanding Jiras to be obstacles to exercising the reconfiguration features in 3.5.x on a production cluster? How serious is https://issues.apache.org/jira/browse/ZOOKEEPER-2202 ? Is it related to https://issues.apache.org/jira/browse/ZOOKEEPER-2836 ? And how serious is https://issues.apache.org/jira/browse/ZOOKEEPER-1896 ? Does mixing 3.4.x and 3.5.x in the same cluster work? Is it best to disable reconfig while migrating cluster members from 3.4.x to 3.5.x, and then enable reconfig and do a rolling restart?

On Sat, Nov 24, 2018 at 12:13 PM Alexander Shraer <[email protected]> wrote:
>
> Hi Michael,
>
> In general, one reconfig op is allowed at a time, and this error indicates
> that one is already in progress. If there are enough peers to form a quorum, a
> failure to connect to one of them shouldn't be a problem. If there are not
> enough, the leader is supposed to give up leadership. This is true in
> general, unrelated to reconfig. A new leader will be elected and complete any
> reconfig in progress. That's the theory, at least; there may be a bug in the
> case you found.
>
> The general flow is described in Sec 3.2 of our paper:
> https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf
>
> There are also the wiki docs, but they don't talk about recovery much:
> https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html
>
> Btw:
>
> > robustness against
> > Byzantine faults that one is led to expect from Zookeeper?
>
> ZK is not designed to handle Byzantine faults in general. That's not to say
> that there is no bug in the case you found.
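On that last question: the knob I'd presumably be toggling, if I'm reading the docs right, is reconfigEnabled in zoo.cfg (since 3.5.3, dynamic reconfig is off unless you enable it). So the staged migration would look something like this fragment (hostnames here are made up; the ;2181 client-port suffix is the 3.5 server-line syntax):

```
# zoo.cfg fragment -- hypothetical 3-node ensemble
reconfigEnabled=false    # keep dynamic reconfig off while mixed 3.4.x/3.5.x
standaloneEnabled=false
server.1=zk1.example.com:2888:3888;2181
server.2=zk2.example.com:2888:3888;2181
server.3=zk3.example.com:2888:3888;2181
```

Then, once every member is on 3.5.x, flip reconfigEnabled=true and do a rolling restart. Whether that's actually the recommended sequence is exactly what I'm asking.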
> Thanks,
> Alex
>
> On Sat, Nov 24, 2018 at 11:32 AM Michael K. Edwards <[email protected]>
> wrote:
>>
>> I've been experimenting a bit with trying to propagate failures to
>> bind() server ports in tests up to where we can do something about it.
>> There's at least one category of test cases (callers of
>> ReconfigTest.testPortChangeToBlockedPort) where the server is supposed
>> to ride through a bind() failure, recovering on a subsequent
>> reconfiguration. In my current code state, I'm encountering errors
>> like this:
>>
>> 2018-11-24 11:04:46,252 [myid:] - INFO [ProcessThread(sid:3
>> cport:-1)::PrepRequestProcessor@878] - Got user-level KeeperException
>> when processing sessionid:0x1002b98aa830000 type:reconfig cxid:0x1e
>> zxid:0x10000002b txntype:-1 reqpath:n/a Error Path:null
>> Error:KeeperErrorCode = ReconfigInProgress
>>
>> I can hack things until this particular test passes, but it raises
>> questions about reconfiguration in general. How exactly is the
>> cluster supposed to get out of this state? If a cluster member drops
>> out of contact with the quorum while there is a reconfiguration in
>> flight, is there any recovery path that restores the ability to
>> process a reconfigure operation? Is there a design doc for
>> reconfiguration that demonstrates the kind of robustness against
>> Byzantine faults that one is led to expect from ZooKeeper?
