Hi Michael,

In general, only one reconfig operation is allowed at a time, and this error indicates that one is already in progress. If there are enough peers to form a quorum, a failure to connect to one of them shouldn’t be a problem. If there are not enough, the leader is supposed to give up leadership. This is true in general and is unrelated to reconfig. A new leader will then be elected and will complete any reconfig in progress. That’s the theory at least; there may be a bug in the case you found.

Some of the general flow is described in Sec 3.2 of our paper:
https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf

There are also the wiki docs, but they don’t say much about recovery:
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html

Btw

> robustness against Byzantine faults that one is led to expect from Zookeeper?

ZK is not designed to handle Byzantine faults in general. That’s not to say that there is no bug in the case you found.

Thanks,
Alex

On Sat, Nov 24, 2018 at 11:32 AM Michael K. Edwards <[email protected]> wrote:

> I've been experimenting a bit with trying to propagate failures to
> bind() server ports in tests up to where we can do something about it.
> There's at least one category of test cases (callers of
> ReconfigTest.testPortChangeToBlockedPort) where the server is supposed
> to ride through a bind() failure, recovering on a subsequent
> reconfiguration. In my current code state, I'm encountering errors
> like this:
>
> 2018-11-24 11:04:46,252 [myid:] - INFO [ProcessThread(sid:3
> cport:-1)::PrepRequestProcessor@878] - Got user-level KeeperException
> when processing sessionid:0x1002b98aa830000 type:reconfig cxid:0x1e
> zxid:0x10000002b txntype:-1 reqpath:n/a Error Path:null
> Error:KeeperErrorCode = ReconfigInProgress
>
> I can hack things until this particular test passes, but it raises
> questions about reconfiguration in general. How exactly is the
> cluster supposed to get out of this state? If a cluster member drops
> out of contact with the quorum while there is a reconfiguration in
> flight, is there any recovery path that restores the ability to
> process a reconfigure operation? Is there a design doc for
> reconfiguration that demonstrates the kind of robustness against
> Byzantine faults that one is led to expect from Zookeeper?
>
