Hi, I tried running your script, but too many changes were needed for it to work, so for lack of time I just ran your scenario manually in my setup, and steps 1-4 worked with no errors. This of course doesn't mean much, but fwiw, here are some general thoughts...
If step 2 completed for sure (at least on the leader), that is, you can see the new config on one of the servers, the error you're seeing shouldn't be happening. So this may already be a bug, or some issue with the setup. The error should only happen if there is an outstanding reconfig on the leader, which was proposed but not yet committed. Even if step 2 hasn't really completed when step 3 starts and this error happens, it should be transient: if you just retry, it should usually succeed (especially since you have only one entity orchestrating reconfigs). If it is stuck in a state where it continuously issues this error, and both servers 1 and 2 are up, then there's probably a bug. (There is actually a related JIRA, https://issues.apache.org/jira/browse/ZOOKEEPER-1699, but I really doubt that this is what you're seeing.)

In step 3, since server 3 successfully connects to the leader (the error message you mention comes from the leader, thrown in line 522 of PrepRequestProcessor.java), it's not important that its initial config includes only 2 and 3 in your scenario. I think the risk of starting a new server with a partial view of the system (rather than all servers in the current config plus the joining server) is that there's a chance that the servers it tries to contact are all down, in which case you'll need to start it again with a different server list. I guess this is what you're doing in step 5, but I didn't understand why you're doing this here. In your scenario, 3 found the leader and encountered a transient error, so there's no need to restart it, just try again.

Other things:
- Please keep in mind that patch 1691 may still need some work.
- Don't include a version in the dynamic config file. The system writes out versions automatically; users should never specify them.
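
To make the "just try again" suggestion concrete, here is a rough sketch of a retry loop in Python. The names are placeholders and not real ZooKeeper client API: `do_reconfig` stands for whatever call your automation uses to issue the reconfig, and `ReconfigInProgress` stands for however the "another reconfig is in progress" error surfaces in your client. The attempt count and delay are arbitrary.

```python
import time


class ReconfigInProgress(Exception):
    """Hypothetical stand-in for the 'another reconfig is in progress' error."""


def reconfig_with_retry(do_reconfig, attempts=5, delay=1.0, sleep=time.sleep):
    """Call do_reconfig() until it succeeds or the attempts run out.

    do_reconfig is a hypothetical callable wrapping the client call that
    issues the reconfig. It is assumed to raise ReconfigInProgress while
    the leader still has a proposed but uncommitted reconfig outstanding.
    """
    for attempt in range(1, attempts + 1):
        try:
            return do_reconfig()
        except ReconfigInProgress:
            if attempt == attempts:
                # Still failing after every retry, so it no longer looks
                # transient; re-raise so the caller can investigate.
                raise
            sleep(delay)
```

If the error is still raised after the last attempt while servers 1 and 2 are both up, that is the "stuck" case that probably indicates a bug.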

Alex

On Sat, Nov 9, 2013 at 10:59 AM, zk questions <[email protected]> wrote:

> Hi,
>
> I've been testing out the dynamic reconfig feature of 3.5 along with using
> this patch (https://issues.apache.org/jira/browse/ZOOKEEPER-1691) and I'm
> having an issue where my zk cluster won't allow me to perform further
> reconfigs.
> So here's what I'm doing:
> 1) Start nodes 1 and 2
> 2) Invoke reconfig on 1 to add 2; this succeeds
> 3) Start node 3 with the initial configuration with the dynamic config set
> to just 2 and 3, where 2 isn't a leader (manually verified)
> 4) Invoke reconfig on 2 to add 3; this fails, with an error indicating that
> another reconfig in progress
> 5) Then I restart 3 with the configuration containing just 1 and 3
> 6) Then I try again to add 3 to the cluster by invoking reconfig on 1 to add
> 3; and again I see an error indicating that another reconfig is in progress
>
> FWIW: I'm testing this scenario to simulate the situation where I'm
> automating the reconfig process and the dynamic configuration for 3 ends up
> containing a node that isn't the leader.
>
> I was wondering what I should do in this situation to recover from the
> failure at step 3 so that we can fix the dynamic config and then attempt a
> proper reconfig (steps 4 - 6)?
>
> I've also attached a tar containing a script to automatically reproduce the
> steps and problem I'm seeing above.
>
> Thanks.
