btw, please update to the latest trunk - there was one reconfig-related patch committed a day ago
On Sat, Nov 9, 2013 at 6:32 PM, Alexander Shraer <[email protected]> wrote:
> Hi,
>
> I tried running your script but there were too many changes needed for
> it to work so I just ran your scenario manually in my setup, for lack
> of time, and steps 1-4 worked with no errors. This of course doesn't
> mean much, but fwiw, here are some general thoughts...
>
> If step 2 completed for sure (at least on the leader), that is, you
> can see the new config on one of the servers, the error you're seeing
> shouldn't be happening. So this may already be a bug, or some issue
> with the setup. The error should only happen if there is an
> outstanding reconfig on the leader, which was proposed but not yet
> committed.
>
> Even if step 2 hasn't really completed when step 3 starts and this
> error happens, it should be transient - if you just retry it should
> usually succeed (especially since you have only one entity
> orchestrating reconfigs). If it is stuck in a state where it
> continuously issues this error, and both servers 1 and 2 are up, then
> there's probably a bug. (There is actually a related JIRA
> https://issues.apache.org/jira/browse/ZOOKEEPER-1699
> but I really doubt that this is what you're seeing).
>
> In step 3, since server 3 successfully connects to the leader (the
> error message you mention comes from the leader, thrown in line 522 of
> PrepRequestProcessor.java) it's not important that its initial config
> includes only 2 and 3 in your scenario.
>
> I think that the risk of starting a new server with a partial
> view of the system (and not all servers in current config + the
> joining server) is that there's a chance that the servers it tries to
> contact are all down, in which case you'll need to start it again with
> a different server list. I guess this is what you're doing in step 5,
> but I didn't understand why you're doing this here - in your scenario
> 3 found the leader and encountered a transient error, no need to
> restart it, just try again.
>
> other things:
> - please keep in mind that patch 1691 may still need some work
> - don't include a version in the dynamic config file. the system
>   writes out versions automatically, the users should never
>   specify them.
>
> Alex
>
> On Sat, Nov 9, 2013 at 10:59 AM, zk questions <[email protected]> wrote:
>> Hi,
>>
>> I've been testing out the dynamic reconfig feature of 3.5 along with using
>> this patch (https://issues.apache.org/jira/browse/ZOOKEEPER-1691) and I'm
>> having an issue where my zk cluster won't allow me to perform further
>> reconfigs.
>> So here's what I'm doing:
>> 1) Start nodes 1 and 2
>> 2) Invoke reconfig on 1 to add 2; this succeeds
>> 3) Start node 3 with the initial configuration with the dynamic config set
>> to just 2 and 3, where 2 isn't a leader (manually verified)
>> 4) Invoke reconfig on 2 to add 3; this fails, with an error indicating that
>> another reconfig is in progress
>> 5) Then I restart 3 with the configuration containing just 1 and 3
>> 6) Then I try again to add 3 to the cluster by invoking reconfig on 1 to add
>> 3; and again I see an error indicating that another reconfig is in progress
>>
>> FWIW: I'm testing this scenario to simulate the situation where I'm
>> automating the reconfig process and the dynamic configuration for 3 ends up
>> containing a node that isn't the leader.
>>
>> I was wondering what I should do in this situation to recover from the
>> failure at step 3 so that we can fix the dynamic config and then attempt a
>> proper reconfig (steps 4 - 6)?
>>
>> I've also attached a tar containing a script to automatically reproduce the
>> steps and problem I'm seeing above.
>>
>> Thanks.
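
For an automated setup, the "just retry" advice in the quoted reply could look roughly like the sketch below. It is only an illustration, not ZooKeeper client code: `ReconfigInProgress`, `retry_reconfig` and the `do_reconfig` callable are hypothetical names standing in for whatever error and reconfig call your client library actually exposes, and the attempt count and delays are arbitrary.

```python
import time


class ReconfigInProgress(Exception):
    """Hypothetical stand-in for the 'another reconfig is in progress' error."""


def retry_reconfig(do_reconfig, attempts=5, delay=1.0, sleep=time.sleep):
    """Call do_reconfig() until it succeeds or the attempts run out.

    do_reconfig is a hypothetical zero-argument callable that issues the
    reconfig request and raises ReconfigInProgress on the transient error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return do_reconfig()
        except ReconfigInProgress:
            if attempt == attempts:
                # Still failing after every retry: this no longer looks
                # transient, so surface the error to the caller.
                raise
            sleep(delay)
            delay *= 2  # back off before the next attempt
```

If the error is still raised after the last attempt while servers 1 and 2 are both up, that matches the "probably a bug" case described in the reply rather than a transient condition.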
