Re: Problem recovering from a bad reconfig (3.5)

Alexander Shraer Sat, 09 Nov 2013 18:33:53 -0800

Hi,

I tried running your script, but too many changes were needed for it
to work, so for lack of time I just ran your scenario manually in my
setup, and steps 1-4 worked with no errors. This of course doesn't
mean much, but FWIW, here are some general thoughts...

If step 2 completed for sure (at least on the leader), that is, you
can see the new config on one of the servers, the error you're seeing
shouldn't be happening. So this may already be a bug, or some issue
with the setup. The error should only happen if there is an
outstanding reconfig on the leader, which was proposed but not yet
committed.

Even if step 2 hasn't really completed when step 3 starts and this
error happens, it should be transient - if you just retry, it should
usually succeed (especially since you have only one entity
orchestrating reconfigs). If it is stuck in a state where it
continuously issues this error, and both servers 1 and 2 are up, then
there's probably a bug. (There is actually a related JIRA
https://issues.apache.org/jira/browse/ZOOKEEPER-1699
but I really doubt that this is what you're seeing).
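
A minimal sketch of the "just retry" advice above, assuming the client
surfaces the leader's error as an exception. ReconfigInProgress and
retry_reconfig are hypothetical names used for illustration only, not
part of any ZooKeeper client API; attempt is whatever callable issues
the reconfig in your setup.

```python
import time


class ReconfigInProgress(Exception):
    """Hypothetical stand-in for the leader's 'reconfig in progress' error."""


def retry_reconfig(attempt, max_tries=5, base_delay=0.5, sleep=time.sleep):
    """Call attempt() until it succeeds or max_tries is exhausted.

    Only ReconfigInProgress is treated as transient; any other
    exception propagates immediately.
    """
    for n in range(1, max_tries + 1):
        try:
            return attempt()
        except ReconfigInProgress:
            if n == max_tries:
                raise  # still failing after several tries: likely not transient
            sleep(base_delay * n)  # linear backoff before the next try
```

If the error persists after a handful of retries while both servers 1
and 2 are up, that is the "stuck" case described above and is worth
reporting rather than retrying further.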

In step 3, since server 3 successfully connects to the leader (the
error message you mention comes from the leader, thrown at line 522 of
PrepRequestProcessor.java), it's not important that its initial config
includes only 2 and 3 in your scenario.

I think that the risk of starting a new server with a partial view of
the system (rather than all servers in the current config + the
joining server) is that there's a chance that the servers it tries to
contact are all down, in which case you'll need to start it again with
a different server list. I guess this is what you're doing in step 5,
but I didn't understand why you're doing this here - in your scenario
server 3 found the leader and encountered a transient error, so there's
no need to restart it, just try again.

Other things:
- Please keep in mind that the patch for ZOOKEEPER-1691 may still need
some work.
- Don't include a version in the dynamic config file. The system
writes out versions automatically; users should never
specify them.
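
To illustrate the two points about the joining server's initial dynamic
config (list all servers in the current config plus the joiner, and
leave out any version line), here is a rough sketch that renders such a
file. The helper name, host names, and ports are made up for
illustration, and the line layout assumes the 3.5 dynamic config
format of server.<id>=<host>:<port1>:<port2>:<role>;<client port>.

```python
def joiner_dynamic_config(current_servers, joiner):
    """Render dynamic config text for a server that is about to join.

    Both arguments map server id -> (host, quorum_port, election_port,
    client_port). The result lists every current member plus the joiner
    and deliberately contains no version line.
    """
    members = dict(current_servers)
    members.update(joiner)
    lines = []
    for sid in sorted(members):
        host, quorum, election, client = members[sid]
        lines.append(
            f"server.{sid}={host}:{quorum}:{election}:participant;{client}"
        )
    return "\n".join(lines) + "\n"


# Hypothetical hosts and ports, for illustration only.
current = {1: ("host1", 2888, 3888, 2181), 2: ("host2", 2888, 3888, 2181)}
new = {3: ("host3", 2888, 3888, 2181)}
config_text = joiner_dynamic_config(current, new)
```

With the full member list, server 3 can still find the leader if one
of the servers it knows about happens to be down.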

Alex

On Sat, Nov 9, 2013 at 10:59 AM, zk questions <[email protected]> wrote:
> Hi,
>
> I've been testing out the dynamic reconfig feature of 3.5 along with using
> this patch (https://issues.apache.org/jira/browse/ZOOKEEPER-1691) and I'm
> having an issue where my zk cluster won't allow me to perform further
> reconfigs.
> So here's what I'm doing:
> 1) Start nodes 1 and 2
> 2) Invoke reconfig on 1 to add 2; this succeeds
> 3) Start node 3 with the initial configuration with the dynamic config set
> to just 2 and 3, where 2 isn't a leader (manually verified)
> 4) Invoke reconfig on 2 to add 3; this fails, with an error indicating that
> another reconfig is in progress
> 5) Then I restart 3 with the configuration containing just 1 and 3
> 6) Then I try again to add 3 to the cluster by invoking reconfig on 1 to add
> 3; and again I see an error indicating that another reconfig is in progress
>
> FWIW: I'm testing this scenario to simulate the situation where I'm
> automating the reconfig process and the dynamic configuration for 3 ends up
> containing a node that isn't the leader.
>
> I was wondering what I should do in this situation to recover from the
> failure at step 3 so that we can fix the dynamic config and then attempt a
> proper reconfig (steps 4 - 6)?
>
> I've also attached a tar containing a script to automatically reproduce the
> steps and problem I'm seeing above.
>
> Thanks.
