Re: Problem recovering from a bad reconfig (3.5)

Alexander Shraer Sat, 09 Nov 2013 18:37:47 -0800

btw, please update to the latest trunk - there was one
reconfig-related patch committed a day ago


On Sat, Nov 9, 2013 at 6:32 PM, Alexander Shraer <[email protected]> wrote:
> Hi,
>
> I tried running your script, but too many changes were needed for
> it to work, so for lack of time I just ran your scenario manually in
> my setup, and steps 1-4 worked with no errors. This of course doesn't
> mean much, but fwiw, here are some general thoughts...
>
> If step 2 completed for sure (at least on the leader), that is, you
> can see the new config on one of the servers, the error you're seeing
> shouldn't be happening. So this may already be a bug, or some issue
> with the setup. The error should only happen if there is an
> outstanding reconfig on the leader, which was proposed but not yet
> committed.
>
> Even if step 2 hasn't really completed when step 3 starts and this
> error happens, it should be transient - if you just retry it should
> usually succeed (especially since you have only one entity
> orchestrating reconfigs). If it is stuck in a state where it
> continuously issues this error, and both servers 1 and 2 are up, then
> there's probably a bug. (There is actually a related JIRA
> https://issues.apache.org/jira/browse/ZOOKEEPER-1699
> but I really doubt that this is what you're seeing).
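
A minimal sketch of the retry idea, in Python for brevity. The
ReconfigInProgress exception and the reconfig callable are hypothetical
stand-ins for whatever client call and error your automation uses; they
are not ZooKeeper API names. The attempt count and delay are arbitrary.

```python
import time


class ReconfigInProgress(Exception):
    """Hypothetical error: the leader still has an uncommitted reconfig."""


def retry_reconfig(reconfig, attempts=5, delay=1.0, sleep=time.sleep):
    """Call reconfig() until it succeeds or the attempts run out.

    reconfig is a hypothetical zero-argument callable that raises
    ReconfigInProgress while another reconfig is outstanding.
    """
    for attempt in range(1, attempts + 1):
        try:
            return reconfig()
        except ReconfigInProgress:
            if attempt == attempts:
                # Still failing after every retry: probably not transient.
                raise
            sleep(delay)
```

If the last attempt still fails, the exception propagates, which is the
case worth reporting as a possible bug rather than a transient error.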
>
> In step 3, since server 3 successfully connects to the leader (the
> error message you mention comes from the leader, thrown in line 522 of
> PrepRequestProcessor.java), it's not important that its initial config
> includes only 2 and 3 in your scenario.
>
> I think that the risk of starting a new server with a partial
> view of the system (and not all servers in current config + the
> joining server) is that there's a chance that the servers it tries to
> contact are all down, in which case you'll need to start it again with
> a different server list. I guess this is what you're doing in step 5,
> but I didn't understand why you're doing this here - in your scenario
> 3 found the leader and encountered a transient error, no need to
> restart it, just try again.
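
One way to avoid the partial-view problem is to generate the joining
server's initial list from the full current config plus the joiner. A
rough sketch in Python; the helper name, host names, and ports are
made up for illustration, and the spec strings assume the 3.5-style
"host:port:port:role;clientPort" form.

```python
def initial_dynamic_config(current_servers, joining_id, joining_spec):
    """Return dynamic-config lines: every current member plus the joiner.

    current_servers maps a server id to its spec string (assumed values).
    """
    servers = dict(current_servers)
    servers[joining_id] = joining_spec
    return "\n".join(
        f"server.{sid}={spec}" for sid, spec in sorted(servers.items())
    )


# Hypothetical hosts and ports, for illustration only.
current = {
    1: "host1:2888:3888:participant;2181",
    2: "host2:2888:3888:participant;2181",
}
print(initial_dynamic_config(current, 3, "host3:2888:3888:participant;2181"))
```

With every current member listed, the new server can still find the
leader even if some of the servers it tries first are down.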
>
> other things:
> - please keep in mind that patch 1691 may still need some work
> - don't include a version in the dynamic config file. the system
> writes out versions automatically, the users should never
> specify them.
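
If your automation builds the dynamic config file from text that may
already carry a version entry, a small filter can drop it before the
file is written. This sketch is in Python and assumes the version
appears as a line starting with "version="; the helper name and the
sample values are hypothetical.

```python
def strip_version(dynamic_config_text):
    """Drop any version= line, keeping the other lines unchanged."""
    kept = [
        line
        for line in dynamic_config_text.splitlines()
        if not line.strip().startswith("version=")
    ]
    return "\n".join(kept)


# Hypothetical input, for illustration only.
sample = "server.1=host1:2888:3888:participant;2181\nversion=100000000"
print(strip_version(sample))
```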
>
> Alex
>
> On Sat, Nov 9, 2013 at 10:59 AM, zk questions <[email protected]> wrote:
>> Hi,
>>
>> I've been testing out the dynamic reconfig feature of 3.5 along with using
>> this patch (https://issues.apache.org/jira/browse/ZOOKEEPER-1691) and I'm
>> having an issue where my zk cluster won't allow me to perform further
>> reconfigs.
>> So here's what I'm doing:
>> 1) Start nodes 1 and 2
>> 2) Invoke reconfig on 1 to add 2; this succeeds
>> 3) Start node 3 with the initial configuration with the dynamic config set
>> to just 2 and 3, where 2 isn't a leader (manually verified)
>> 4) Invoke reconfig on 2 to add 3; this fails, with an error indicating that
>> another reconfig is in progress
>> 5) Then I restart 3 with the configuration containing just 1 and 3
>> 6) Then I try again to add 3 to the cluster by invoking reconfig on 1 to add
>> 3; and again I see an error indicating that another reconfig is in progress
>>
>> FWIW: I'm testing this scenario to simulate the situation where I'm
>> automating the reconfig process and the dynamic configuration for 3 ends up
>> containing a node that isn't the leader.
>>
>> I was wondering what I should do in this situation to recover from the
>> failure at step 3 so that we can fix the dynamic config and then attempt a
>> proper reconfig (steps 4 - 6)?
>>
>> I've also attached a tar containing a script to automatically reproduce the
>> steps and problem I'm seeing above.
>>
>> Thanks.
