[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006716#comment-13006716
 ] 

Alexander Shraer commented on ZOOKEEPER-107:
--------------------------------------------

Hi Vishal,

Thanks a lot for your comments and for offering to help. I have some experience 
with reconfigurable solutions, but I'm new to the ZooKeeper project, so your help 
will be much appreciated. I suggest we start by figuring out what we'd like to 
do, and then we can decide who does what. 

As you suggested, I numbered the sections so they're easier to refer to. Some of 
the things you point out are very important, but I think they are separate from 
the core design, so we can address them separately.

A. I agree that bootstrapping the cluster should probably be done this way, but 
this seems like a separate issue - in that case the reconfiguration is very 
simple, since the state is empty and we can block until the servers of the 
configuration connect. One related thing - in my proposal I suggested that the 
server that gets the reconfig request from a client sends a message to 
members(M') instructing them to connect to leader(M). This can be done by 
invoking a join(M) command on those servers.
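
To make this concrete, here is a rough sketch of such a join command; the class 
and method names below are made up for illustration and are not existing 
ZooKeeper API:

// Illustrative only: ask each server in M' to connect to leader(M) and begin
// state transfer. All names here are hypothetical.
import java.util.List;

public class JoinSketch {
    static void join(List<String> newMembers, String leaderOfM) {
        for (String server : newMembers) {
            // e.g. an admin command sent to 'server' asking it to connect to
            // leaderOfM and start syncing before the actual reconfig runs
            System.out.println("join: " + server + " -> " + leaderOfM);
        }
    }

    public static void main(String[] args) {
        join(List.of("serverD:2888", "serverE:2888"), "serverA:2888");
    }
}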

B. This is an important issue and I definitely agree we should discuss this, 
but I think this is also orthogonal to any solution we choose now.

C. Agree, second approach looks good (leader periodically broadcasts 
membership, or at least the configuration id).
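
As a rough illustration of that second approach (hypothetical names, not actual 
ZooKeeper code): the leader could piggyback the current configuration version on 
its periodic ping, and a follower that sees a newer version would fetch the full 
configuration:

// Hypothetical sketch of "leader periodically broadcasts the config id".
public class ConfigBroadcastSketch {
    private long myConfigVersion = 5;  // follower-side view of the config

    // called on the follower whenever a ping/heartbeat from the leader arrives
    void onLeaderPing(long leaderConfigVersion) {
        if (leaderConfigVersion > myConfigVersion) {
            // we are behind: ask the leader for the full membership list
            System.out.println("stale config " + myConfigVersion
                    + ", fetching config " + leaderConfigVersion);
            myConfigVersion = leaderConfigVersion;  // after the fetch succeeds
        }
    }

    public static void main(String[] args) {
        ConfigBroadcastSketch follower = new ConfigBroadcastSketch();
        follower.onLeaderPing(5);  // up to date, nothing to do
        follower.onLeaderPing(6);  // newer config id seen -> fetch new membership
    }
}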

D. The purpose of connecting is only to begin state transfer, so only 
connecting to the leader matters. For example, there is no need
for members(M) to detect failure of new servers that are trying to connect. If 
the leader fails then the new leader can be responsible for establishing 
connections to members(M') as part of completing the reconfiguration (see first 
step in Section 4.2).

> Also, in step 7, why do you send leader(M') along with <activate-config> 
> message?

The idea is that there is no need to explicitly run leader election in M' - the 
old leader can appoint one of members(M') to be the new leader (the new leader 
should be chosen such that it is up to date with M). This way, leader election in 
M' is only required if the designated leader fails. Including leader(M') in the 
phase-3 message instructs that server to start acting as leader of M'.
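
A sketch of how a member of M' might handle such a phase-3 message (again, the 
names are illustrative only, not a real ZooKeeper message format):

// Hypothetical sketch: the <activate-config> message carries the id of the
// server that leader(M) designated as leader(M'); no election is needed
// unless that server fails.
public class ActivateConfigSketch {
    final long myServerId;

    ActivateConfigSketch(long myServerId) { this.myServerId = myServerId; }

    void onActivateConfig(long newConfigVersion, long designatedLeaderId) {
        if (myServerId == designatedLeaderId) {
            System.out.println("config " + newConfigVersion
                    + " activated: start acting as leader(M')");
        } else {
            System.out.println("config " + newConfigVersion
                    + " activated: follow server " + designatedLeaderId);
        }
    }

    public static void main(String[] args) {
        new ActivateConfigSketch(3).onActivateConfig(7, 3); // becomes leader(M')
        new ActivateConfigSketch(4).onActivateConfig(7, 3); // follows server 3
    }
}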

E. This is closely related to ZOOKEEPER-22 (how does a client know whether its 
operation has been executed if the leader failed?).
In this case, though, it should be easy to detect whether the reconfiguration 
was completed or not - the recovery procedure will have the property that once 
a new leader is established, either that leader has observed the reconfiguration 
attempt and completed it, or no future leader ever will (if we fix 
ZOOKEEPER-335 :)).  

We should provide a way to query for the current configuration, both for the 
issue you mention and also for the non-incremental reconfig API (I suggest that a 
client invoking a reconfig request include the version of the current config in 
the request; if that config is no longer current, the request should be aborted - 
see Section 4.3).
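
In other words, the non-incremental reconfig would behave like a compare-and-swap 
on the configuration version. A minimal sketch, with hypothetical names:

// Hypothetical sketch: abort a non-incremental reconfig whose expected config
// version no longer matches the current one.
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class VersionedReconfigSketch {
    private final AtomicLong currentConfigVersion = new AtomicLong(5);

    boolean reconfig(long expectedVersion, List<String> newMembers) {
        // proceed only if the client saw the configuration that is still current
        if (!currentConfigVersion.compareAndSet(expectedVersion, expectedVersion + 1)) {
            System.out.println("reconfig aborted: config changed since version "
                    + expectedVersion);
            return false;
        }
        System.out.println("reconfig to " + newMembers + " accepted as version "
                + (expectedVersion + 1));
        return true;
    }

    public static void main(String[] args) {
        VersionedReconfigSketch s = new VersionedReconfigSketch();
        s.reconfig(5, List.of("A", "B", "C", "D"));  // succeeds, version becomes 6
        s.reconfig(5, List.of("A", "B", "C"));       // aborted: stale version
    }
}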

F. This is exactly what I meant - I think you didn't notice the "’" (it says 
"to be leader(M’)").

G. I agree that online would be better. I initially thought that it's not 
possible in this case, but now I think there's a way.

> a. Let new members(M') join the cluster as OBSERVERS. Based on the
> current implementation, M' will essentially sync with the leader after 
> becoming
> OBSERVERS and it will be easier to leverage the RequestProcessor pipeline for 
> reconfig.

This was also my intention - to let them connect to the leader and perform 
state transfer before the leader is requested to reconfigure.

> b. Once a majority of M' join the leader, the leader executes phase 1 as you 
> described.

I think a reconfig operation should be sent to the leader only after a majority 
of M' has connected to it. This way we avoid waiting for this to happen in the 
leader code. The leader can check whether a majority is connected and not too 
far behind with regard to state transfer, and abort otherwise (line 1 in the 
algorithm).
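
Roughly, that check at line 1 could look like the following (purely a sketch, 
with made-up names and an arbitrary lag threshold):

// Hypothetical sketch of the precondition before starting a reconfig: a quorum
// of M' must be connected to the leader and reasonably caught up.
import java.util.Map;

public class ReconfigPreconditionSketch {
    static final long MAX_LAG = 1000;  // arbitrary threshold: txns behind the leader

    // lagByConnectedServer holds only the servers of M' that are currently
    // connected to the leader, mapped to how far behind they are
    static boolean canStartReconfig(int sizeOfNewConfig,
                                    Map<String, Long> lagByConnectedServer) {
        int quorum = sizeOfNewConfig / 2 + 1;
        long caughtUp = lagByConnectedServer.values().stream()
                .filter(lag -> lag <= MAX_LAG)
                .count();
        return caughtUp >= quorum;  // abort the reconfig if this is false
    }

    public static void main(String[] args) {
        // M' has 3 servers; A and B are connected and caught up, C is not connected
        System.out.println(canStartReconfig(3, Map.of("A", 0L, "B", 10L)));  // true
        System.out.println(canStartReconfig(3, Map.of("A", 0L)));            // false
    }
}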

> c. After phase 1., the leader changes the transaction logic a bit. Every 
> transaction after this point (including reconfig-start) will be sent to M and 
> M'.  
> Leader then waits for quorum(M) and quorum(M') to ack the transaction. So M' 
> are not pure OBSERVERS as we think today. However, only a member of
> M can become a leader until reconfig succeeds.  Also, M' - (M n M') cannot 
> serve client requests until reconfiguration is complete. By doing a
> transaction on both M and M' and waiting for the quorum of each set to ack, 
> we keep transferring the state to both the configurations.
> d. After receiving ack for phase 2, the leader sends <switch-to-new-config> 
> transaction to M and M' (instead of sending retire just to M).
> After receiving this message, M' will close connections with (and reject 
> connections from) members not in M'.  Members that are supposed to
> leave the cluster will shutdown QuorumPeer. 

Here are some issues that I see with your proposal:

1. Suppose that leader(M) fails while sending phase-3 messages, and all phase-3 
messages reach members(M) but none reach members(M'). Now the system is stuck: 
M' - (M n M') cannot serve client requests, whereas members(M) are no longer in 
the system.
This is why I think that before garbage-collecting M we must make sure that M' 
is operational by itself. 

2. In my proposal, members(M') can accept operations as soon as they get the 
phase-2 message - that is the purpose of phase 2. 
I don't see why phase 2 is needed in your proposal. I suggest we stick with 3 
phases, so that the phase-2 message allows M' to become independent.

3. Because of the Global Primary Order property of ZAB, we can't allow a commit 
message to arrive from leader(M) at members(M') after they are allowed to process 
messages independently in M'. Your solution handles this because members(M') 
close connections to leader(M) as soon as they're allowed to process messages 
independently, but then it's not clear how the outstanding ops get committed. In 
order to commit these operations nevertheless, I suggest that this be the task 
of leader(M'). 
 
Here's what I suggest:

- After leader(M) sends the phase-1 message to members(M), it starts sending 
every received new op to both members(M) and members(M') (treating them as 
followers). Commit messages must not be sent to members(M') by leader(M); 
sending such messages to members(M) does not hurt but is not necessary. In any 
case, clients are not acked regarding these ops. 

- As soon as enough acks for the phase-1 message are received from members(M), 
leader(M) sends a phase-2 message to members(M') and from that moment does not 
accept new ops (unless leader(M) is also in M' and therefore also acts as 
leader(M')).
 
- When members(M') receive the phase-2 message from leader(M), they send an ack 
both to leader(M) and to leader(M') - in practice each of them can send an ack 
to leader(M) first, then disconnect from it, then connect to leader(M') and send 
an ack to it, but the order doesn't matter for correctness. When leader(M') 
receives this ack from a quorum of members(M') (as well as the phase-2 message 
from leader(M)), it can commit all preceding ops in M'. 
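
A sketch of the resulting commit condition at leader(M') (hypothetical names; 
this is not the actual ZAB code):

// Hypothetical sketch: leader(M') may commit the ops preceding the
// reconfiguration only after it has (a) the phase-2 message from leader(M)
// and (b) phase-2 acks from a quorum of members(M').
import java.util.HashSet;
import java.util.Set;

public class Phase2CommitSketch {
    private final int newConfigSize;
    private final Set<Long> phase2Acks = new HashSet<>();
    private boolean phase2Received = false;
    private boolean committed = false;

    Phase2CommitSketch(int newConfigSize) { this.newConfigSize = newConfigSize; }

    void onPhase2FromOldLeader()    { phase2Received = true; maybeCommit(); }
    void onPhase2Ack(long serverId) { phase2Acks.add(serverId); maybeCommit(); }

    private void maybeCommit() {
        int quorum = newConfigSize / 2 + 1;
        if (!committed && phase2Received && phase2Acks.size() >= quorum) {
            committed = true;
            System.out.println("committing all preceding ops in M'");
        }
    }

    public static void main(String[] args) {
        Phase2CommitSketch newLeader = new Phase2CommitSketch(3);
        newLeader.onPhase2FromOldLeader();
        newLeader.onPhase2Ack(1);
        newLeader.onPhase2Ack(2);  // quorum of M' (2 out of 3) reached -> commit
    }
}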

The thing is, until ZOOKEEPER-22 is fixed the fact that we're processing ops in 
M' doesn't really help since the clients won't be able to know that the ops got 
committed.

H.  I agree - it's overkill :)  We can make sure that a quorum of M' is 
connected when the reconfiguration starts.

I.  I agree - it is important to fix ZOOKEEPER-335 (as well as ZOOKEEPER-22).

J. As long as a quorum of both the old and the new config is up during the 
reconfig, it seems like we can do the reconfig (e.g., the majority of 2 is 2). 
Currently a warning is issued if there are 2 nodes in a configuration; we can do 
that too. Actually, we need to define liveness more carefully - it's on my ToDo 
list :)
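
Just to spell out the arithmetic behind "the majority of 2 is 2":

// Quorum size for an n-server ensemble is strictly more than half.
public class QuorumSizeSketch {
    static int quorum(int n) { return n / 2 + 1; }

    public static void main(String[] args) {
        System.out.println(quorum(2));  // 2: both servers of a 2-node config must be up
        System.out.println(quorum(3));  // 2: a 3-node config tolerates one failure
    }
}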

Best Regards,
Alex



> Allow dynamic changes to server cluster membership
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-107
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-107
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Henry Robinson
>         Attachments: SimpleAddition.rtf
>
>
> Currently cluster membership is statically defined, adding/removing hosts 
> to/from the server cluster dynamically needs to be supported.
