Vishal/Alex, would it be good to have these comments/design on the JIRA? It's probably better to keep design discussion/comments on the JIRA.
thanks
mahadev

On Sun, Mar 13, 2011 at 10:25 AM, Vishal Kher <[email protected]> wrote:

Hi Alex,

Great! Thanks for the design. I have a few suggestions/questions below.

A. Bootstrapping the cluster

The algorithm assumes that a leader exists prior to doing a reconfig, so it looks like the reconfig() API is not intended for bootstrapping the cluster. How about we define an API for initializing the cluster?

Instead of relying on the current method of setting the initial configuration in the config file, we should probably also define a join(M) (or init(M)) API. When a peer receives this request, it will try to connect to the specified peers. During bootstrap, peers will connect to each other (as they do now) and elect a leader.

B. Storing membership in a znode and updating clients (a topic tangential to this design)

Earlier, ZOOKEEPER-107 proposed a URI-based approach for clients to fetch the server list. I am not opposed to the URI-based approach; however, it shouldn't be our only approach. The URI-based approach requires extra resources (e.g., a fault-tolerant web service, or shared storage for a file), and in certain environments it may not be feasible to have such a resource. Instead, can we use a mechanism similar to ZooKeeper watches for this? Store the membership information in a znode and let the ZooKeeper server inform the clients of the changed configuration.
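A minimal sketch of what point B could look like on the client side, assuming a hypothetical znode path "/cluster/members" that the leader rewrites after every successful reconfig; the znode path and payload format are illustrative only, while the watch mechanics use the standard ZooKeeper client API:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class MembershipWatcher implements Watcher {
        // Hypothetical znode holding the current server list.
        private static final String MEMBERS_ZNODE = "/cluster/members";
        private final ZooKeeper zk;

        public MembershipWatcher(ZooKeeper zk) throws Exception {
            this.zk = zk;
            refresh();
        }

        // Read the current membership and re-arm the watch in one call.
        private void refresh() throws Exception {
            byte[] data = zk.getData(MEMBERS_ZNODE, this, null);
            String serverList = new String(data, "UTF-8");
            // ...update the client's connect string from serverList...
            System.out.println("membership is now: " + serverList);
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    refresh(); // watches fire once, so re-register
                } catch (Exception e) {
                    // session loss etc.; caller should reconnect and re-watch
                }
            }
        }
    }

Since the server list itself tells the client which servers it can watch, the one gap left is a client whose entire cached list has been replaced; that is where the URI fallback from the original proposal would still help.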
C. Locating the most recent config on servers

The URI-based approach will be more helpful to servers. For example, if A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes back online it won't be able to locate the most recent config. In this case, A can query the URI. A second approach is to have the leader periodically try to send the membership (at least to nodes that are down).

D. "Send a message to all members(M') instructing them to connect to leader(M)."

leader(M) can potentially change after this message is sent. Should this be "Send a message to all members(M') to connect to members(M)"? See point G below.

Also, in step 7, why do you send leader(M') along with the <activate-config> message?

E. "3. Submit a reconfig(M') operation to leader(M)"

What if leader(M) fails after receiving the request, but before informing any other peer? How will the administrative client know whether to retry or not? Should the server s that received the request retry if the leader fails, and should the client API retry if s fails? (One retry-safe scheme is sketched after point G below.)

F. Regarding 3.a and 3.b

The algorithm mentions:

"3. Otherwise:
a. Designate the most up-to-date server in connected quorum of M' to be leader(M)
b. Stop processing new operations, return failure for any further ops received"

Why do we need to change the leader if leader(M) is not in M'? How about we let the leader perform the reconfig, and at the end of phase 3 (while processing retire) the leader gives up leadership. This will cause a new leader election, and one of the peers in M' will become the leader. Similarly, after the second phase, members(M') will stop following leader(M) if leader(M) is not in M'. I think this will be simpler.

G. Online vs. offline

In your "Online vs offline" section, you mentioned that the offline strategy is preferred. I think we can do the reconfiguration online. I pictured M' as modified OBSERVERS until the reconfiguration is complete.

a. Let the new members(M') join the cluster as OBSERVERS. Based on the current implementation, M' will essentially sync with the leader after becoming OBSERVERS, and it will be easier to leverage the RequestProcessor pipeline for reconfig.

b. Once a majority of M' have joined the leader, the leader executes phase 1 as you described.

c. After phase 1, the leader changes the transaction logic a bit. Every transaction after this point (including reconfig-start) will be sent to both M and M'. The leader then waits for quorum(M) and quorum(M') to ack the transaction, so M' are not pure OBSERVERS as we think of them today. However, only a member of M can become leader until the reconfig succeeds. Also, M' - (M ∩ M') cannot serve client requests until the reconfiguration is complete. By applying each transaction to both M and M' and waiting for a quorum of each set to ack, we keep transferring state to both configurations.

d. After receiving acks for phase 2, the leader sends a <switch-to-new-config> transaction to M and M' (instead of sending retire just to M). After receiving this message, M' will close connections with (and reject connections from) members not in M'. Members that are supposed to leave the cluster will shut down their QuorumPeer. If leader(M) is not in M', then a new leader(M') will be elected.

Let me know what you think about this.
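To make the commit rule in point G.c concrete, here is a minimal sketch of a dual-quorum ack tracker: a transaction counts as committed only once a majority of M and a majority of M' have acked it. All class and method names are hypothetical; in ZooKeeper this logic would live in the leader's proposal handling, not in a standalone class:

    import java.util.HashSet;
    import java.util.Set;

    public class DualQuorumTracker {
        private final Set<Long> oldView;  // server ids in M
        private final Set<Long> newView;  // server ids in M'
        private final Set<Long> acks = new HashSet<Long>();

        public DualQuorumTracker(Set<Long> oldView, Set<Long> newView) {
            this.oldView = oldView;
            this.newView = newView;
        }

        public synchronized void ack(long serverId) {
            acks.add(serverId);
        }

        // Committed only when both views independently reach a majority,
        // which is what keeps M and M' equally up to date during reconfig.
        public synchronized boolean committed() {
            return isMajority(oldView) && isMajority(newView);
        }

        private boolean isMajority(Set<Long> view) {
            int acked = 0;
            for (long id : view) {
                if (acks.contains(id)) {
                    acked++;
                }
            }
            return acked > view.size() / 2;
        }
    }

Note that a server in M ∩ M' counts toward both majorities with a single ack, which matches the intent of c. above.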
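For the retry question in point E, one way to make retries harmless is to condition the reconfig on the version of the configuration it was computed from, in the spirit of ZooKeeper's conditional setData(path, data, version). The admin-client interface below is purely hypothetical:

    import java.io.IOException;
    import java.util.Set;

    public final class ReconfigRetry {

        // Hypothetical admin API: reconfig succeeds only if the current
        // config version still equals expectedVersion.
        public interface AdminClient {
            boolean reconfig(Set<Long> newView, long expectedVersion) throws IOException;
        }

        // If an earlier attempt already took effect, it bumped the config
        // version, so a retry fails the version check instead of applying
        // the reconfiguration a second time.
        public static boolean reconfigWithRetry(AdminClient client, Set<Long> newView,
                                                long expectedVersion, int maxRetries) {
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return client.reconfig(newView, expectedVersion);
                } catch (IOException e) {
                    // leader(M) or server s may have failed mid-request;
                    // retrying is safe because of the version check above
                }
            }
            return false;
        }
    }

A caller that gets false back can then read the current configuration to distinguish "already applied" from "lost the race to a different reconfig".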
H. What happens if a majority of M' keeps failing during reconfig?

M={A, B}, M'={A, C, D, E}. What if D and E fail?

Failure of a majority of M' will permanently stall the reconfig. While this is less likely to happen, I think ZooKeeper should handle it automatically: after a few retries, we should abort the reconfig. Otherwise, we could disrupt a running cluster and never be able to recover without manual intervention. If the majority fails after phase 1, this would mean sending an <abort-reconfig, version(M), M'> to M.

Of course, one can argue: what if a majority of M' fails after phase 3? So I am not sure whether this is overkill, but I feel we should handle this case.

I. "Old reconfiguration request"

a. We should use ZAB.
b. A side note: I think ZOOKEEPER-335 needs to be resolved for reconfig to work. This bug causes logs to diverge if the ZK leader fails before sending PROPOSALs to followers (see
http://www.mail-archive.com/[email protected]/msg00403.html).

Because of this bug we could run into the following scenario:
- A peer B that was leader when reconfig(M') was issued will have reconfig(M') in its transaction log.
- A peer C that became leader after B's failure can have reconfig(M'') in its log.
- Now, if B and C fail (say both reboot), the outcome of the reconfig will depend on which node takes leadership: if B becomes leader, the outcome is M'; if C becomes leader, the outcome is M''.

J. Policies / sanity checks

- Should we allow removal of a peer in a 3-node configuration? How about in a 2-node configuration?

Can you please add section numbers to the design paper? It will be easier to refer to the text by number.

Thanks again for working on this. We are about to release the first version of our product using ZooKeeper, which uses a static configuration. Our next version is heavily dependent on dynamic membership. We have resources allocated at work that can dedicate time to implementing this feature for our next release, and we are interested in contributing to it. We will be happy to chip in from our side to help with the implementation.

Regards,
-Vishal

On Thu, Mar 10, 2011 at 2:19 AM, Alexander Shraer <[email protected]> wrote:

Hi,

I'm working on adding support for dynamic membership changes in ZooKeeper clusters (ZOOKEEPER-107). I've added a proposed algorithm and some discussion here:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ClusterMembership

Any comments and suggestions are very welcome.

Best Regards,
Alex
