[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13006565#comment-13006565
 ] 

Vishal K commented on ZOOKEEPER-107:
------------------------------------



Great! Thanks for the design. I have a few suggestions/questions below.

A. Bootstrapping the cluster
    The algorithm assumes that a leader exists prior to doing a reconfig.
So it looks like the reconfig() API is not intended to be used for bootstrapping
the cluster. How about we define an API for initializing the cluster?

Instead of relying on the current method of setting the initial configuration
in the config file, we could also define a join(M) (or init(M)) API. When a peer
receives this request, it will try to connect to the specified peers. During
bootstrap, peers will connect to each other (as they do now) and elect a leader.
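
As a rough illustration, here is a minimal sketch of what such a bootstrap API
could look like. The interface name, method name, and address type are my own
assumptions for illustration, not something from the design:

    // Sketch only: a hypothetical bootstrap API, not from the design or the
    // current code base. init(M) hands a peer its initial membership M; the
    // peer then connects to the listed servers and runs leader election.
    import java.net.InetSocketAddress;
    import java.util.Map;

    interface ClusterBootstrap {
        // M maps server id -> quorum address of each initial member.
        // The peer connects to these servers and takes part in electing leader(M).
        void init(Map<Long, InetSocketAddress> initialMembers);
    }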

B. Storing membership in znode and updating client (a tangential topic to this
design).
    Earlier, ZOOKEEPER-107 proposed using a URI-based approach for clients to
fetch the server list. I am not opposed to the URI-based approach; however, it
shouldn't be our only approach. A URI-based approach requires extra resources
(e.g., a fault-tolerant web service or shared storage for a file), and in
certain environments it may not be feasible to have such a resource.
Instead, can we use a mechanism similar to ZooKeeper watches for this?
Store the membership information in a znode and let the ZooKeeper servers inform
the clients of the changed configuration.
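
To make this concrete, here is a minimal sketch of a client keeping its server
list up to date with an ordinary data watch. The znode path /cluster/config and
the class name are assumptions, and the parsing/reconnection logic is omitted:

    // Sketch only: a client watching a hypothetical membership znode and
    // re-arming the watch each time the configuration changes.
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    class ConfigWatcher implements Watcher {
        private static final String CONFIG_ZNODE = "/cluster/config"; // assumed path
        private final ZooKeeper zk;

        ConfigWatcher(ZooKeeper zk) throws KeeperException, InterruptedException {
            this.zk = zk;
            readConfig();
        }

        private void readConfig() throws KeeperException, InterruptedException {
            // getData re-registers this watcher, so we hear about the next change too.
            byte[] data = zk.getData(CONFIG_ZNODE, this, null);
            // Parse 'data' into a server list and update the client's connections
            // (parsing and reconnection logic omitted in this sketch).
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                try {
                    readConfig();
                } catch (KeeperException | InterruptedException e) {
                    // A real client would retry or resync here.
                }
            }
        }
    }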

C. Locating most recent config on servers
    The URI-based approach will be more helpful for servers. For example, if
A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes back
online it won't be able to locate the most recent config on its own. In this
case, A can query the URI. A second approach is to have the leader periodically
send the membership (at least to nodes that are down).

D. "Send a message to all members(M’) instructing them to connect to leader(M)."
    leader(M) can potentially change after sending this message. Should
this be "Send a message to all members(M') to connect to members(M)"? See
point G. below.

Also, in step 7, why do you send leader(M') along with the
<activate-config> message?

E. "3. Submit a reconfig(M’) operation to leader(M)"
    What if leader(M) fails after receiving the request, but before
informing any other peer? How will the administrative client know whether to
retry or not? Should s retry if the leader fails, and should the client API
retry if s fails?

F. Regarding 3.a and 3.b.

The algorithm mentions:

"3. Otherwise:
   a. Designate the most up-to-date server in connected quorum of M' to be 
leader(M)
   b. Stop processing new operations, return failure for any further ops 
received"

Why do we need to change the leader if leader(M) is not in M'? How about we let
the leader perform the reconfig, and at the end of phase 3 (while processing
retire) it gives up leadership? This will cause a new leader election, and one
of the peers in M' will become the leader. Similarly, after the second phase,
members(M') will stop following leader(M) if leader(M) is not in M'. I think
this will be simpler.

G. Online VS offline

    In your "Online vs offline" section, you mentioned that the offline
strategy is preferred. I think we can do the reconfiguration online.
I picture the members of M' as modified OBSERVERS until the reconfiguration is
complete.

a. Let the new members(M') join the cluster as OBSERVERS. Based on the current
implementation, M' will essentially sync with the leader after becoming
OBSERVERS, and it will be easier to leverage the RequestProcessor pipeline for
reconfig.

b. Once a majority of M' has joined the leader, the leader executes phase 1 as
you described.

c. After phase 1, the leader changes the transaction logic slightly. Every
transaction from this point on (including reconfig-start) will be sent to both M
and M'. The leader then waits for quorum(M) and quorum(M') to ack the
transaction, so the members of M' are not pure OBSERVERS as we think of them
today. However, only a member of M can become leader until the reconfig
succeeds. Also, M' - (M n M') cannot serve client requests until the
reconfiguration is complete. By applying each transaction to both M and M' and
waiting for a quorum of each set to ack, we keep transferring state to both
configurations (see the sketch after point d below).

d. After receiving acks for phase 2, the leader sends a <switch-to-new-config>
transaction to M and M' (instead of sending retire just to M).

After receiving this message, members of M' will close connections with (and
reject connections from) members not in M'. Members that are supposed to leave
the cluster will shut down their QuorumPeer. If leader(M) is not in M', then a
new leader(M') will be elected.
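
To make the dual-quorum rule in (c) concrete, here is a minimal sketch of how a
leader could decide when a proposal is committable during reconfig. The class
and method names are hypothetical and not tied to the actual Leader/Proposal
code:

    // Sketch only: a proposal commits only when a quorum of M AND a quorum of M'
    // have acked it. Class and method names are hypothetical.
    import java.util.HashSet;
    import java.util.Set;

    class DualQuorumProposal {
        private final Set<Long> membersM;       // server ids in the old config M
        private final Set<Long> membersMPrime;  // server ids in the new config M'
        private final Set<Long> acks = new HashSet<Long>();

        DualQuorumProposal(Set<Long> membersM, Set<Long> membersMPrime) {
            this.membersM = membersM;
            this.membersMPrime = membersMPrime;
        }

        // Record an ack from server sid; returns true once the proposal is committable.
        synchronized boolean ack(long sid) {
            acks.add(sid);
            return hasQuorum(membersM) && hasQuorum(membersMPrime);
        }

        // Simple majority quorum over one configuration.
        private boolean hasQuorum(Set<Long> members) {
            int acked = 0;
            for (long sid : members) {
                if (acks.contains(sid)) {
                    acked++;
                }
            }
            return acked > members.size() / 2;
        }
    }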

Let me know what you think about this.

H. What happens if a majority of M' keep failing during reconfig?

M={A, B}, M'={A, C, D, E}. What if D, E fail?

With D and E down, only two of the four members of M' remain, so quorum(M')
cannot be formed and reconfig stalls permanently. While this is unlikely to
happen, I think ZooKeeper should handle it automatically. After a few retries,
we should abort the reconfig. Otherwise, we could disrupt a running cluster and
never be able to recover without manual intervention. If the quorum of M' is
lost after phase 1, this would mean sending an <abort-reconfig, version(M), M'>
message to M.

Of course, one can argue: what if a majority of M' fails after phase 3? So I am
not sure whether this is overkill, but I feel we should handle this case.

I. "Old reconfiguration request"

a. We should use ZAB
b. A side note: I think ZOOKEEPER-335 needs to be resolved for reconfig to
work. This bug causes logs to diverge if the ZK leader fails before sending
PROPOSALs to followers (see
http://www.mail-archive.com/[email protected]/msg00403.html).

Because of this bug, we could run into the following scenario:
- A peer B that was the leader when reconfig(M') was issued will have
  reconfig(M') in its transaction log.
- A peer C that became leader after B's failure can have reconfig(M'') in its
  log.
- Now, if B and C fail (say both reboot), then the outcome of the reconfig will
  depend on which node takes leadership. If B becomes the leader, the outcome is
  M'. If C becomes the leader, the outcome is M''.

J. Policies/ sanity checks
- Should we allow removal of a peer in a 3-node configuration? How about in a
2-node configuration?

Can you please add section numbers to the design paper? It will be easier to
refer to the text by numbers.

Thanks again for working on this. We are about to release the first version of
our product using ZooKeeper, which uses static configuration. Our next version
is heavily dependent on dynamic membership. We will be happy to chip in from our
side to help with the implementation.


> Allow dynamic changes to server cluster membership
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-107
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-107
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Henry Robinson
>         Attachments: SimpleAddition.rtf
>
>
> Currently cluster membership is statically defined, adding/removing hosts 
> to/from the server cluster dynamically needs to be supported.

