[
https://issues.apache.org/jira/browse/ZOOKEEPER-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685929#comment-16685929
]
Michael Han commented on ZOOKEEPER-3188:
----------------------------------------
A couple of comments on the high level design:
* Did we consider the compatibility requirement here? Will the new
configuration format be backward compatible? One concrete use case is if a
customer upgrades to new version with this multiple address per server
capability but wants to roll back without rewriting the config files to older
version.
* Did we evaluate the impact of this feature on existing server to server
mutual authentication and authorization feature (e.g. ZOOKEEPER-1045 for
Kerberos, ZOOKEEPER-236 for SSL), and also the impact on operations? For
example how to configure Kerberos principals and / or SSL certs per host given
multiple potential IP address and / or FQDN names per server?
* Could we provide more details on expected level of support with regards to
dynamic reconfiguration feature? Examples would be great - for example: we
would support adding, removing, or updating server address that's appertained
to a given server via dynamic reconfiguration, and also the expected behavior
in each case. For example, adding a new address to an existing ensemble member
should not cause any disconnect / reconnect but removing an in use address of a
server should cause a disconnect. Likely the dynamic reconfig API / CLI / doc
should be updated because of this.
> Improve resilience to network
> -----------------------------
>
> Key: ZOOKEEPER-3188
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3188
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Ted Dunning
> Priority: Major
>
> We propose to add network level resiliency to Zookeeper. The ideas that we
> have on the topic have been discussed on the mailing list and via a
> specification document that is located at
> [https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing]
> That document is copied to this issue which is being created to report the
> results of experimental implementations.
> h1. Zookeeper Network Resilience
> h2. Background
> Zookeeper is designed to help in building distributed systems. It provides a
> variety of operations for doing this and all of these operations have rather
> strict guarantees on semantics. Zookeeper itself is a distributed system made
> up of cluster containing a leader and a number of followers. The leader is
> designated in a process known as leader election in which a majority of all
> nodes in the cluster must agree on a leader. All subsequent operations are
> initiated by the leader and completed when a majority of nodes have confirmed
> the operation. Whenever an operation cannot be confirmed by a majority or
> whenever the leader goes missing for a time, a new leader election is
> conducted and normal operations proceed once a new leader is confirmed.
>
> The details of this are not important relative to this discussion. What is
> important is that the semantics of the operations conducted by a Zookeeper
> cluster and the semantics of how client processes communicate with the
> cluster depend only on the basic fact that messages sent over TCP connections
> will never appear out of order or missing. Central to the design of ZK is
> that a server to server network connection is used as long as it works to use
> it and a new connection is made when it appears that the old connection isn't
> working.
>
> As currently implemented, however, each member of a Zookeeper cluster can
> have only a single address as viewed from some other process. This means,
> absent network link bonding, that the loss of a single switch or a few
> network connections could completely stop the operations of a the Zookeeper
> cluster. It is the goal of this work to address this issue by allowing each
> server to listen on multiple network interfaces and to connect to other
> servers any of several addresses. The effect will be to allow servers to
> communicate over redundant network paths to improve resiliency to network
> failures without changing any core algorithms.
> h2. Proposed Change
> Interestingly, the correct operations of a Zookeeper cluster do not depend on
> _how_ a TCP connection was made. There is no reason at all not to advertise
> multiple addresses for members of a Zookeeper cluster.
>
> Connections between members of a Zookeeper cluster and between a client and a
> cluster member are established by referencing a configuration file (for
> cluster members) that specifies the address of all of the nodes in a cluster
> or by using a connection string containing possible addresses of Zookeeper
> cluster members. As soon as a connection is made, any desired authentication
> or encryption layers are added and the connection is handed off to the client
> communications layer or the server to server logic.
> This means that the only thing that actually needs to change to allow
> Zookeeper servers to be accessible on multiple networks is a change in the
> server configuration file format to allow the multiple addresses to be
> specified and to update the code that establishes the TCP connection to make
> use of these multiple addresses. No code changes are actually needed on the
> client since we can simply supply all possible server addresses. The client
> already has logic for selecting a server address at random and it doesn’t
> really matter if these addresses represent synonyms for the same server. All
> that matters is that _some_ connection to a server is established.
> h2. Configuration File Syntax Change
> The current Zookeeper syntax looks like this:
>
> tickTime=2000
> dataDir=/var/zookeeper
> clientPort=2181
> initLimit=5
> syncLimit=2
> server.1=zoo1:2888:3888
> server.2=zoo2:2888:3888
> server.3=zoo3:2888:3888
>
> The only lines that matter for this discussion are the last three. These
> specify the addresses for each of the servers that are part of the Zookeeper
> cluster as well as the port numbers used for the servers to talk to each
> other.
>
> I propose that the current syntax of these lines be augmented to allow a
> comma delimited list of addresses. For the current example, we might have
> this:
>
> server.1=zoo1-net1:2888:3888,zoo1-net2:2888:3888
> server.2=zoo2-net1:2888:3888,zoo2-net2:2888:3888
> server.3=zoo3-net1:2888:3888
>
> The first two servers are available via two different addresses, presumably
> on separate networks while the third server only has a single address. In
> practice, we would probably specify multiple addresses for all the servers,
> but that isn’t necessary for this proposal. There is work ongoing to improve
> and generalize the syntax for configuring Zookeeper clusters. As that work
> progresses, it will be necessary to figure out appropriate extensions to
> allow multiple addresses in the new and improved syntax. Nothing blocks the
> current proposal from being implemented in current form and adapted later for
> the new syntax.
>
> When a server tries to connect to another server, it would simply shuffle the
> available addresses at random and try to connect using successive addresses
> until a connection succeeds or all addresses have been tried.
>
> The complete syntax for server lines in a Zookeeper configuration file in BNF
> is
>
> <server-line> ::= "server."<integer> "=" <address-spec>
> <address-spec> ::= <server-address>[<client-address>]
> <server-address> ::= <address>:<port1>:<port2>[:<role>]
> <client-address> ::= [;[<client address>:]<client port>
>
> After this change, the syntax would look like this:
>
> <server-line> ::= "server."<integer> "=" <address-list>
> <address-list> ::= <address-spec>[,<address-list>]
> <address-spec> ::= <server-address>[<client-address>]
> <server-address> ::= <address>:<port1>:<port2>[:<role>]
> <client-address> ::= [;[<client address>:]<client port>
>
> h2. Dynamic Reconfiguration
> From version 3.5, Zookeeper has the ability to change the configuration of
> the cluster dynamically. This can involve the atomic change of any of the
> configuration parameters that are dynamically configurable. These include,
> notably for the purposes here, the addresses of the servers in the cluster.
> In order to simplify this, the configuration file post 3.5 is split into
> static configuration that cannot be changed on the fly and dynamic
> configuration that can be changed. When a new configuration is committed by
> the cluster, the dynamic configuration file is simply over-written and used.
>
> This means that extending the configuration file syntax to support multiple
> addresses is sufficient to support dynamic reconfiguration.
> h2. Client Connections
> When client connections are initially made, the client library is given a
> list of servers to contact. Servers are selected at random until a connection
> is made or the patience of the library implementers is exhausted. This
> requires no changes to support multiple network links per server except
> insofar that servers with more network connections will wind up with more
> client connections unless some action is taken. What will be done is to find
> the server with the most addresses and add duplicates of some address for
> every other server until every server is mentioned the same number of times.
> For cases where all servers have identical numbers of network connections,
> this will cause no change. It is expected that this will only arise in normal
> situations as a transient condition while a cluster is being reconfigured or
> if some servers are added to a cluster temporarily during maintenance
> operations.
>
> More interesting is the fact that when a connection is made to a Zookeeper
> cluster, the server responds with a list of the servers in the cluster. We
> will need to arrange that the list contains all available address in the
> Zookeeper cluster, but will not need to make any other changes. As mentioned
> before, some addresses might be duplicated to make sure that all servers have
> equal probability of being selected by a server.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)