[ https://issues.apache.org/jira/browse/CASSANDRA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peter Schuller updated CASSANDRA-3829:
--------------------------------------

    Issue Type: Improvement  (was: Bug)

> make seeds *only* be seeds, not special in gossip
> --------------------------------------------------
>
>                 Key: CASSANDRA-3829
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3829
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> First, a little bit of "framing" on how seeds work:
> The concept of "seed hosts" makes fundamental sense; you need to
> "seed" a new node with some information required in order to join a
> cluster. Seed hosts are the information Cassandra uses for this
> purpose.
> But seed hosts play a role even after the initial start-up of a new
> node in a ring. Specifically, seed hosts continue to be gossiped to
> separately by the Gossiper throughout the life of a node and the
> cluster.
> Generally, operators must be careful to ensure that all nodes in a
> cluster are appropriately configured to refer to an overlapping set of
> seed hosts. Strictly speaking this should not be necessary (see
> further down though), but is the general recommendation. An
> unfortunate side-effect of this is that whenever you are doing ring
> management, such as replacing nodes, removing nodes, etc., you have to
> keep in mind which nodes are seeds.
> For example, if you bring a new node into the cluster, doing
> everything right with token assignment and auto_bootstrap=true, but
> that node happens to be listed as a seed, it will just enter the
> cluster without bootstrap - causing inconsistent reads. This is
> dangerous.
> And worse - changing the notion of which nodes are seeds across a
> cluster requires a *rolling restart*. It can be argued that it should
> actually be okay for nodes other than the one being fiddled with to
> incorrectly treat the fiddled-with node as a seed node, but this fact
> is highly opaque to most users that are not intimately familiar with
> Cassandra internals.
> This adds additional complexity to operations, as it introduces a
> reason why you cannot view the ring as completely homogeneous, despite
> the fundamental idea of Cassandra that all nodes should be equal.
> Now, fast forward a bit to what we are doing over here to avoid this
> problem: We have a zookeeper-based system for keeping track of hosts
> in a cluster, which is used by our Cassandra client to discover nodes
> to talk to. This works well.
> In order to avoid the need to manually keep track of seeds, we wanted
> to make seeds be automatically discoverable in order to eliminate them
> as an operational concern. We have implemented a seed provider that
> does this for us, based on the data we keep in zookeeper.
> We could see essentially three ways of plugging this in:
> * (1) We could simply rely on not needing overlapping seeds and grab whatever
>   we have when a node starts.
> * (2) We could do something like continually treat all other nodes as seeds
>   by dynamically changing the seed list (involves some other changes like
>   having the Gossiper update its notion of seeds).
> * (3) We could completely eliminate the use of seeds *except* for the very
>   specific purpose of initial start-up of an unbootstrapped node, and keep
>   using a static (for the duration of the node's uptime) seed list.
> (3) was attractive because it felt like this was the original intent
> of seeds; that they be used for *seeding*, and not be constantly
> required during cluster operation once nodes are already joined.
> Now before I make the suggestion, let me explain how we are currently
> (though not yet in production) handling seeds and start-up.
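
To make option (3) concrete, here is a minimal Python sketch of how a node might pick a gossip target under that scheme. It is an illustration only and is not taken from the Cassandra code base: the function name `pick_gossip_target`, the `joined` flag, and the plain string host names are all hypothetical.

```python
import random
from typing import Optional, Sequence


def pick_gossip_target(
    joined: bool,
    seeds: Sequence[str],
    live_peers: Sequence[str],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one host to gossip to under option (3).

    Assumption (hypothetical model, not Cassandra's Gossiper):
    - a node that has not yet joined only knows its static seed list,
      so it gossips to a seed;
    - a joined node ignores the seed list entirely and gossips to a
      random peer it already knows about.
    """
    chooser = rng if rng is not None else random
    candidates = list(live_peers) if joined else list(seeds)
    if not candidates:
        # Nothing to gossip to; the caller decides whether to retry or abort.
        return None
    return chooser.choice(candidates)
```

With this shape, the seed list is consulted only before the node has joined; afterwards every peer is an equally likely target, which is the homogeneity the ticket argues for.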
> First, we have the following relevant cases to consider during a normal
> start-up:
> * (a) we are starting up a cluster for the very first time
> * (b) we are starting up a new clean node in order to join it to a
>   pre-existing cluster
> * (c) we are starting up a pre-existing already joined node in a pre-existing
>   cluster
> First, we proceeded on the assumption that we wanted to remove the use
> of seeds during regular gossip (other than on initial startup). This
> means that for the (c) case, we can *completely* ignore seeds. We
> never even have to discover the seed list, or if we do, we don't have
> to use them.
> This leaves (a) and (b). In both cases, the critical invariant we want
> to achieve is that we must have one or more *valid* seeds (valid means
> for (b) that the seed is in the cluster, and for (a) that it is one of
> the nodes that are part of the initial cluster setup).
> In the (c) case the problem is trivial - ignore seeds.
> In the (a) case, the algorithm is:
> * Register with zookeeper as a seed
> * Wait until we see *at least one* seed *other than ourselves* in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> In the (b) case, the algorithm is:
> * Wait until we see *at least one* seed in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> * Once fully up (around the time we listen to thrift), register as a seed in
>   zookeeper
> With the annoyance that you have to explicitly let Cassandra know that
> "I am starting a cluster for the very first time from scratch", and
> ignoring the problem of single node clusters (just to avoid
> complicating this post further), this guarantees in both cases that
> all nodes eventually see each other.
> In the (a) case, all nodes except one are guaranteed to see the "one"
> node. The "one" node is guaranteed to see one of the others. Thus -
> convergence.
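
The three start-up cases above can be sketched in Python as follows. This is a rough model under stated assumptions, not the actual seed provider: `SeedRegistry` is a hypothetical in-memory stand-in for the seed registrations kept in zookeeper, `join` is a hypothetical callback standing in for the regular start-up process, and the sketch assumes that in case (a) a node leaves itself out of its own seed list.

```python
from enum import Enum
from typing import Callable, List, Optional, Set


class StartupCase(Enum):
    NEW_CLUSTER = "a"  # starting a cluster for the very first time
    NEW_NODE = "b"     # clean node joining a pre-existing cluster
    REJOIN = "c"       # already joined node restarting


class SeedRegistry:
    """Hypothetical in-memory stand-in for seed registrations in zookeeper."""

    def __init__(self) -> None:
        self._seeds: Set[str] = set()

    def register(self, host: str) -> None:
        self._seeds.add(host)

    def seeds(self) -> Set[str]:
        return set(self._seeds)


def wait_for_seeds(registry: SeedRegistry, exclude: Optional[str] = None,
                   poll: Optional[Callable[[], None]] = None,
                   max_polls: int = 100) -> List[str]:
    """Poll until at least one seed (other than `exclude`) is registered."""
    for _ in range(max_polls):
        found = registry.seeds() - ({exclude} if exclude else set())
        if found:
            return sorted(found)
        if poll is not None:
            poll()  # e.g. sleep, or wait on a watch in a real system
    raise TimeoutError("no valid seed appeared; abort start-up and retry")


def start_node(host: str, case: StartupCase, registry: SeedRegistry,
               join: Callable[[List[str]], None],
               poll: Optional[Callable[[], None]] = None) -> List[str]:
    """Run the start-up sequence for one node and return the seeds it used."""
    if case is StartupCase.REJOIN:
        join([])  # case (c): ignore seeds completely
        return []
    if case is StartupCase.NEW_CLUSTER:
        registry.register(host)  # case (a): register first, then wait
        seeds = wait_for_seeds(registry, exclude=host, poll=poll)
        join(seeds)
        return seeds
    seeds = wait_for_seeds(registry, poll=poll)  # case (b): wait first
    join(seeds)
    registry.register(host)  # only once fully up
    return seeds
```

The ordering difference between (a) and (b) is the point of the sketch: a node in a brand-new cluster advertises itself before it has any seed, while a node joining an existing cluster advertises itself only after it has joined.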
> In the (b) case, it's simple - the new node is guaranteed to see one
> or more nodes that are in the cluster - convergence.
> The current status is that we have implemented the seed provider and
> the start-up sequence works. But in order to simplify Cassandra (and
> to avoid having to diverge), we propose that we take this to its
> conclusion and officially make seeds only relevant on start-up, by
> only ever gossiping to seeds when in pre-bootstrap mode during
> start-up.
> The perceived benefits are:
> * Simplicity for the operator. All nodes are equal once joined; you can
>   almost forget completely about seeds.
> * No rolling restarts or potential for footshooting a node into a cluster
>   without bootstrap because it happened to be a seed.
> * Production clusters will suddenly start to actually *test* the gossip
>   protocol without relying on seeds. How sure are we that it even works, and
>   that phi conviction is appropriate and RING_DELAY is appropriate, given that
>   practical clusters tend to gossip to a random seed (among very few)? This
>   change would make it so that we *always* gossip randomly to anyone in the
>   cluster, and there should be no danger that a cluster happens to hold
>   together because seeds are up - only to explode when they are not.
> * It eliminates non-trivial concerns with automatic seed discovery,
>   particularly when you want that seed discovery to be rack and DC aware. All
>   you care about is what was described above; if that seed happens to fail, we
>   simply fail to find the cluster and can abort start-up and it can be retried.
>   There is no need for "redundancy" in seeds.
> Thoughts? Are seeds important (by design) in some way other than for seeding?
> What do other people think about the implications of RING_DELAY etc?

--
This message is automatically generated by JIRA.