[ 
https://issues.apache.org/jira/browse/CASSANDRA-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Schuller updated CASSANDRA-3829:
--------------------------------------

    Issue Type: Improvement  (was: Bug)
    
> make seeds *only* be seeds, not special in gossip 
> --------------------------------------------------
>
>                 Key: CASSANDRA-3829
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3829
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> First, a little bit of "framing" on how seeds work:
> The concept of "seed hosts" makes fundamental sense; you need to
> "seed" a new node with some information required in order to join a
> cluster. Seed hosts is the information Cassandra uses for this
> purpose.
> But seed hosts play a role even after the initial start-up of a new
> node in a ring. Specifically, seed hosts continue to be gossiped to
> separately by the Gossiper throughout the life of a node and the
> cluster.
> Generally, operators must be careful to ensure that all nodes in a
> cluster are appropriately configured to refer to an overlapping set of
> seed hosts. Strictly speaking this should not be necessary (see
> further down though), but is the general recommendation. An
> unfortunate side-effect of this is that whenever you are doing ring
> management, such as replacing nodes, removing nodes, etc, you have to
> keep in mind which nodes are seeds.
> For example, if you bring a new node into the cluster, doing
> everything right with token assignment and auto_bootstrap=true, but
> the node happens to appear in its own seed list, it will just enter
> the cluster without bootstrapping - causing inconsistent reads. This
> is dangerous.
> And worse - changing the notion of which nodes are seeds across a
> cluster requires a *rolling restart*. It can be argued that it should
> actually be okay for nodes other than the one being fiddled with to
> incorrectly treat the fiddled-with node as a seed node, but this fact
> is highly opaque to most users that are not intimately familiar with
> Cassandra internals.
> This adds additional complexity to operations, as it introduces a
> reason why you cannot view the ring as completely homogeneous, despite
> the fundamental idea of Cassandra that all nodes should be equal.
> Now, fast forward a bit to what we are doing over here to avoid this
> problem: We have a zookeeper-based system for keeping track of hosts
> in a cluster, which is used by our Cassandra client to discover nodes
> to talk to. This works well.
> In order to avoid the need to manually keep track of seeds, we wanted
> to make seeds automatically discoverable, eliminating them as an
> operational concern. We have implemented a seed provider that does
> this for us, based on the data we keep in zookeeper.
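> A seed provider of that kind might look roughly like the sketch below
> (Python rather than Cassandra's actual Java SeedProvider interface; the
> class names and the "/cassandra/seeds" znode path are made up for
> illustration, and the zookeeper client is injected so the logic can be
> exercised without a live ensemble):

```python
# Hypothetical sketch of a zookeeper-backed seed provider. Not real
# Cassandra or ZooKeeper API; the client just needs get_children(path).

class ZkSeedProvider:
    def __init__(self, zk_client, seeds_path="/cassandra/seeds"):
        self.zk = zk_client            # anything with get_children(path)
        self.seeds_path = seeds_path   # znode whose children name the seeds

    def get_seeds(self):
        # Each child znode's name is interpreted as a seed address.
        return sorted(self.zk.get_children(self.seeds_path))


class FakeZk:
    """In-memory stand-in for the zookeeper client, for testing."""
    def __init__(self, children):
        self._children = list(children)

    def get_children(self, path):
        return list(self._children)
```

> The point of the injected client is that the discovery logic itself is
> trivial; all the interesting behavior lives in *when* the seed list is
> consulted, which is what the rest of this proposal is about.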
> We could see essentially three ways of plugging this in:
> * (1) We could simply rely on not needing overlapping seeds and grab whatever 
> we have when a node starts.
> * (2) We could do something like continually treat all other nodes as seeds 
> by dynamically changing the seed list (this involves some other changes, like 
> having the Gossiper update its notion of seeds).
> * (3) We could completely eliminate the use of seeds *except* for the very 
> specific purpose of initial start-up of an unbootstrapped node, and keep 
> using a static (for the duration of the node's uptime) seed list.
> (3) was attractive because it felt like this was the original intent
> of seeds; that they be used for *seeding*, and not be constantly
> required during cluster operation once nodes are already joined.
> Now before I make the suggestion, let me explain how we are currently
> (though not yet in production) handling seeds and start-up.
> First, we have the following relevant cases to consider during a normal 
> start-up:
> * (a) we are starting up a cluster for the very first time
> * (b) we are starting up a new clean node in order to join it to a 
> pre-existing cluster
> * (c) we are starting up a pre-existing already joined node in a pre-existing 
> cluster
> First, we proceeded on the assumption that we wanted to remove the use
> of seeds during regular gossip (other than on initial startup). This
> means that for the (c) case, we can *completely* ignore seeds. We
> never even have to discover the seed list - or if we do, we don't have
> to use it.
> This leaves (a) and (b). In both cases, the critical invariant we want
> to achieve is that we must have one or more *valid* seeds (valid means
> for (b) that the seed is in the cluster, and for (a) that it is one of
> the nodes that are part of the initial cluster setup).
> In the (c) case the problem is trivial - ignore seeds.
> In the (a) case, the algorithm is:
> * Register with zookeeper as a seed
> * Wait until we see *at least one* seed *other than ourselves* in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> In the (b) case, the algorithm is:
> * Wait until we see *at least one* seed in zookeeper
> * Continue regular start-up process with the seed list (with 1 or more seeds)
> * Once fully up (around the time we listen to thrift), register as a seed in 
> zookeeper
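> The (a) and (b) start-up algorithms above can be sketched as one
> function (a rough Python illustration, not our actual implementation;
> the registry interface with register()/seeds() is a hypothetical
> stand-in for the zookeeper layer, and the timeout is made up):

```python
# Sketch of the seed-discovery part of start-up for cases (a) and (b).
# registry is any object with register(addr) and seeds() -> iterable.

import time

def startup_seeds(registry, self_addr, new_cluster,
                  poll_interval=0.01, timeout=5.0):
    """Return the seed list to hand to gossip on start-up."""
    deadline = time.monotonic() + timeout
    if new_cluster:
        # Case (a): register as a seed first, then wait until at least
        # one seed *other than ourselves* is visible.
        registry.register(self_addr)
        while time.monotonic() < deadline:
            others = sorted(s for s in registry.seeds() if s != self_addr)
            if others:
                return others
            time.sleep(poll_interval)
        raise TimeoutError("no other seed appeared")
    else:
        # Case (b): wait until at least one seed is visible. Registering
        # ourselves as a seed happens later, once the node is fully up.
        while time.monotonic() < deadline:
            seeds = sorted(registry.seeds())
            if seeds:
                return seeds
            time.sleep(poll_interval)
        raise TimeoutError("no seed appeared")


class MemRegistry:
    """In-memory stand-in for the zookeeper-backed registry."""
    def __init__(self):
        self._seeds = set()

    def register(self, addr):
        self._seeds.add(addr)

    def seeds(self):
        return set(self._seeds)
```

> Case (c) never calls this at all, which is the whole point: once a
> node has joined, seeds are irrelevant to it.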
> With the annoyance that you have to explicitly let Cassandra know that
> "I am starting a cluster for the very first time from scratch", and
> ignoring the problem of single node clusters (just to avoid
> complicating this post further), this guarantees in both cases that
> all nodes eventually see each other.
> In the (a) case, all nodes except one are guaranteed to see the "one"
> node. The "one" node is guaranteed to see one of the others. Thus -
> convergence.
> In the (b) case, it's simple - the new node is guaranteed to see one
> or more nodes that are in the cluster - convergence.
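> The convergence argument in both cases boils down to: the graph of
> "who initially sees whom" is connected, and gossip then spreads full
> membership along its edges. A toy check of the case (a) shape (this is
> an illustration of the connectivity argument, not Cassandra's gossip
> protocol; exchanges are modelled as symmetric view merges):

```python
# Toy model of the case (a) argument: nodes B..D each know only the
# "one" seed A, and A knows at least one other node. Repeated pairwise
# view merges then reach full membership everywhere.

def reachable_membership(initial_knowledge):
    """Merge each node's view with its known peers' views to a fixed point."""
    views = {n: set(k) | {n} for n, k in initial_knowledge.items()}
    while True:
        before = {n: set(v) for n, v in views.items()}
        for n in list(views):
            for peer in list(views[n] - {n}):
                # A gossip exchange is symmetric: both sides merge views.
                merged = views[n] | views[peer]
                views[n] = merged
                views[peer] = set(merged)
        if views == before:
            return views

# Case (a) shape: everyone knows seed A; A knows one of the others.
initial = {"A": {"B"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
```

> If the initial graph were *not* connected (say, two nodes that saw no
> valid seed), no amount of gossip would merge the partitions - which is
> exactly why the start-up invariant is "at least one valid seed".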
> The current status is that we have implemented the seed provider and
> the start-up sequence works. But in order to simplify Cassandra (and
> to avoid having to diverge), we propose that we take this to its
> conclusion and officially make seeds only relevant on start-up, by
> only ever gossiping to seeds when in pre-bootstrap mode during
> start-up.
> The perceived benefits are:
> * Simplicity for the operator. All nodes are equal once joined; you can 
> almost forget completely about seeds.
> * No rolling restarts or potential for footshooting a node into a cluster 
> without bootstrap because it happened to be a seed.
> * Production clusters will suddenly start to actually *test* the gossip 
> protocol without relying on seeds. How sure are we that it even works, and 
> that phi conviction is appropriate and RING_DELAY is appropriate, given that 
> practical clusters tend to gossip to a random (among very few) seeds? This 
> change would make it so that we *always* gossip randomly to anyone in the 
> cluster, and there should be no danger that a cluster happens to hold 
> together because seeds are up - only to explode when they are not.
> * It eliminates non-trivial concerns with automatic seed discovery, 
> particularly when you want that seed discovery to be rack and DC aware. All 
> you care about is what was described above; if that seed happens to fail, we 
> simply fail to find the cluster and can abort start-up, and it can be retried. 
> There is no need for "redundancy" in seeds.
> Thoughts? Are seeds important (by design) in some way other than for seeding? 
> What do other people think about the implications of RING_DELAY etc?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
