[ 
https://issues.apache.org/jira/browse/CASSANDRA-2434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112905#comment-13112905
 ] 

paul cannon commented on CASSANDRA-2434:
----------------------------------------


Ok, prospective approach to totally safe range movements:

Operational rules:
* Cassandra will not allow two range motion operations (move, bootstrap, decom) 
at the same time on the same node.
* When a range motion operation is already pending, User should refrain from 
starting another range motion operation (if either motion operation overlaps 
the arc-of-effect of the other) until the gossip info about the first change 
has propagated to all affected nodes. (This is more simply approximated by the 
"two minute rule".)
* Every point in the tokenspace has the same number of natural endpoints, and 
they're ordered the same from the perspective of all nodes (is this an ok 
assumption?).
* It is User's responsibility to make sure that the right streaming source 
nodes are available. If they're not, the range motion operation may fail.

Procedure:
* For any motion involving range _R_, there will be a stream from endpoint 
_EP_source_ to endpoint _EP_dest_. Given the same information about what range 
motion operations are pending (_TokenMetadata_) and the range _R_, there is a 
bijection between the _EP_source_ endpoints and the _EP_dest_ endpoints, and 
every node in the ring computes the same one.
* Procedure to determine _EP_source_ from _EP_dest_:
** Let _REP_current_ be the existing (ordered) list of natural endpoints for 
_R_.
** Let _TM_future_ be a clone of the current _TokenMetadata_, but with all 
ongoing bootstraps, moves, and decoms resolved and completed.
** Let _REP_future_ be the list of (ordered) natural endpoints for _R_ 
according to _TM_future_.
** Let _EPL_entering_ be the list of endpoints in _REP_future_ which are not in 
_REP_current_ (preserving their order in _REP_future_).
** Let _EPL_leaving_ be the list of endpoints in _REP_current_ which are not in 
_REP_future_ (preserving their order in _REP_current_).
** _EPL_entering_ and _EPL_leaving_ are of the same length.
** Let _Pos_ be the position/index of _EP_dest_ in _EPL_entering_.
** Let _EP_source_ be the endpoint at position _Pos_ in _EPL_leaving_.
* Intuitively, this is the same as the rule expressed earlier in this ticket 
(stream from the node you'll replace), but also handles other ongoing range 
movements in the same token arc.
* These rules can be pretty trivially inverted to determine _EP_dest_ from 
_EP_source_.
* When any node gets gossip about a range motion occurring within its token 
arc-of-effect, it calculates (or recalculates) the streams in which it should 
be involved. Any ongoing streams which are no longer necessary are canceled, 
and any newly necessary streams are instigated.
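
To make the procedure above concrete, here is a minimal Python sketch (not 
Cassandra code). It assumes the two ordered endpoint lists for _R_ have already 
been computed and are passed in as plain lists. The names `source_for`, 
`dest_for`, and `reconcile_streams` are hypothetical and do not correspond to 
anything in the codebase. The last helper only illustrates the cancel/instigate 
step as a set difference over (source, dest) pairs.

```python
from typing import List, Sequence, Set, Tuple

Stream = Tuple[str, str]  # (EP_source, EP_dest); a hypothetical representation


def entering_and_leaving(
    rep_current: Sequence[str], rep_future: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Return (EPL_entering, EPL_leaving), each keeping its replica order."""
    entering = [ep for ep in rep_future if ep not in rep_current]
    leaving = [ep for ep in rep_current if ep not in rep_future]
    if len(entering) != len(leaving):
        # The procedure states that the two lists have the same length.
        raise ValueError("EPL_entering and EPL_leaving differ in length")
    return entering, leaving


def source_for(ep_dest: str, rep_current: Sequence[str],
               rep_future: Sequence[str]) -> str:
    """EP_source is the leaving endpoint at EP_dest's position among entering."""
    entering, leaving = entering_and_leaving(rep_current, rep_future)
    return leaving[entering.index(ep_dest)]


def dest_for(ep_source: str, rep_current: Sequence[str],
             rep_future: Sequence[str]) -> str:
    """Inverse lookup: the entering endpoint at EP_source's position."""
    entering, leaving = entering_and_leaving(rep_current, rep_future)
    return entering[leaving.index(ep_source)]


def reconcile_streams(ongoing: Set[Stream],
                      needed: Set[Stream]) -> Tuple[Set[Stream], Set[Stream]]:
    """Return (streams to cancel, streams to start) after a recalculation."""
    return ongoing - needed, needed - ongoing


# Assumed example: nodes A, E, F exist and B, C, D are bootstrapping.
# For a range between A and B, simple clockwise placement with RF=3 gives:
rep_current = ["E", "F", "A"]
rep_future = ["B", "C", "D"]

print(source_for("B", rep_current, rep_future))
print(dest_for("A", rep_current, rep_future))
```

Because both lookups go through the same pair of lists, any two nodes holding 
the same _TokenMetadata_ arrive at the same pairing, which is what lets source 
and destination agree without extra coordination.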

I tried to construct a ruleset without that last rearrange-ongoing-streams 
rule, but it ended up with a pretty complicated set of extra restrictions, and 
a more complicated set of procedures than this.

This set of rules might look complicated, but I think it should be fairly 
straightforward to implement, and may even end up simpler overall than our 
current code.

Note that this procedure even maintains the consistency guarantee in cases like:

* In an RF=3 cluster with nodes A, E, and F, bootstrap B, C, and D in quick 
succession (E streams to B, F streams to C, A streams to D)
* In an RF=3 cluster with nodes A, C, and E, bootstrap B, D, and F, and 
decommission A, C, and E, all in quick succession (A streams to B, C streams to 
D, E streams to F)
* In an RF=3 cluster with nodes A, B, C, D, and E, decommission B and C in 
quick succession (B streams to D, C streams to E)
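
As a rough check of the first case, the sketch below walks a toy ring in 
Python. Everything in it is an assumption made for illustration: integer tokens 
1 through 6 for nodes A through F, and a simple placement rule (the first node 
at or after a point, then the next RF-1 nodes clockwise). The helper names 
`natural_endpoints` and `streams_for_point` are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Set, Tuple


def natural_endpoints(ring: Dict[str, int], point: int, rf: int) -> List[str]:
    """First node at or after `point`, then the next rf-1 nodes clockwise."""
    nodes = sorted(ring, key=ring.get)
    tokens = [ring[n] for n in nodes]
    start = bisect_left(tokens, point) % len(nodes)  # wrap past the last token
    return [nodes[(start + i) % len(nodes)] for i in range(min(rf, len(nodes)))]


def streams_for_point(current: Dict[str, int], future: Dict[str, int],
                      point: int, rf: int) -> List[Tuple[str, str]]:
    """Pair leaving and entering endpoints by position for one ring point."""
    rep_current = natural_endpoints(current, point, rf)
    rep_future = natural_endpoints(future, point, rf)
    entering = [ep for ep in rep_future if ep not in rep_current]
    leaving = [ep for ep in rep_current if ep not in rep_future]
    if len(entering) != len(leaving):
        raise ValueError("EPL_entering and EPL_leaving differ in length")
    return list(zip(leaving, entering))  # (EP_source, EP_dest) pairs


# Toy ring: A, E, F are present; B, C, D are bootstrapping.
current = {"A": 1, "E": 5, "F": 6}
future = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}

# Current tokens are a subset of future tokens, so sampling one point per
# future token covers every distinct range in this toy ring.
all_streams: Set[Tuple[str, str]] = set()
for point in future.values():
    all_streams.update(streams_for_point(current, future, point, rf=3))

print(sorted(all_streams))
```

Under these assumptions every range pairs endpoints the same way, giving the 
streams E to B, F to C, and A to D.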

> range movements can violate consistency
> ---------------------------------------
>
>                 Key: CASSANDRA-2434
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2434
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Peter Schuller
>            Assignee: paul cannon
>             Fix For: 1.0.1
>
>         Attachments: 2434-3.patch.txt, 2434-testery.patch.txt
>
>
> My reading (a while ago) of the code indicates that there is no logic 
> involved during bootstrapping that avoids consistency level violations. If I 
> recall correctly it just grabs neighbors that are currently up.
> There are at least two issues I have with this behavior:
> * If I have a cluster where I have applications relying on QUORUM with RF=3, 
> and bootstrapping completes based on only one node, I have just violated the 
> supposedly guaranteed consistency semantics of the cluster.
> * Nodes can flap up and down at any time, so even if a human takes care to 
> look at which nodes are up and thinks about it carefully before 
> bootstrapping, there's no guarantee.
> A complication is that whether this is an issue depends on the use case (if 
> everything you do is at CL.ONE, it's fine); and even in a cluster which is 
> otherwise used for QUORUM operations, you may wish to accept 
> less-than-quorum nodes during bootstrap in various emergency situations.
> A potential easy fix is to have bootstrap take an argument which is the 
> number of hosts to bootstrap from, or to assume QUORUM if none is given.
> (A related concern is bootstrapping across data centers. You may *want* to 
> bootstrap to a local node and then do a repair to avoid sending loads of data 
> across DCs while still achieving consistency. Or even if you don't care 
> about the consistency issues, I don't think there is currently a way to 
> bootstrap from local nodes only.)
> Thoughts?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
