[ 
https://issues.apache.org/jira/browse/CASSANDRA-16139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203517#comment-17203517
 ] 

Benedict Elliott Smith edited comment on CASSANDRA-16139 at 9/28/20, 8:48 PM:
------------------------------------------------------------------------------

The token ring is problematic for us implementors; wrap around is a minor 
headache, but much more important is how on earth you safely perform multiple 
overlapping range movements - it's basically impossible, as you don't know 
which will necessarily complete, and so do not know who will end up owning the 
replica. Overlapping range movements even as a concept is bad, and unique to 
the token ring conceptualisation.

Bounded ranges of ownership - whether as tokens or keys - that nodes are 
explicitly assigned tois the correct approach.  Defining the membership of each 
key/token range explicitly prevents these complicated scenarios - a node 
joining can only possibly replicate these keys, and nothing any other node is 
doing will modify that.  These can be defined per keyspace to permit greater 
replication flexibility, and importantly safe modifications to replication 
factor without new machinery.

Automatic healing is something that I would hope to be built atop these 
features, but could in principle be built atop a token ring, just super 
painfully and probably with many bugs (like all of our range movements up to 
today).

Note that this work necessarily overlaps with safe schema changes, the two are 
intertwined.  I'll leave other thoughts on the topic for another day - some 
time in the next 2-3 months I will published my white paper on the topic.


was (Author: benedict):
The token ring is problematic for us implementors; wrap around is a minor 
headache, but much more important is how on earth you safely perform multiple 
overlapping range movements - it's basically impossible, as you don't know 
which will necessarily complete, and so do not know who will end up owning the 
replica. Overlapping range movements even as a concept is bad, and unique to 
the token ring conceptualisation.

Bounded ranges of ownership - whether as tokens or keys - that nodes are 
explicitly assigned tois the correct approach.  Defining the membership of each 
key/token range explicitly prevents these complicated scenarios - a node 
joining can only possibly replicate these keys, and nothing any other node is 
doing will modify that.  These can be defined per keyspace to permit greater 
replication flexibility, and importantly safe modifications to replication 
factor without new machinery.

Automatic healing is something that I would hope to be built atop these 
features, and could in principle be built atop a token ring, just super 
painfully and probably with many bugs (like all of our range movements up to 
today).

Note that this work necessarily overlaps with safe schema changes, the two are 
intertwined.  I'll leave other thoughts on the topic for another day - some 
time in the next 2-3 months I will published my white paper on the topic.

> Safe Ring Membership Protocol
> -----------------------------
>
>                 Key: CASSANDRA-16139
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16139
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>            Priority: Normal
>
> This ticket presents a practical protocol for performing safe ring membership 
> updates in Cassandra. This protocol will enable reliable concurrent ring 
> membership updates.
> The proposed protocol is composed of the following macro-steps:
> *PROPOSE:* An initiator node wanting to make updates to the current ring 
> structure (such as joining, leaving the ring or changing token assignments) 
> must propose the change to the other members of the ring (cohort).
> *ACCEPT:* Upon receiving a proposal the other ring members determine if the 
> change is compatible with their local version of the ring, and if so, they 
> promise to accept the change proposed by the initiator. The ring members do 
> not accept proposals if they had already promised to honor another proposal, 
> to avoid conflicting ring membership updates.
> *COMMIT:* Once the initiator receives acceptances from all the nodes in the 
> cohort, it commits the proposal by broadcasting the proposed ring delta via 
> gossip. Upon receiving these changes, the other members of the cohort apply 
> the delta to their local version of the ring and broadcast their new computed 
> version via gossip. The initiator concludes the ring membership update 
> operation by checking that all nodes agree on the new proposed version.
> *ABORT:* A proposal not accepted by all members of the cohort may be 
> automatically aborted by the initiator or manually via a command line tool.
> For simplicity the protocol above requires that all nodes are up during the 
> proposal step, but it should be possible to optimize it to require only a 
> quorum of nodes up to perform ring changes.
> A python pseudo-code of the protocol is available 
> [here|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-safe_ring_membership-py].
> With the abstraction above it becomes very simple to perform ring change 
> operations:
>  * 
> [bootstrap|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-bootstrap-py]
>  * 
> [replace|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-replace-py]
>  * 
> [move|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-move-py]
>  * [remove 
> node|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_node-py]
>  * [remove 
> token|https://gist.github.com/pauloricardomg/1930c8cf645aa63387a57bb57f79a0f7#file-remove_token-py]
> h4. Token Ring Data Structure
> The token ring data structure can be seen as a [Delta State Replicated Data 
> Type|https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs]
>  (Delta CRDT) containing the state of all (virtual) nodes in the cluster 
> where updates to the ring are operations on this CRDT.
> Each member publishes its latest local accepted state (delta state) via 
> gossip and the union of all delta states comprise the global ring state. The 
> delta state must be commutative and idempotent to ensure all nodes will 
> eventually reach the same global state no matter the order received.
> The content-addressed fingerprint of the global ring state uniquely 
> identifies the ring version and provides a simple way to verify agreement 
> between nodes in the cluster. Any change to the ring membership must be 
> agreed using the described protocol, ensuring that both conditions are met:
>  * All nodes have the same current view of the cluster before the update 
> (verified via the ring version fingerprint).
>  * All nodes have agreed to make the exact same update and not accept any 
> other update before the current proposed update is committed or aborted.
> h4. Ring Convergence Time
> Assuming there are no network partitions, the ring membership convergence 
> time will be dominated by the commit step since that is performed via gossip 
> broadcast.
> The gossip broadcast is performed by sending the ring delta to the seed 
> nodes, since other nodes will contact seed nodes with a #seeds / #nodes 
> probability. This will define an upper bound for the maximum time it takes to 
> propagate a ring update that was accepted by all members of the ring.
> On a cluster with 10% of the nodes as seeds, it’s guaranteed that a ring 
> membership update operation will not take much longer than 10 seconds with 
> the current gossip interval of 1 second. A simple way to reduce this upper 
> bound is to make cohort acceptors gossip more frequently with seeds when 
> there are pending ring membership updates.
> h4. Failure Modes
>  - Concurrent Proposals:
>  -- Concurrent initiators will not gather sufficient promises from all cohort 
> members, and thus, will not succeed. Unsuccessful proposals may be cleaned up 
> manually or automatically via the ABORT step.
>  - Single proposal: Initiator Failure
>  -- Initiator is partitioned before sending proposal
>  --- Initiator will not gather sufficient promises from all cohort members, 
> and thus, the ring membership update will not succeed.
>  -- Initiator is partitioned after proposal is accepted by a subset of cohort 
> members:
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  -- Initiator is partitioned after proposal is accepted by all cohort members
>  --- No other proposal will succeed before the partition is healed or it’s 
> manually ABORTED.
>  Initiator is partitioned after proposal is committed
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip. No other proposal will be accepted before all nodes commit the 
> current proposal.
>  - Single proposal: Cohort member Failure
>  -- Cohort member is partitioned before receiving proposal
>  --- Initiator will not gather sufficient promises from all cohort members. 
> No other proposal will succeed before the partition is healed because they 
> will not be able to reach this cohort member.
>  -- Cohort member is partitioned after accepting proposal
>  --- If all other members of the cohort accepted the proposal, the initiator 
> will COMMIT the proposal. No other proposal will succeed before the partition 
> is healed because they will not be able to reach this cohort member.
>  -- Cohort member is partitioned after committing proposal
>  --- If a single member of the cohort committed the proposal, all other 
> members will eventually commit it since they will receive the update via 
> gossip from the initiator. No other proposal will be accepted before all 
> nodes commit the current proposal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to