[ 
https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-9667:
-----------------------------------
    Description: 
Currently, there is advice to users to "wait two minutes between adding new 
nodes" in order for new node tokens, et al, to propagate. Further, as there's 
no coordination amongst joining node wrt token selection, new nodes can end up 
selecting ranges that overlap with other joining nodes. This causes a lot of 
duplicate streaming from the existing source nodes as they shovel out the 
bootstrap data for those new nodes.

This ticket proposes creating a mechanism that allows strongly consistent 
membership and ownership changes in cassandra such that changes are performed 
in a linearizable and safe manner. The basic idea is to use LWT operations over 
a global system table, and leverage the linearizability of LWT for ensuring the 
safety of cluster membership/ownership state changes. This work is inspired by 
Riak's claimant module.

The existing workflows for node join, decommission, remove, replace, and range 
move (there may be others I'm not thinking of) will need to be modified to 
participate in this scheme, as well as changes to nodetool to enable them.

Note: we distinguish between membership and ownership in the following ways: 
for membership we mean "a host in this cluster and it's state". For ownership, 
we mean "what tokens (or ranges) does each node own"; these nodes must already 
be a member to be assigned tokens.

A rough draft sketch of how the 'add new node' workflow might look like is: new 
nodes would no longer create tokens themselves, but instead contact a member of 
a Paxos cohort (via a seed). The cohort member will generate the tokens and 
execute a LWT transaction, ensuring a linearizable change to the 
membership/ownership state. The updated state will then be disseminated via the 
existing gossip.

As for joining specifically, I think we could support two modes: auto-mode and 
manual-mode. Auto-mode is for adding a single new node per LWT operation, and 
would require no operator intervention (much like today). In manual-mode, 
however, multiple new nodes could (somehow) signal their their intent to join 
to the cluster, but will wait until an operator executes a nodetool command 
that will trigger the token generation and LWT operation for all pending new 
nodes. This will allow us better range partitioning and will make the bootstrap 
streaming more efficient as we won't have overlapping range requests.

  was:
Currently, there is advice to users to "wait two minutes between adding new 
nodes" in order for new node tokens, et al, to propagate. Further, as there's 
no coordination amongst joining node wrt token selection, new nodes can end up 
selecting ranges that overlap with other joining nodes. This causes a lot of 
duplicate streaming from the existing source nodes as they shovel out the 
bootstrap data for those new nodes.

This ticket proposes creating a mechanism that allows strongly consistent 
membership and ownership changes in cassandra such that changes are performed 
in a linearizable and safe manner. The basic idea is to use LWT operations over 
a global system table, and leverage the linearizability of LWT for ensuring the 
safety of cluster membership/ownership state changes. This will necessitate 
changes to the existing workflows for node join, decommission, remove, replace, 
and range move (there may be others I'm not thinking of), as well as changes to 
nodetool.

Note: we distinguish between membership and ownership in the following ways: 
for membership we mean "a host in this cluster and it's state". For ownership, 
we mean "what tokens (or ranges) does each node own"; these nodes must already 
be a member to be assigned tokens.

A rough draft sketch of how the 'add new node' workflow might look like is: new 
nodes would no longer create tokens themselves, but instead contact a member of 
a Paxos cohort (via a seed). The cohort member will generate the tokens and 
execute a LWT transaction, ensuring a linearizable change to the 
membership/ownership state. The updated state will then be disseminated via the 
existing gossip.

As for joining specifically, I think we could support two modes: auto-mode and 
manual-mode. Auto-mode is for adding a single new node per LWT operation, and 
would require no operator intervention (much like today). In manual-mode, 
however, multiple new nodes could (somehow) signal their their intent to join 
to the cluster, but will wait until an operator executes a nodetool command 
that will trigger the token generation and LWT operation for all pending new 
nodes. This will allow us better range partitioning and will make the bootstrap 
streaming more efficient as we won't have overlapping range requests.


> strongly consistent membership and ownership
> --------------------------------------------
>
>                 Key: CASSANDRA-9667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>              Labels: LWT, membership, ownership
>             Fix For: 3.x
>
>
> Currently, there is advice to users to "wait two minutes between adding new 
> nodes" in order for new node tokens, et al, to propagate. Further, as there's 
> no coordination amongst joining node wrt token selection, new nodes can end 
> up selecting ranges that overlap with other joining nodes. This causes a lot 
> of duplicate streaming from the existing source nodes as they shovel out the 
> bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent 
> membership and ownership changes in cassandra such that changes are performed 
> in a linearizable and safe manner. The basic idea is to use LWT operations 
> over a global system table, and leverage the linearizability of LWT for 
> ensuring the safety of cluster membership/ownership state changes. This work 
> is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and 
> range move (there may be others I'm not thinking of) will need to be modified 
> to participate in this scheme, as well as changes to nodetool to enable them.
> Note: we distinguish between membership and ownership in the following ways: 
> for membership we mean "a host in this cluster and it's state". For 
> ownership, we mean "what tokens (or ranges) does each node own"; these nodes 
> must already be a member to be assigned tokens.
> A rough draft sketch of how the 'add new node' workflow might look like is: 
> new nodes would no longer create tokens themselves, but instead contact a 
> member of a Paxos cohort (via a seed). The cohort member will generate the 
> tokens and execute a LWT transaction, ensuring a linearizable change to the 
> membership/ownership state. The updated state will then be disseminated via 
> the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode 
> and manual-mode. Auto-mode is for adding a single new node per LWT operation, 
> and would require no operator intervention (much like today). In manual-mode, 
> however, multiple new nodes could (somehow) signal their their intent to join 
> to the cluster, but will wait until an operator executes a nodetool command 
> that will trigger the token generation and LWT operation for all pending new 
> nodes. This will allow us better range partitioning and will make the 
> bootstrap streaming more efficient as we won't have overlapping range 
> requests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to