[ 
https://issues.apache.org/jira/browse/CASSANDRA-9667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727993#comment-14727993
 ] 

Jason Brown commented on CASSANDRA-9667:
----------------------------------------

bq. Do you think this approach can be extended to allow more consistent schema 
changes like changing RF or altering a table?

I think that depends more on the underlying Paxos/LWT/consensus algorithm 
(which may or may not be the existing LWT; I'm still considering and debating) 
than on the membership changes themselves. But I would hope the consensus 
algorithm work here would apply to other efforts as well!

bq. Also, for your manual join, what kind of information can we give to users 
to allow them to evaluate the pending transaction?

Initially I was only thinking of showing the minimal info: IP addr (or other 
host info) and possibly any token info (e.g. if the node is replacing another, 
or the operator is explicitly setting tokens). That being said, we could display 
any amount of info we choose - the initial set was only bounded by my 
imagination :). However, I really do like your idea about being able to 
determine the amount of data to be streamed to the new node - something like 
that should be a reasonably simple calculation and certainly helpful for 
operators.
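
As a rough illustration of that calculation, here is a minimal Python sketch. 
It is not Cassandra code: the function names, the (token, node) ring layout, 
the per-node load map, and the ring_size parameter are all hypothetical, and 
it assumes RF=1 with data spread uniformly across each node's primary ranges.

```python
from bisect import bisect_left

def span(start, end, ring_size):
    """Width of the range (start, end] on the ring, handling wrap-around."""
    return (end - start) % ring_size or ring_size

def owned_span(ring, node, ring_size):
    """Total width of the primary ranges owned by `node`.

    `ring` is a list of (token, node) pairs sorted by token (assumed layout).
    """
    return sum(span(ring[i - 1][0], token, ring_size)
               for i, (token, owner) in enumerate(ring) if owner == node)

def estimate_streamed_bytes(ring, load_bytes, new_tokens, ring_size=2**64):
    """Estimate the bytes a joining node would receive for `new_tokens`.

    Assumes RF=1 and uniformly distributed data; real replication and skew
    would change the numbers.
    """
    existing = [t for t, _ in ring]
    merged = sorted(set(existing) | set(new_tokens))
    estimate = 0.0
    for t in new_tokens:
        # The current owner of t is the node holding the next existing token.
        owner = ring[bisect_left(existing, t) % len(ring)][1]
        # The new node takes (pred, t], where pred is the nearest lower token
        # among existing and new tokens, so new ranges never double count.
        pred = merged[merged.index(t) - 1]
        taken = span(pred, t, ring_size)
        estimate += load_bytes[owner] * taken / owned_span(ring, owner, ring_size)
    return estimate
```

With a toy ring of size 200, two nodes at tokens 0 and 100, and loads of 1000 
and 2000 bytes, a new node taking token 50 would be estimated to stream half 
of the second node's data.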

Note: I'm still ironing out the protocol and transition points, but let me post 
the updates in a short while.



> strongly consistent membership and ownership
> --------------------------------------------
>
>                 Key: CASSANDRA-9667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9667
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>              Labels: LWT, membership, ownership
>             Fix For: 3.x
>
>
> Currently, users are advised to "wait two minutes between adding new 
> nodes" in order for new node tokens, et al, to propagate. Further, as there's 
> no coordination amongst joining nodes wrt token selection, new nodes can end 
> up selecting ranges that overlap with other joining nodes. This causes a lot 
> of duplicate streaming from the existing source nodes as they shovel out the 
> bootstrap data for those new nodes.
> This ticket proposes creating a mechanism that allows strongly consistent 
> membership and ownership changes in Cassandra such that changes are performed 
> in a linearizable and safe manner. The basic idea is to use LWT operations 
> over a global system table, and leverage the linearizability of LWT for 
> ensuring the safety of cluster membership/ownership state changes. This work 
> is inspired by Riak's claimant module.
> The existing workflows for node join, decommission, remove, replace, and 
> range move (there may be others I'm not thinking of) will need to be modified 
> to participate in this scheme, as well as changes to nodetool to enable them.
> Note: we distinguish between membership and ownership in the following ways: 
> for membership we mean "a host in this cluster and its state". For 
> ownership, we mean "what tokens (or ranges) does each node own"; these nodes 
> must already be a member to be assigned tokens.
> A rough draft sketch of what the 'add new node' workflow might look like is: 
> new nodes would no longer create tokens themselves, but instead contact a 
> member of a Paxos cohort (via a seed). The cohort member will generate the 
> tokens and execute an LWT operation, ensuring a linearizable change to the 
> membership/ownership state. The updated state will then be disseminated via 
> the existing gossip.
> As for joining specifically, I think we could support two modes: auto-mode 
> and manual-mode. Auto-mode is for adding a single new node per LWT operation, 
> and would require no operator intervention (much like today). In manual-mode, 
> however, multiple new nodes could (somehow) signal their intent to join 
> to the cluster, but will wait until an operator executes a nodetool command 
> that will trigger the token generation and LWT operation for all pending new 
> nodes. This will allow us better range partitioning and will make the 
> bootstrap streaming more efficient as we won't have overlapping range 
> requests.
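
The join workflow in the description above can be sketched as a 
compare-and-set loop. This is only an illustration under stated assumptions: 
MembershipTable is an in-memory stand-in for the global system table, its 
cas() method plays the role of a single LWT operation, and pick_token() is a 
hypothetical allocation rule (bisect the widest gap), not Cassandra's actual 
token selection.

```python
import threading

RING_SIZE = 2**64  # assumed width of the token space

class MembershipTable:
    """In-memory stand-in for the global system table (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.owners = {}  # token -> node

    def read(self):
        with self._lock:
            return self.version, dict(self.owners)

    def cas(self, expected_version, new_owners):
        """Commit only if nobody else committed since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False
            self.owners = dict(new_owners)
            self.version += 1
            return True

def pick_token(owners):
    """Hypothetical allocation: bisect the widest gap between tokens."""
    if not owners:
        return 0
    tokens = sorted(owners)
    best_start, best_width = tokens[-1], -1
    for i, t in enumerate(tokens):
        prev = tokens[i - 1]
        width = (t - prev) % RING_SIZE or RING_SIZE
        if width > best_width:
            best_start, best_width = prev, width
    return (best_start + best_width // 2) % RING_SIZE

def join(table, nodes, max_retries=10):
    """Add `nodes` in one linearizable change, retrying on a lost race.

    Returns the version of the table after the commit.
    """
    for _ in range(max_retries):
        version, owners = table.read()
        for node in nodes:
            owners[pick_token(owners)] = node
        if table.cas(version, owners):
            return version + 1
    raise RuntimeError("could not commit membership change")
```

In this sketch, auto-mode corresponds to calling join(table, [node]) once per 
new node, while manual-mode corresponds to an operator-triggered 
join(table, pending_nodes) that assigns tokens to every pending node in a 
single commit, so their ranges cannot overlap.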



