[ 
https://issues.apache.org/jira/browse/CASSANDRA-16364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847465#comment-17847465
 ] 

Jon Haddad commented on CASSANDRA-16364:
----------------------------------------

Let me add some additional information.  This is based partly on what I 
learned while fixing the problem and partly on the accounts of others.

It appears that a token collision happened in a cluster _without_ using 
auto_bootstrap: false.  Two nodes existed in the ring owning conflicting 
tokens.  It appears that the cluster ran for months with a split brain, 
causing writes and reads to go to different sets of nodes depending on the 
coordinator.  The operator is fairly certain they waited several minutes 
between adding nodes but admits it's possible that a bug in the automation 
resulted in them joining at close to the same time.  During a two-month 
period, some data was deleted, the tombstones were GC'ed, and eventually 
read repair caused the original data to be resurrected.  

This is a pretty serious flaw in the design of deterministic token allocation.  
It's unsafe by design.  Adding jitter to the tokens by default will prevent 
data loss.  We can make a change to behavior in an existing release if it 
addresses a fundamental flaw in the design, especially when that flaw puts a 
cluster in a wildly unpredictable state.
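
To illustrate the point about jitter, here is a minimal sketch.  It is not 
Cassandra's actual allocator: widest_gap, deterministic_token and 
jittered_token are hypothetical names, and the "midpoint of the widest gap" 
rule is only a stand-in for a deterministic allocation strategy.  It shows 
that two nodes looking at the same ring compute the same token, while a 
per-node random offset makes a collision unlikely.

```python
import random


def widest_gap(ring):
    """Return (left, right) bounding the widest gap between adjacent tokens.

    Wrap-around at the ends of the token range is ignored for brevity.
    """
    tokens = sorted(ring)
    return max(zip(tokens, tokens[1:]), key=lambda pair: pair[1] - pair[0])


def deterministic_token(ring):
    """Midpoint of the widest gap: the same ring view always yields the same token."""
    left, right = widest_gap(ring)
    return left + (right - left) // 2


def jittered_token(ring, rng):
    """Same midpoint, shifted by a random offset of at most a quarter of the gap."""
    left, right = widest_gap(ring)
    max_jitter = (right - left) // 4
    return deterministic_token(ring) + rng.randint(-max_jitter, max_jitter)


ring = [-4000, 0, 5000]
# Two nodes that join at the same time see the same ring and pick the same token.
node_a = deterministic_token(ring)
node_b = deterministic_token(ring)
print(node_a == node_b)
# With jitter, each node draws its own offset, so a collision becomes unlikely.
print(jittered_token(ring, random.Random()), jittered_token(ring, random.Random()))
```

Jitter lowers the probability of a collision but does not rule it out, so 
it complements detection rather than replacing it.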

> Joining nodes simultaneously with auto_bootstrap:false can cause token 
> collision
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16364
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16364
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Membership
>            Reporter: Paulo Motta
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> While raising a 6-node ccm cluster to test 4.0-beta4, 2 nodes chose the same 
> tokens using the default {{allocate_tokens_for_local_rf}}. However, they both 
> bootstrapped successfully with colliding tokens.
> We were familiar with this issue from CASSANDRA-13701 and CASSANDRA-16079, 
> and the workaround to fix this is to avoid parallel bootstrap when using 
> {{allocate_tokens_for_local_rf}}.
> However, since this is the default behavior, we should try to detect and 
> prevent this situation when possible, since it can break users relying on 
> parallel bootstrap behavior.
> I think we could prevent this as follows:
> 1. announce intent to bootstrap via gossip (i.e. add node on gossip without 
> token information)
> 2. wait for gossip to settle for a longer period (i.e. ring delay)
> 3. allocate tokens (if multiple bootstrap attempts are detected, tie break 
> via node-id)
> 4. broadcast tokens and move on with bootstrap
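
The four steps proposed in the description could be sketched roughly as 
below.  This is one possible reading of "tie break via node-id", not an 
implementation from the ticket: once gossip has settled, every joiner sees 
the same set of pending node ids and allocates in sorted node-id order, so 
each one accounts for the tokens of the joiners ahead of it.  The names 
pick_token and allocate_with_tiebreak are hypothetical, and the midpoint rule 
is a stand-in for the real allocator.

```python
def pick_token(tokens):
    """Stand-in allocator: midpoint of the first widest gap (no wrap-around)."""
    ordered = sorted(tokens)
    left, right = max(zip(ordered, ordered[1:]), key=lambda p: p[1] - p[0])
    return left + (right - left) // 2


def allocate_with_tiebreak(ring, pending_node_ids):
    """Assign one token per pending node, processing nodes in node-id order.

    `ring` holds the tokens already owned; `pending_node_ids` is the set of
    joiners announced via gossip (steps 1 and 2).  Because every node sorts
    the same set the same way, all nodes derive the same assignment.
    """
    tokens = set(ring)
    assigned = {}
    for node_id in sorted(pending_node_ids):
        token = pick_token(tokens)      # step 3: allocate
        assigned[node_id] = token
        tokens.add(token)               # later joiners see this token as taken
    return assigned                     # step 4: each node broadcasts its entry


# Two nodes announce at the same time; they end up with distinct tokens.
print(allocate_with_tiebreak([0, 1000], {"node-b", "node-a"}))
```

The sketch assumes that all joiners see the same pending set after the 
settle period; if gossip has not converged, a collision is still possible 
and would need to be detected separately.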


