[ https://issues.apache.org/jira/browse/CASSANDRA-16364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847465#comment-17847465 ]
Jon Haddad commented on CASSANDRA-16364:
----------------------------------------

Let me add some additional information. This is partially based on what I've learned to fix the problem and partially from the accounts of others.

It appears that a token collision happened to a cluster _without_ using auto_bootstrap: false. Two nodes existed in the ring owning conflicting tokens. It appears that the cluster was running for months with a split brain, causing writes and reads to go to different sets of nodes depending on the coordinator. The operator is fairly certain they waited for several minutes between adding nodes, but admits it's possible that a bug in the automation resulted in them joining at close to the same time.

During a two-month period, some data was deleted, the tombstones got GC'ed, and eventually read repair caused the original data to be resurrected.

This is a pretty serious flaw in the design of deterministic token allocation. It's unsafe by design. Adding jitter to the tokens by default will prevent data loss. We can make a change to behavior in an existing release if it addresses a fundamental flaw in the design, especially when that flaw puts a cluster in a wildly unpredictable state.

> Joining nodes simultaneously with auto_bootstrap:false can cause token collision
> --------------------------------------------------------------------------------
>
> Key: CASSANDRA-16364
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16364
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Membership
> Reporter: Paulo Motta
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> While raising a 6-node ccm cluster to test 4.0-beta4, 2 nodes chose the same
> tokens using the default {{allocate_tokens_for_local_rf}}. However, both
> succeeded in bootstrapping with colliding tokens.
> We were familiar with this issue from CASSANDRA-13701 and CASSANDRA-16079,
> and the workaround is to avoid parallel bootstrap when using
> {{allocate_tokens_for_local_rf}}.
> However, since this is the default behavior, we should try to detect and
> prevent this situation when possible, since it can break users relying on
> parallel bootstrap behavior.
> I think we could prevent this as follows:
> 1. announce intent to bootstrap via gossip (i.e. add node on gossip without
> token information)
> 2. wait for gossip to settle for a longer period (i.e. ring delay)
> 3. allocate tokens (if multiple bootstrap attempts are detected, tie break
> via node-id)
> 4. broadcast tokens and move on with bootstrap
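
The tie-break in step 3 of the quoted proposal can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Cassandra's allocator: `largest_gap_midpoint` is a hypothetical stand-in for any deterministic allocation, `allocate_in_order` is a hypothetical helper, and the ring is modelled as a plain list of integers with no wraparound. It assumes gossip has already settled, so every joining node sees the same set of pending node ids.

```python
def largest_gap_midpoint(ring):
    """Toy deterministic allocator (hypothetical): bisect the widest gap
    between adjacent tokens. Ignores ring wraparound for simplicity."""
    ring = sorted(ring)
    lo, hi = max(zip(ring, ring[1:]), key=lambda pair: pair[1] - pair[0])
    return [(lo + hi) // 2]


def allocate_in_order(pending_ids, ring_tokens, allocate):
    """Allocate tokens for concurrently joining nodes one at a time,
    ordered by node id, so each allocation sees the earlier results."""
    ring = set(ring_tokens)
    assigned = {}
    for node_id in sorted(pending_ids):  # tie-break via node id
        tokens = allocate(sorted(ring))
        assigned[node_id] = tokens
        ring.update(tokens)  # later nodes must avoid these tokens
    return assigned


if __name__ == "__main__":
    view = [0, 100, 400]

    # Without coordination, two nodes with the same ring view collide.
    print(largest_gap_midpoint(view), largest_gap_midpoint(view))

    # With the node-id ordering, the second node sees the first node's token.
    print(allocate_in_order(["node-b", "node-a"], view, largest_gap_midpoint))
```

Because the ordering depends only on the node ids, every node that sees the same pending set would compute the same assignment independently and take only its own entry. The sketch does not cover the case where gossip has not settled and the nodes disagree about who is pending.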