[jira] [Commented] (CASSANDRA-19644) deterministic token allocation combined with slow gossip propagation can lead to data loss

Jon Haddad (Jira) Fri, 17 May 2024 15:40:11 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847459#comment-17847459
 ]


Jon Haddad commented on CASSANDRA-19644:
----------------------------------------

Ah.  I didn't see CASSANDRA-16364.  My preferred solution is different from 
what's in there, so I'll drop my comment on that one and close this out.

> deterministic token allocation combined with slow gossip propagation can lead 
> to data loss
> ------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19644
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19644
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jon Haddad
>            Priority: Normal
>
> I've seen several cases now where starting nodes within a short time 
> window (about a minute) while using the default token allocation for RF leads 
> to token conflicts.  Unfortunately, this can easily go undetected in medium 
> to large clusters.
> When this happens, different nodes in the cluster will have different 
> understandings of the topology of the cluster.  I've seen this go unnoticed 
> in a production environment for several months, leading to data loss, data 
> resurrection, and other odd behavior.
> We should apply some randomness to the tokens to ensure that even in the case 
> of multiple nodes starting at once, it's still unlikely that they will ever 
> have a conflict.  Adding a random() offset between -2^8 and 2^8 to each token 
> value makes it statistically very unlikely that we'll ever have a collision 
> while also preserving the balance of token distribution in the ring.  In the 
> case of 2 nodes starting at the same time, the operator will have a weird 
> token distribution instead of data loss.
>  
> {noformat}
> INFO  [GossipStage:1] 2024-05-17 22:16:12,333 StorageService.java:3006 - 
> Nodes /10.0.2.134:7000 and cassandra1/10.0.1.61:7000 have the same token 
> -1938510198161598815. /10.0.2.134:7000 is the new owner
> INFO  [GossipStage:1] 2024-05-17 22:16:12,333 StorageService.java:3006 - 
> Nodes /10.0.2.134:7000 and cassandra1/10.0.1.61:7000 have the same token 
> -3478858378222500629. /10.0.2.134:7000 is the new owner
> INFO  [GossipStage:1] 2024-05-17 22:16:12,333 StorageService.java:3006 - 
> Nodes /10.0.2.134:7000 and cassandra1/10.0.1.61:7000 have the same token 
> 3562748272064835315. /10.0.2.134:7000 is the new owner
> INFO  [GossipStage:1] 2024-05-17 22:16:12,333 StorageService.java:3006 - 
> Nodes /10.0.2.134:7000 and cassandra1/10.0.1.61:7000 have the same token 
> 8085185010613503278. /10.0.2.134:7000 is the new owner{noformat}
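
The jitter proposed in the description can be sketched as follows. This is
a minimal Python illustration, not Cassandra's allocator: the names
jitter_token, jitter_tokens, and JITTER are hypothetical, the token range
assumes Murmur3Partitioner (signed 64-bit), and the input tokens are the four
values from the log excerpt above.

```python
import random

# Assumed ring bounds: Murmur3Partitioner tokens are signed 64-bit values.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1

# Offset bound taken from the description (2^8); hypothetical constant name.
JITTER = 2**8


def jitter_token(token, rng):
    """Shift a token by a uniform random offset in [-JITTER, JITTER].

    The result is clamped so it never leaves the ring's token range.
    """
    shifted = token + rng.randint(-JITTER, JITTER)
    return max(MIN_TOKEN, min(MAX_TOKEN, shifted))


def jitter_tokens(tokens, rng):
    """Apply jitter_token to every token a node was allocated."""
    return [jitter_token(t, rng) for t in tokens]


# The four tokens that both nodes claimed in the log excerpt.
allocated = [
    -1938510198161598815,
    -3478858378222500629,
    3562748272064835315,
    8085185010613503278,
]

# Two nodes start at the same time and compute the same deterministic
# allocation, but each applies its own independent random offsets.
node_a = jitter_tokens(allocated, random.Random())
node_b = jitter_tokens(allocated, random.Random())

collisions = set(node_a) & set(node_b)
print(f"colliding tokens after jitter: {len(collisions)}")
```

Under the uniform sampling used in this sketch there are 2 * 2^8 + 1 = 513
possible offsets per token, so two nodes that start from the same token land
on the same jittered value with probability 1/513. A wider bound would lower
that further, and the offsets are negligible next to the 2^64 ring, so the
balance of the allocation is effectively unchanged.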



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
