[ https://issues.apache.org/jira/browse/CASSANDRA-17991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624582#comment-17624582 ]

Jaydeepkumar Chovatia commented on CASSANDRA-17991:
---------------------------------------------------

[Brandon Williams|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=brandon.williams] 
I've just updated the reproducible steps. Consistency does not matter here; the 
only reason for consistency _ALL_ is to ensure that all three nodes have the 
tombstone.

 

The point I am trying to highlight is that a legitimate (non-deleted) record is 
present in multiple SSTables (SSTable1 and SSTable2), and if we then delete the 
record, the tombstone is written to a third SSTable (SSTable3). Now, if we 
bootstrap/decommission any node, streaming is not atomic across SSTable1, 
SSTable2, and SSTable3, so it is possible that SSTable1 and SSTable3 have been 
streamed out while SSTable2 has not yet. Under this condition, if the newly 
bootstrapping node performs compaction, the tombstone marker (which by then is 
older than gc_grace_seconds) is gone. When SSTable2 is streamed later, the 
record becomes live again, which is unexpected behavior.
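
To make that concrete: as far as I understand, compaction will only keep a 
purgeable tombstone if some other local SSTable still contains data for that 
partition, and since SSTable2 is missing on the joining node, nothing holds the 
compaction back. A rough way to observe this on the joining node (the data file 
paths below are placeholders, not taken from an actual run) is to dump the 
streamed SSTables before and after such a compaction:
{code:java}
# After SSTable1 and SSTable3 have arrived, but SSTable2 has not
# (file names are placeholders for the real -Data.db files):
sstabledump /path/to/data/ks1/t1-<table-id>/<sstable1>-Data.db   # live row for key 1
sstabledump /path/to/data/ks1/t1-<table-id>/<sstable3>-Data.db   # tombstone, already older than gc_grace_seconds

# If these two SSTables are compacted together at this point, both the row and
# the tombstone disappear, because no other local SSTable covers this partition.

# Once SSTable2 is finally streamed in, it is the only copy of key 1 left:
sstabledump /path/to/data/ks1/t1-<table-id>/<sstable2>-Data.db   # the "deleted" row is live again
{code}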

> Possible data inconsistency during bootstrap/decommission
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-17991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17991
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Jaydeepkumar Chovatia
>            Priority: Normal
>
> I am facing a corner case in which deleted data resurrects.
> tl;dr: This could be because when we stream all the SSTables for a given 
> token range to the new owner, they are not sent atomically, so the new 
> owner can run compaction on the partially received SSTables, which might 
> remove the tombstones prematurely.
>  
> Here are the reproducible steps:
> +*Prerequisite*+
>  # Three-node Cassandra cluster: n1, n2, and n3 (C* version 3.0.27)
>  # 
> {code:java}
> CREATE KEYSPACE KS1 WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
> 
> CREATE TABLE KS1.T1 (
>     key int,
>     c1 int,
>     c2 int,
>     c3 int,
>     PRIMARY KEY (key)
> ) WITH compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
>                      'max_threshold': '32', 'min_threshold': '4'}
>   AND gc_grace_seconds = 864000;
> {code}
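> 
> (Not part of the original steps, but as a sanity check that all three nodes 
> are replicas for the key used below, something like the following can be run 
> on any node:)
> {code:java}
> nodetool getendpoints ks1 t1 1
> # should print the addresses of n1, n2 and n3, since RF=3 in dc1
> # (unquoted identifiers are stored lower-cased, hence ks1/t1)
> {code}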
>  
> *Reproducible Steps*
>  * Day1: Insert a new record followed by _nodetool flush_ on n1, n2, and 
> n3. A new SSTable (_SSTable1_) will be created.
> {code:java}
> INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
>  * Day2: Insert the same record again followed by _nodetool flush_ on n1, 
> n2, and n3. A new SSTable (_SSTable2_) will be created.
> {code:java}
>  INSERT INTO KS1.T1 (key, c1, c2, c3) values (1, 10, 20, 30){code}
>  * Day3: Here is the data layout on SSTables on n1, n2, and n3 
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> {code}
>  * Day4: Delete the record followed by _nodetool flush_ on n1, n2, and n3
> {code:java}
> CONSISTENCY ALL;
> DELETE FROM KS1.T1 WHERE key = 1;
> {code}
>  * Day5: Here is the data layout on SSTables on n1, n2, and n3 
> {code:java}
> SSTable1:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable2:
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "liveness_info" : { "tstamp" : "2022-10-15T00:00:00.000001Z" },
>       "cells" : [
>         { "name" : "c1", "value" : 10 },
>         { "name" : "c2", "value" : 20 },
>         { "name" : "c3", "value" : 30 }
>       ]
>     }
>   ]
> }
> .....
> SSTable3 (Tombstone):
> {
>   "partition" : {
>     "key" : [ "1" ],
>     "position" : 900
>   },
>   "rows" : [
>     {
>       "type" : "row",
>       "position" : 10,
>       "deletion_info" : { "marked_deleted" : "2022-10-19T00:00:00.000001Z",
>                           "local_delete_time" : "2022-10-19T00:00:00.000001Z" },
>       "cells" : [ ]
>     }
>   ]
> }
> {code}
>  * Day20: Nothing happens for more than 10 days, i.e. the tombstone is now 
> older than gc_grace_seconds (864000 seconds = 10 days). Let's say the data 
> layout on SSTables on n1, n2, and n3 is the same as on Day5.
>  * Day20: A new node (n4) joins the ring and becomes responsible for key 
> "1". Let's say it streams the data from n3. Node _n3_ is supposed to stream 
> out SSTable1, SSTable2, and SSTable3, but as per the streaming algorithm this 
> does not happen atomically. Consider a scenario in which _n4_ has received 
> SSTable1 and SSTable3, but not yet SSTable2, and _n4_ compacts SSTable1 and 
> SSTable3. In this case, _n4_ purges key "1", so at this point there is no 
> trace of key "1" on _n4_. Some time later SSTable2 is streamed in, bringing 
> key "1" with it, with no tombstone left to shadow it.
>  * Day20: _n4_ finishes bootstrapping and becomes NORMAL
> {code:java}
> Query on n4:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // 1 | 10 | 20 | 30 <-- A record is returned
> Query on n1:
> $> CONSISTENCY LOCAL_ONE; SELECT * FROM KS1.T1 WHERE key = 1;
> // <empty> //no output{code}
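> 
> (One more way, not in the original report, to see the divergence at this 
> point: _nodetool getsstables_ shows which local SSTables still contain the 
> key; the file names below are placeholders.)
> {code:java}
> # On n1 (and n2, n3): the key is still covered by the tombstone-bearing SSTable
> nodetool getsstables ks1 t1 1
> # .../ks1/t1-<id>/<sstable1>-Data.db
> # .../ks1/t1-<id>/<sstable2>-Data.db
> # .../ks1/t1-<id>/<sstable3>-Data.db   <-- tombstone
> 
> # On n4: only the late-arriving SSTable is left, with nothing shadowing it
> nodetool getsstables ks1 t1 1
> # .../ks1/t1-<id>/<sstable2>-Data.db
> {code}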
>  
> Does this make sense?
> *Possible Solution*
>  * One possible solution might be to not purge tombstones while there are 
> token range movements (bootstrap/decommission) in progress in the ring (a 
> rough check for that condition is sketched below)
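> 
> (For reference, a rough operational way to tell whether such a range movement 
> is in progress, which is roughly the condition the suggestion above would key 
> on:)
> {code:java}
> # A bootstrapping or decommissioning node shows up as UJ / UL here:
> nodetool status ks1
> 
> # Ongoing bootstrap/decommission streaming sessions are visible here:
> nodetool netstats
> {code}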


