[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102676#comment-17102676 ]
Sylvain Lebresne commented on CASSANDRA-15789:
----------------------------------------------

I had a quick look at those commits, and I agree with the fix in `LegacyLayout`. I have no strong objection to the two other parts, but I wanted to raise two points:
- Regarding the elimination of duplicates on iterators coming from `LegacyLayout`: the patch currently merges the duplicates rather silently. What if we have another bug in `LegacyLayout` for which row duplication is only one symptom, but that also loses data? Are we sure we won't regret not failing on what would be an unknown bug?
- Regarding the duplicate check on all reads: I "think" this could have a measurable performance impact for some workloads. That isn't by itself a reason not to add the check, but since it affects all reads and will go into "stable" versions, do we want to run a few benchmarks to quantify the cost? Or provide a way to disable the check?

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes,
> collection overwrites, paging, and read repair can cause 3.X nodes to split
> individual rows into several rows with identical clustering. This happens due
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
>
> To reproduce, set up a 2-node mixed major version cluster with the following
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tlb (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1.
> Using either node as the coordinator, delete the row with ck=2 using
> timestamp 1:
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes.
> 4. Using the 2.2 node as the coordinator, force read repair by querying the
> table with page size = 2:
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tlb ;
>
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read for the second page ends up generating the
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>   ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>          [2:v:!-2:!, deletedAt=1, localDeletion=1588588821]
>          [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>   [2:v:63:false:1@3,])
> {code}
> Which on the 3.0 side gets incorrectly deserialized as:
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='00000001', modifications=[
>   [distributed_test_keyspace.tbl] key=1
>     partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
>     columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 |
>       del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and
> finish {{collectionName}} values don't match as a wrapping fragment of a
> legacy row deletion that's being interrupted by a collection deletion - see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
> Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where
> // only one of the bound has a collectionName. That happens if we have a big
> // tombstone A (spanning one or multiple rows) and a collection tombstone B.
> // In that case, RangeTombstoneList will split this into 3 RTs: the first one
> // from the beginning of A to the beginning of B, then B, then a third one
> // from the end of B to the end of A. To make this simpler, if we detect that
> // case we transform the 1st and 3rd tombstone so they don't end in the middle
> // of a row (which is still correct).
> {code}
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters
> such a tombstone in the middle of an existing row - having seen {{v[c]=d}}
> first - and mistakenly starts a new row while still in the middle of the
> existing one (see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
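[Editorial aside] The merge-vs-fail trade-off raised in the comment at the top can be made concrete with a toy model. This is not Cassandra code: the `(clustering, cells)` row shape, the function `merge_duplicate_rows`, and the `strict` flag are all hypothetical, sketched only to show what "merging the duplicates rather silently" versus failing fast would look like.

```python
# Toy model (not Cassandra's actual API) of duplicate-row elimination.
# A "row" is (clustering, {cell_name: (value, timestamp)}); rows arrive in
# clustering order, so duplicates with identical clustering are adjacent.

def merge_duplicate_rows(rows, strict=False):
    """Collapse adjacent rows sharing a clustering key.

    strict=False mirrors the patch's silent merge; strict=True is the
    fail-fast alternative suggested in the comment above.
    """
    merged = []
    for clustering, cells in rows:
        if merged and merged[-1][0] == clustering:
            if strict:
                raise ValueError("duplicate clustering %r" % (clustering,))
            prev_cells = merged[-1][1]
            for name, (value, ts) in cells.items():
                # Reconcile cell by cell, keeping the highest timestamp.
                if name not in prev_cells or prev_cells[name][1] < ts:
                    prev_cells[name] = (value, ts)
        else:
            merged.append((clustering, dict(cells)))
    return merged

# The split partition from step 6 above: ck=2 appears twice.
rows = [
    (1, {'e': ('f', 3)}),
    (2, {'g': ('h', 3)}),
    (2, {'k': ('l', 5)}),
    (3, {'i': ('j', 3)}),
]
# Silent merge collapses the two ck=2 rows into one carrying both cells.
print(merge_duplicate_rows(rows))
```

The sketch also illustrates Sylvain's concern: the merge path quietly produces a plausible-looking row, whereas the strict path would surface whatever upstream bug produced the duplicate.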
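[Editorial aside] The three-way split described in the quoted {{RangeTombstoneList}} comment can likewise be sketched in miniature. Bounds are modeled as plain numbers and `split_range` is an invented name; the real code tracks bound kinds and collection names rather than numeric positions.

```python
# Miniature model of the RangeTombstoneList behaviour quoted above: a wide
# tombstone A with a collection tombstone B nested inside it is emitted as up
# to three ranges: A-before-B, then B, then A-after-B.

def split_range(a, b):
    """a and b are (start, end, timestamp) tuples, with b nested inside a."""
    a_start, a_end, a_ts = a
    b_start, b_end, b_ts = b
    assert a_start <= b_start and b_end <= a_end, "B must lie within A"
    pieces = []
    if a_start < b_start:
        pieces.append((a_start, b_start, a_ts))  # head of A, before B
    pieces.append((b_start, b_end, b_ts))        # B itself
    if b_end < a_end:
        pieces.append((b_end, a_end, a_ts))      # tail of A, after B
    return pieces

# When B opens on the same bound as A (as for ck=2 in the mutation above,
# where the collection tombstone starts at the head of the row), only two
# pieces come out, matching the two ck=2 ranges in the dump.
print(split_range((0, 10, 1), (0, 4, 2)))
```

The resulting fragments are exactly the ranges where one bound carries a {{collectionName}} and the other doesn't, which is the case {{LegacyLayout}} must recognize as a wrapping fragment and where {{addRowTombstone()}} instead starts a new row.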