[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321446#comment-15321446 ]

Branimir Lambov commented on CASSANDRA-11349:
---------------------------------------------

Does this really solve the problem with the test you mentioned? Putting the 
tombstones through {{RangeTombstoneList}} will normalize them, but they may not 
be issued in the right position; i.e., the RTL solution only works if the data 
contains nothing but tombstones. For example, the 
{{\["b:d:\!","b:\!",1463656272792,"t",1463731877\]}} part from the test above 
gets issued before a potential token that may sort before {{b:d:!}}.

The test needs to be extended to include live tokens, for example by adding 
each of
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'b', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'd', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'e', 'a', 1)
{code}
after the deletions.

The RTL solution will break (in different ways) for at least two of the above. 
It also has performance implications that I am not really happy to take. A 
proper solution is to either fully replicate what RTL does in the tombstone 
tracker (which may not be worth it so late in the lifespan of 2.1 and 2.2), 
or make the tombstone tracker wrap around an RTL (which may be inefficient and 
is still somewhat tricky).
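The ordering hazard can be sketched with a toy model (plain Python, all names invented; this is not Cassandra code): Merkle digests hash a partition's atoms in emission order, so flushing normalized range tombstones ahead of a live cell that sorts before the tombstone's start bound changes the digest even though the data is identical.

```python
# Toy model of the ordering problem (hypothetical names; not Cassandra code).
# A partition's atoms (live cells and range-tombstone markers) must reach the
# digest in clustering order. If normalized range tombstones are emitted as a
# batch ahead of later live cells, a cell that sorts before a tombstone's
# start bound ends up on the wrong side of it.

import hashlib

def digest(atoms):
    """Hash atoms in the order they are emitted (order matters for Merkle trees)."""
    h = hashlib.md5()
    for a in atoms:
        h.update(repr(a).encode())
    return h.hexdigest()

# ("CELL", clustering, value) and ("RT", start, end, timestamp), toy encoding.
live = ("CELL", ("b", "b"), 1)
rt   = ("RT", ("b", "d"), ("b",), 1463656272792)

# Node A emits in proper clustering order: the live cell at ("b","b")
# precedes the tombstone opening at ("b","d").
ordered = [live, rt]

# Node B normalizes its tombstones first and flushes them eagerly.
eager = [rt, live]

print(digest(ordered) == digest(eager))  # False: same data, different digests
```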

If (as Fabien's testing seems to imply) doing the digest update at 
serialization solves the majority of the differences and repair pain, I would 
prefer to stop there.
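A minimal sketch of that idea (hypothetical Python with invented names; a real {{RangeTombstoneList}} also splits partially overlapping ranges, which this toy skips): if the digest is computed over a normalized tombstone set, a node holding two generations of the same deletion and a node that has already compacted them hash identically.

```python
# Hypothetical sketch (names invented) of why hashing a *normalized* tombstone
# set makes digests compaction-independent: overlapping tombstones for the
# same range collapse to one entry keeping the newest timestamp, which is
# exactly what compaction would produce.

import hashlib

def normalize(tombstones):
    """Keep only the newest deletion timestamp per (start, end) range."""
    newest = {}
    for start, end, ts in tombstones:
        key = (start, end)
        newest[key] = max(newest.get(key, ts), ts)
    return sorted((s, e, t) for (s, e), t in newest.items())

def digest(tombstones):
    h = hashlib.md5()
    for t in normalize(tombstones):
        h.update(repr(t).encode())
    return h.hexdigest()

# Node 1: two generations of the same deletion, not yet compacted.
node1 = [("b", "b", 100), ("b", "b", 200)]
# Node 2: compaction already merged them into the newer one.
node2 = [("b", "b", 200)]

print(digest(node1) == digest(node2))  # True: the Merkle trees now agree
```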

> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 
> 11349-2.1-v4.patch, 11349-2.1.patch, 11349-2.2-v4.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the Merkle 
> tree computation.
> But if, for some reason, the two range tombstones were already compacted 
> into a single range tombstone on another node, this will result in a Merkle 
> tree difference.
> Currently, this is clearly bad because Merkle tree differences depend on 
> compactions (and if a partition is deleted and re-created multiple times, the 
> only way to ensure that repair works correctly and doesn't overstream data is 
> to major compact before each repair... which is not really feasible).
> Below are the steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected 
> # between nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> The consequences are a costly repair, an accumulation of many small SSTables 
> (up to thousands over a rather short period of time when using vnodes, until 
> compaction absorbs those small files), and also an increased size on disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)