[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

Fabien Rousseau (JIRA) Mon, 02 May 2016 07:29:41 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266682#comment-15266682
 ]


Fabien Rousseau commented on CASSANDRA-11349:
---------------------------------------------

Great.

I just created a new patch (11349-2.1-v3.patch) in which the 'update' method of
the validation tracker is empty (it was a leftover from previous attempts and
should have been empty all along).

The main difference between, for example, the "update" method of the regular
compaction and addRangeTombstone of the validation compaction is the returned
value. In the latter case, it indicates whether the RT is superseded/shadowed
by another, previously met tombstone.
To be honest, I did not manage to factor them into common code without
compromising readability, even though they share some similarities.
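
To make the return value concrete, here is a minimal sketch. It is not the
code from the patch. The class name, the integer bounds, the field names other
than openedTombstones, and the assumption that tombstones arrive sorted by
start bound are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the validation tracker discussed above.
class ValidationTrackerSketch {
    /** Simplified range tombstone: inclusive [start, end] bounds plus a deletion timestamp. */
    static final class RangeTombstone {
        final int start;
        final int end;
        final long markedForDeleteAt;

        RangeTombstone(int start, int end, long markedForDeleteAt) {
            this.start = start;
            this.end = end;
            this.markedForDeleteAt = markedForDeleteAt;
        }
    }

    private final List<RangeTombstone> openedTombstones = new ArrayList<>();

    /**
     * Returns true if rt is superseded/shadowed by a previously met tombstone,
     * i.e. an earlier tombstone covers the same interval with a timestamp at
     * least as recent. Assumes tombstones are added in order of start bound.
     */
    boolean addRangeTombstone(RangeTombstone rt) {
        // Forget tombstones that end before this one starts.
        openedTombstones.removeIf(open -> open.end < rt.start);
        for (RangeTombstone open : openedTombstones) {
            if (open.start <= rt.start
                    && open.end >= rt.end
                    && open.markedForDeleteAt >= rt.markedForDeleteAt) {
                return true; // shadowed: the caller can skip it
            }
        }
        openedTombstones.add(rt);
        return false;
    }
}
```

A caller would skip any tombstone for which addRangeTombstone returns true,
whereas the regular "update" has no such result to act on.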

I'm a bit skeptical about ValidationCompactionTracker extending
RegularCompactionTracker because RegularCompaction has more fields
(unwrittenTombstones, atomCount) which would not be used by the
ValidationCompactionTracker (and it feels odd to have unused fields).
Doing it the other way around, i.e. RegularCompactionTracker extending
ValidationCompactionTracker, seemed a better fit (RegularCompaction reuses the
comparator and openedTombstones, and adds more fields), but there is not much
to gain: only the isDeleted method is in common.
Thus the interface did not seem a bad choice: the implementations are less
coupled (and could diverge more in the future if needed).
But this can be changed if needed (I just wanted to explain the design choices
and am not opposed to inheritance).
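
The trade-off can be sketched as follows. This is an illustration only, not
the patch: the interface name, the method signatures and the integer
positions are assumptions. Only the field names (openedTombstones,
unwrittenTombstones, atomCount) and the method names come from the discussion
above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical layout: one small interface, two independent implementations.
class TrackerDesignSketch {
    static final class Tombstone {
        final int start;
        final int end;
        final long markedForDeleteAt;

        Tombstone(int start, int end, long markedForDeleteAt) {
            this.start = start;
            this.end = end;
            this.markedForDeleteAt = markedForDeleteAt;
        }

        boolean deletes(int position, long timestamp) {
            return start <= position && position <= end && timestamp <= markedForDeleteAt;
        }
    }

    interface Tracker {
        void update(Tombstone t);
        boolean isDeleted(int position, long timestamp);
    }

    /** Carries only the state validation needs; 'update' is intentionally empty. */
    static final class ValidationTracker implements Tracker {
        private final List<Tombstone> openedTombstones = new ArrayList<>();

        public void update(Tombstone t) { }

        void addRangeTombstone(Tombstone t) { openedTombstones.add(t); }

        public boolean isDeleted(int position, long timestamp) {
            return openedTombstones.stream().anyMatch(t -> t.deletes(position, timestamp));
        }
    }

    /** Carries the extra fields that only regular compaction uses. */
    static final class RegularTracker implements Tracker {
        private final List<Tombstone> openedTombstones = new ArrayList<>();
        private final List<Tombstone> unwrittenTombstones = new ArrayList<>();
        int atomCount;

        public void update(Tombstone t) {
            openedTombstones.add(t);
            unwrittenTombstones.add(t);
            atomCount++;
        }

        public boolean isDeleted(int position, long timestamp) {
            return openedTombstones.stream().anyMatch(t -> t.deletes(position, timestamp));
        }
    }
}
```

With this layout neither implementation inherits fields it does not use, at
the cost of a duplicated isDeleted.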

I agree that this approach is a "leaky abstraction". Nevertheless, the main
idea is to have a patch that makes minimal architectural changes to the current
code base (I did not want to refactor anything) to avoid introducing bugs.
Moreover, because 3.X and 3.0.X are not affected, this will stay in the 2.1.X
and 2.2.X branches (and won't become technical debt).
Anyway, it's more a pragmatic solution than an elegant one (and of course, I am
open to a more elegant solution).

> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> ---------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
>              Labels: repair
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the merkle 
> tree computation.
> But, if for some reason, on another node, the two range tombstones were 
> already compacted into a single range tombstone, this will result in a merkle 
> tree difference.
> Currently, this is clearly bad because MerkleTree differences are dependent 
> on compactions (and if a partition is deleted and created multiple times, the 
> only way to ensure that repair "works correctly"/"doesn't overstream data" is 
> to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected 
> between nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> The consequences are a costly repair, the accumulation of many small SSTables 
> (up to thousands for a rather short period of time when using VNodes, until 
> compaction absorbs those small files), and also an increased size on disk.
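
The mismatch described in the quoted report can be illustrated with a small,
self-contained sketch. It does not use Cassandra's validation code: the class
name, the choice of MD5 and the encoding of a tombstone as three longs (start,
end, timestamp) are assumptions made for illustration only.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration: hashing every tombstone seen makes the digest
// depend on whether compaction has already merged them.
class DigestMismatchSketch {
    /** Each tombstone is {start, end, markedForDeleteAt}; all are fed to the digest in order. */
    static byte[] digest(long[][] tombstones) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (long[] rt : tombstones) {
                ByteBuffer buf = ByteBuffer.allocate(3 * Long.BYTES);
                buf.putLong(rt[0]).putLong(rt[1]).putLong(rt[2]);
                md.update(buf.array());
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean sameDigest(long[][] a, long[][] b) {
        return Arrays.equals(digest(a), digest(b));
    }

    public static void main(String[] args) {
        // Node A: two tombstones on the same interval, not yet compacted.
        long[][] nodeA = { {1, 5, 10}, {1, 5, 20} };
        // Node B: compaction already merged them into the most recent one.
        long[][] nodeB = { {1, 5, 20} };
        System.out.println("digests match: " + sameDigest(nodeA, nodeB));
    }
}
```

Both nodes hold the same logical deletion, yet the digests differ because
node A also hashes the shadowed tombstone.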



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
