[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2017-01-02 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792783#comment-15792783
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


I've looked at some metrics today for one of our clusters that was updated 
to 2.1.16 a couple of weeks ago. We used to see tens of thousands of SSTables, 
amounting to many GBs, streamed each night during repairs.

With 2.1.16 the number of streamed SSTables went down to almost none. Thanks 
to everyone involved for fixing this! :)

> MerkleTree mismatch when multiple range tombstones exists for the same 
> partition and interval
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-11349
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Fabien Rousseau
>Assignee: Branimir Lambov
>  Labels: repair
> Fix For: 2.1.16, 2.2.8
>
> Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 
> 11349-2.1-v4.patch, 11349-2.1.patch, 11349-2.2-v4.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and 
> many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, 
> which is really high.
> After investigation, it appears that, if two range tombstones exist for a 
> partition for the same range/interval, they're both included in the merkle 
> tree computation.
> But if, for some reason, on another node the two range tombstones were 
> already compacted into a single range tombstone, this will result in a merkle 
> tree difference.
> Currently, this is clearly bad because MerkleTree differences depend 
> on compactions (and if a partition is deleted and created multiple times, the 
> only way to ensure that repair "works correctly"/"doesn't overstream data" is 
> to major compact before each repair... which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
> c1 text,
> c2 text,
> c3 float,
> c4 float,
> PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected 
> between nodes (while there shouldn't have been any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair, accumulating many small SSTables 
> (up to thousands for a rather short period of time when using VNodes, the 
> time for compaction to absorb those small files), but also an increased size 
> on disk.





[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-07-19 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383900#comment-15383900
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Here is some quick feedback: the patch has been deployed in production for more 
than a week and has greatly reduced streaming during repairs.

A patched version of C* 2.1.14 (containing the patch) was deployed on all 
of our production clusters more than a week ago and works well.
There are still a few differences during repairs, but far fewer than before, 
and this is "manageable".

Thanks to all of you for your help.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-07-04 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361168#comment-15361168
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

Tests look OK; all failures either also fail on the base branch or started failing very recently.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-07-04 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361016#comment-15361016
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

Rebased patch here:
|[2.1|https://github.com/blambov/cassandra/tree/11349]|[utests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-dtest/]|
|[2.2|https://github.com/blambov/cassandra/tree/11349-2.2]|[utests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-2.2-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-2.2-dtest/]|




[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-06-28 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352747#comment-15352747
 ] 

Sylvain Lebresne commented on CASSANDRA-11349:
--

Had a look here, and I'm more comfortable sticking with [~blambov]'s approach. 
For 2.1 and 2.2, we're now in "only critical bug fixes" mode, and running things 
through RTL definitely changes things too much for my comfort. That implies I'm 
fine with not fixing every possible problem if that would take us too far 
(especially since it's properly fixed in 3.0 and not that many people seem to 
have reported this). And Branimir's approach seems to be making a good enough 
impact in practice.

So [~blambov], could you rebase your patch for 2.1 and 2.2 and run CI? If tests 
are good, I'm +1 on committing.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-06-16 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333457#comment-15333457
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Just to let you know that we packaged the patch done by Branimir (as it is the 
one with the best chance of being merged upstream).

We restored one cluster (3 nodes, 100GB of data per node, affected table is 
25GB) from a snapshot on new hardware and did a full repair. So far, so good: 
few differences were found for the affected table, and some were expected 
because repairs had not been run for a few months (around a hundred vs. a few 
hundred thousand before).

We will continue testing by recreating all of our clusters and then deploy it 
to production (I'll let you know once this is done).



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-06-10 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324477#comment-15324477
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


You're correct to point out that live columns can prevent fully normalizing 
all RTs using the RTL approach in patch v4. It will still be more accurate than 
without RTL consolidation, but the question is whether the additional complexity 
is worth it. If you'd be more comfortable going with the patch you initially 
suggested, I'm confident that it will still be a big improvement. 




[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-06-08 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321446#comment-15321446
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

Does this really solve the problem with the test you mentioned? Putting the 
tombstones through {{RangeTombstoneList}} will normalize them, but they may not 
be issued in the right position, i.e. the RTL solution only works if the data 
contains only tombstones. For example, the 
{{\["b:d:\!","b:\!",1463656272792,"t",1463731877\]}} part from the test above 
gets issued before a potential token that may come before {{b:d:!}}.

The test needs to be extended to include live tokens, for example by adding 
each of
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'b', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'd', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'e', 'a', 1)
{code}
after the deletions.

The RTL solution will break (in different ways) for at least two of the above. 
It also has performance implications that I am not really happy to take. A 
proper solution is to either fully replicate what RTL does in the tombstone 
tracker (which may not be worth it so late in the lifespan of 2.1 and 2.2), 
or make the tombstone tracker wrap around an RTL (which may be inefficient and 
is still somewhat tricky).

If (as Fabien's testing seems to imply) doing the digest update at 
serialization time solves the majority of the differences and repair pain, I 
would prefer to stop there.
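
For illustration, a minimal sketch of the "digest update at serialization time" idea (class and method names here are invented, not Cassandra's actual API): the validator digests exactly the bytes an atom serializes to, so replicas whose tombstones serialize to the same normalized form produce the same digest.

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.MessageDigest;

// Minimal sketch with invented names: digest the serialized bytes of each
// atom rather than its in-memory representation, so the validation digest
// reflects the form that would actually be written out.
public class SerializationDigest
{
    interface Atom
    {
        void serialize(DataOutputStream out) throws IOException;
    }

    static void updateDigest(MessageDigest digest, Atom atom) throws IOException
    {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        atom.serialize(new DataOutputStream(buffer));
        digest.update(buffer.toByteArray()); // digest what would be written
    }
}
{code}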



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-06-02 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312087#comment-15312087
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


I've now attached a patch for the last mentioned implementation as 
{{11349-2.1-v4.patch}} and {{11349-2.2-v4.patch}} to the ticket.

Test results are as follows (the reported failures cannot be reproduced 
locally and seem to me to be unrelated):

||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|

Anyone willing to take another look and actually commit a patch for this issue? 
I've pushed my WIP branch 
[here|https://github.com/spodkowinski/cassandra/commits/WIP2-11349] with 
individual commits that might help during the review.




[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-31 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308051#comment-15308051
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Thanks Stefan.

So if I understand correctly, your latest branch does not change how SSTables 
are serialized on disk (it uses two specialized serializers: one for compaction 
and one for validation) but still solves all cases (or at least all known ones).

Any chance that this patch can be included in 2.1.15?
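
For reference, a rough sketch of the two-serializer split described above (all names invented for illustration; this is not the actual patch): the compaction path emits atoms unchanged, preserving the on-disk format, while the validation path normalizes tombstones before producing the digest input, so equivalent deletions hash identically.

{code}
import java.util.Map;
import java.util.TreeMap;

// Illustration only (invented names): the same atoms take two paths.
public class SplitSerializers
{
    record Tombstone(String range, long timestamp) {}

    // compaction path: write exactly what was read (on-disk format unchanged)
    static void writeForCompaction(Iterable<Tombstone> atoms, StringBuilder out)
    {
        for (Tombstone t : atoms)
            out.append(t.range).append('@').append(t.timestamp).append('\n');
    }

    // validation path: collapse duplicate tombstones per range (newest
    // timestamp wins) before building the digest input
    static void writeForValidation(Iterable<Tombstone> atoms, StringBuilder digestInput)
    {
        Map<String, Long> newest = new TreeMap<>();
        for (Tombstone t : atoms)
            newest.merge(t.range, t.timestamp, Math::max);
        newest.forEach((range, ts) ->
            digestInput.append(range).append('@').append(ts).append('\n'));
    }
}
{code}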



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-20 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293719#comment-15293719
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


I've now created a new patch version 
[here|https://github.com/spodkowinski/cassandra/commit/c8601f8cd3921e754bcbe8c9362cf3d2e7072e1e]
 that basically combines both of your ideas: doing the digest updates in the 
serializer and using {{RangeTombstoneList}} to normalize RT intervals. Tests 
look good, feel free to add your own. [~blambov], can you think of any further 
cases that would not be covered by this approach? 



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-20 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293709#comment-15293709
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

OK, it appears that the initial idea by [~blambov] is sufficient (after having 
done some basic testing on our 4th cluster).

Nevertheless, I'm surprised that we seem to be the only ones affected by this 
issue. Maybe it's because it took us some time to realize it and investigate 
it, and there was no clear sign apart from big streams during repairs plus a 
data set size increasing too fast. That may explain why not many people have 
reported it, but there may be others affected out in the wild. That's why it's 
probably best to try to fix most of it (if it's not possible to fix it 
entirely), but I also understand that the fewer changes there are, the less 
risky it is...

So I'm good with it either partially fixed or mostly fixed. 



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-20 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293041#comment-15293041
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

[~blambov] We have 4 clusters impacted by this bug, and for 3 out of 4, what 
you have in mind works.
I still need to verify the 4th one; I'll try to do that today.
Regarding 3.0, migrating 60 nodes is not something done easily.

[~spo...@gmail.com] Yes, there are 3 RTs on node2 because, in memory, RTs are 
stored in a RangeTombstoneList (then serialized), and the RangeTombstoneList 
automatically splits overlapping tombstones.
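
A minimal sketch of that splitting behavior (a simplified model, not Cassandra's actual {{RangeTombstoneList}}): overlapping deletions are cut into non-overlapping slices and the newest deletion wins on each slice, which is how the two deletions from the trace below come out as three tombstones on node2.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Simplified model (not Cassandra's RangeTombstoneList): split overlapping
// range tombstones into non-overlapping slices; the newest timestamp wins
// on each slice. Clustering bounds are modelled as plain strings.
public class SplitOverlappingRts
{
    record Rt(String start, String end, long timestamp) {}

    static List<Rt> normalize(List<Rt> input)
    {
        TreeSet<String> bounds = new TreeSet<>();
        for (Rt rt : input) { bounds.add(rt.start); bounds.add(rt.end); }
        List<String> b = new ArrayList<>(bounds);
        List<Rt> slices = new ArrayList<>();
        for (int i = 0; i + 1 < b.size(); i++)
        {
            long ts = Long.MIN_VALUE;
            for (Rt rt : input) // newest tombstone covering this slice wins
                if (rt.start.compareTo(b.get(i)) <= 0 && rt.end.compareTo(b.get(i + 1)) >= 0)
                    ts = Math.max(ts, rt.timestamp);
            if (ts != Long.MIN_VALUE)
                slices.add(new Rt(b.get(i), b.get(i + 1), ts));
        }
        return slices;
    }

    public static void main(String[] args)
    {
        // a row deletion for c2 = 'b' at ts ..792, then a narrower deletion
        // for c3 = 'd' at ts ..793 (as in the trace below)
        List<Rt> rts = List.of(new Rt("b:", "b:~", 1463656272792L),
                               new Rt("b:d", "b:d~", 1463656272793L));
        normalize(rts).forEach(System.out::println);
        // three slices: [b:, b:d)@..792, [b:d, b:d~)@..793, [b:d~, b:~)@..792
    }
}
{code}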



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-19 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291211#comment-15291211
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


I've been debugging the last mentioned error case using the following cql/ccm 
statements and a local 2-node cluster.

{code}
create keyspace ks WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 2};
use ks;
CREATE TABLE IF NOT EXISTS table1 ( c1 text, c2 text, c3 text, c4 float,
 PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 
'false'};
DELETE FROM table1 USING TIMESTAMP 1463656272791 WHERE c1 = 'a' AND c2 = 'b' 
AND c3 = 'c';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272792 WHERE c1 = 'a' AND c2 = 'b';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272793 WHERE c1 = 'a' AND c2 = 'b' 
AND c3 = 'd';
ccm node1 flush
{code}

Timestamps have been added for easier tracking of the specific tombstone in the 
debugger.

ColumnIndex.Builder.buildForCompaction() will add tombstones to the tracker in 
the following order:

*Node1*

{{1463656272792: c1 = 'a' AND c2 = 'b'}}
First RT, added to unwritten + opened tombstones

{{1463656272791: c1 = 'a' AND c2 = 'b' AND c3 = 'c'}}
Overshadowed by the previously added RT while also being older. Will not be 
added and is simply ignored.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}}
Overshadowed by the first and only RT added to opened so far, but newer, and 
will thus be added to unwritten + opened.

We end up with 2 unwritten tombstones (..92+..93) passed to the serializer for 
message digest.


*Node2*

{{1463656272792: c1 = 'a' AND c2 = 'b'}} (EOC.START)
First RT, added to unwritten + opened tombstones

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}} (EOC.END)
Comparison of the EOC flag (Tracker:251) of the previously added RT will cause 
it to be removed from the opened list (Tracker:258). Afterwards the current RT 
will be added to unwritten + opened.

{{1463656272792: c1 = 'a' AND c2 = 'b'}} ({color:red}again!{color})
Gets compared with the previously added RT, which supersedes the current one 
and thus stays in the list. It will again be added to the unwritten + opened 
lists.

We end up with 3 unwritten RTs, including 1463656272792 twice.

I still haven't been able to pinpoint exactly why the reducer is called twice 
with the same TS, but since [~blambov] explicitly mentioned that possibility, 
I guess it's intended behavior (but why? :)). 
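
To illustrate why that duplicate matters, a toy example (not Cassandra's actual digest code): updating a digest with the same tombstone bytes twice necessarily yields a different hash than updating it once, which is exactly the MerkleTree mismatch between node1 (2 unwritten RTs) and node2 (3, one duplicated).

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Toy example: the duplicated ..792 tombstone changes node2's digest even
// though both nodes represent logically identical deletions.
public class DuplicateRtDigest
{
    public static void main(String[] args) throws Exception
    {
        byte[] rt792 = "a:b @1463656272792".getBytes(StandardCharsets.UTF_8);
        byte[] rt793 = "a:b:d @1463656272793".getBytes(StandardCharsets.UTF_8);

        MessageDigest node1 = MessageDigest.getInstance("MD5");
        node1.update(rt792);
        node1.update(rt793);

        MessageDigest node2 = MessageDigest.getInstance("MD5");
        node2.update(rt792);
        node2.update(rt793);
        node2.update(rt792); // the duplicate ..792 RT from the trace above

        // different digests -> MerkleTree mismatch -> needless streaming
        System.out.println(java.util.Arrays.equals(node1.digest(), node2.digest())); // false
    }
}
{code}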

[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-19 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290648#comment-15290648
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

There will be cases where this {{RangeTombstoneList}} solution is not 
sufficient (e.g. inserting {{c1 = 'a' AND c2 = 'b' AND c3 = 'a'}} at the end of 
the test above).

Is it imperative that we fix all scenarios here if 3.0 has the proper solution?



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-17 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287685#comment-15287685
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

OK, this seems a better approach (with it, RTs are added to the tracker 
through the ColumnIndex.add method).
I had some time to test it in a dev environment and repair did not find any 
difference (which is a good thing).

Regarding the case that is not working correctly, I think the solution is to 
run the RangeTombstones through a RangeTombstoneList before writing them.

The current implementation of Tracker.writeUnwrittenTombstones(...) is:
{noformat}
for (RangeTombstone rt : unwrittenTombstones)
{
    size += writeTombstone(rt, out, atomSerializer);
}
{noformat}
And it should be replaced by:
{noformat}
// collect the unwritten tombstones into a RangeTombstoneList first,
// which splits them so that no tombstone overlaps another
RangeTombstoneList rtl = new RangeTombstoneList(comparator, unwrittenTombstones.size());
for (RangeTombstone rt : unwrittenTombstones)
{
    rtl.add(rt);
}
// then serialize the normalized, non-overlapping tombstones
for (RangeTombstone rt : rtl)
{
    size += writeTombstone(rt, out, atomSerializer);
}
{noformat}
I haven't tested this but it should work.
The explanation for this is the following:
 - on node1, due to the flushes, each RT is written in its own SSTable
 - on node2, because all RTs are kept in memory, they're kept in a 
RangeTombstoneList. This RangeTombstoneList will keep non overlapping RTs.

During repair, on node1, RTs are merged but are kept as is (ie some RTs can be 
overlapped) while on node2, they can't.
By using the RangeTombstoneList before serializing the unwritten RT, no RT can 
overlap another RT.

Note: doing the change above will also change the way RT are serialized during 
normal compactions...



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-11 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280088#comment-15280088
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


Thanks for the clarification. It's really helpful to understand how those
parts are supposed to work together.

The serializer approach seems like a good way to handle this, but there are
still
[cases|https://github.com/spodkowinski/cassandra-dtest/blob/b110685bceddbcb63ebc744ba54a25cb268f2478/repair_tests/repair_test.py#L438:L451]
 \[1\] that are not handled correctly. I'm going to take a closer look to
understand why. I'd also like to do some more testing for potential digest
mismatch storms during rolling upgrades, but I don't expect any blockers so
far.

\[1\] nosetests 
repair_tests/repair_test.py:TestRepair.shadowed_range_tombstone_digest_parallel_repair_test


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-10 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278032#comment-15278032
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

Not precisely: one part of the problem is that we cannot ensure that e.g. the
same range tombstone will not come twice from the same sstable (in which case
{{MergeIterator}} would issue two separate {{getReduced}} calls) or from two
different sstables (in which case {{MergeIterator}} would call {{getReduced}}
once with both) or some complex combination of these. Another is that while
compaction uses the tracker to identify when a tombstone is redundant and can
be omitted, {{getReduced}} does not have that information at the time it
processes that tombstone, because the covering tombstone has not arrived yet.

The tracker can properly resolve these situations, but it can't do so without
delaying, which is what makes abusing the serializer necessary.

The reducer only adds RTs to the tracker if it would not return them in the 
output for some reason (e.g. expiration), the point being to always pass on the 
full stream of RTs; the order should not be affected by it choosing to do that.
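
To illustrate the ordering problem (interval bounds and timestamps are
invented):
{noformat}
atoms as they reach getReduced(), in clustering order:

  sstable1: RT[b..c)@t1, RT[b..c)@t1   <- duplicate within a single source:
                                          two separate getReduced() calls
  sstable2: RT[b..c)@t1                <- same RT from a second source:
                                          merged into one call
  sstable3: RT[b..z)@t9                <- covers and supersedes RT[b..c)@t1,
                                          but may only arrive in a later call
{noformat}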


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-10 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277977#comment-15277977
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


To quickly sum up the current behavior: {{ColumnIndex.Builder}} is created for
each {{LazilyCompactedRow.update()}} call. The builder iterates through all
atoms produced by the {{MergeIterator}} and uses a {{RangeTombstone.Tracker}}
instance for tombstone normalization. Tombstones are added to the tracker from
{{Builder.add()}} and by {{LCR.Reducer.getReduced()}}, which in turn is called
once for all atoms of the same column, as considered by
{{onDiskAtomComparator}}.
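
As a rough outline of that flow (a simplified paraphrase, not actual code):
{noformat}
LazilyCompactedRow.update(digest)
  -> new ColumnIndex.Builder(...)       // one builder per update() call
       for each OnDiskAtom from the MergeIterator:
         LCR.Reducer.getReduced()       // one call per group of atoms comparing
                                        // equal under onDiskAtomComparator;
                                        // may itself add RTs to the tracker
         Builder.add(atom)              // RTs go into RangeTombstone.Tracker
                                        // for normalization
{noformat}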

[~blambov], so what you're saying is that we can't be sure that the
{{MergeIterator}} will always be able to provide deterministically ordered
values, as write order may differ, and we therefore cannot simply iterate
through the reducer to create a correct digest.

What I'm a bit concerned about while trying to understand Branimir's approach
is that at some point {{getReduced()}} will add the RT to the tracker, while
in another scenario the RT will be added later, causing the serializer to be
called differently as well. To put it in other words: if we can't be sure
about the reducer returning deterministically ordered values, won't this
affect the tracker and digest calculation in the builder as well?


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-05 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272427#comment-15272427
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

I have something like 
[this|https://github.com/apache/cassandra/compare/trunk...blambov:11349] in 
mind.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-04 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270572#comment-15270572
 ] 

Branimir Lambov commented on CASSANDRA-11349:
-

As I see it, neither solution will be sufficient. A lot of the visible effects
of the problem come as a side effect of CASSANDRA-7953, but there are some
underlying issues that are only really solved in 3.0 by the new tombstone
handling from CASSANDRA-8099.

Whether we change {{onDiskAtomComparator}} or not, we will still get
disordered or multiple equal range tombstones from a single source, as that's
how they are written in the sstables. {{MergeIterator}} will not combine equal
entries from the same source; even if it did, and everything were written
using the {{onDiskAtomComparator}} (which I don't believe to be the case), it
would still be in the wrong order for resolving which tombstones can be
deleted without delaying their processing.

In other words, the problem cannot be solved by changing the reducer; we can,
however, do it if we change {{update}} to follow closely or, better still,
_call_ {{IndexBuilder.buildForCompaction}} and make the builder accept a
prepared atom serializer (or some subinterface) instead of an output file, and
update the digest in the calls to that serializer.
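
A minimal sketch of that direction ({{AtomWriter}}/{{DigestWriter}} are
assumed names used only for illustration; only {{OnDiskAtom.updateDigest()}}
is existing API):
{noformat}
import java.security.MessageDigest;
import org.apache.cassandra.db.OnDiskAtom;

// Hypothetical subinterface the builder could write through instead of an
// output file.
interface AtomWriter
{
    void write(OnDiskAtom atom);
}

// For validation compaction: atoms reach the digest only once the builder
// has decided to "write" them, i.e. in their normalized, deduplicated form.
final class DigestWriter implements AtomWriter
{
    private final MessageDigest digest;

    DigestWriter(MessageDigest digest)
    {
        this.digest = digest;
    }

    public void write(OnDiskAtom atom)
    {
        atom.updateDigest(digest);
    }
}
{noformat}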


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-03 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268627#comment-15268627
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


Sounds reasonable, and I agree that the code changes should be as non-invasive
as possible. We're talking about 2.x, so we should avoid heavy refactoring.
The mentioned class design could possibly still be improved, but that depends
on where we go from here..

To wrap up the available patch options:

1) {{11349-2.1.patch}} with 2 changed lines would address the issue initially
described
2) {{11349-2.1-v3.patch}} introduces a few more changes but will also create
correct digests for shadowed range tombstones and cells (see
[dtest|https://github.com/spodkowinski/cassandra-dtest/blob/CASSANDRA-11349/repair_tests/repair_test.py#L425])

Any opinions on this besides Fabien's and mine? It would be good to get some
feedback from someone who would actually be willing to commit something like
this.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-02 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266682#comment-15266682
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Great.

I just created a new patch (11349-2.1-v3.patch) where the 'update' method is
empty in the validation tracker (in fact, it was a left-over from previous
attempts and should have been empty).

The main difference between the 'update' method of the regular compaction and
the 'addRangeTombstone' of the validation compaction is the return value: in
the latter case, it is whether the RT is superseded/shadowed by another
previously seen tombstone.
To be honest, I did not manage to factor the two together without
compromising readability, even though they share some similarities.

I'm a bit skeptical about ValidationCompactionTracker extending
RegularCompactionTracker, because RegularCompactionTracker has more fields
(unwrittenTombstones, atomCount) which would not be used by the
ValidationCompactionTracker (and it feels odd to have unused fields).
Doing it the other way around, i.e. RegularCompactionTracker extending
ValidationCompactionTracker, seemed a better fit (RegularCompactionTracker
reuses the comparator and openedTombstones, and adds more fields), but there
is not much to win: only the isDeleted method is in common...
Thus the interface did not seem a bad choice: the implementations are less
coupled (and could diverge further in the future if needed).
But this can be changed if needed (I just wanted to explain the design
choices and am not opposed to inheritance).

I agree that this approach is a "leaky abstraction". Nevertheless, the main
idea is to have a patch making minimal architectural changes to the current
code base (I did not want to refactor anything) to avoid introducing bugs.
Moreover, because 3.X and 3.0.X are not affected, this will stay in the 2.1.X
and 2.2.X branches (and won't become technical debt).
Anyway, it's more a pragmatic solution than an elegant one (and evidently I
am open to a more elegant solution).


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-05-02 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266474#comment-15266474
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


I'm not sure introducing a new tracker interface is the best way to handle
this. It took me a while to actually figure out the differences between the
{{update}} implementations in the two trackers, since for the most part they
share the same copied code. It would probably be better to have
ValidationTracker subclass RegularCompactionTracker and implement
{{remove/addUnwrittenTombstone}} as empty methods for validation.
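
A sketch of that alternative (the method names are taken from this comment;
the class shapes are otherwise assumed):
{noformat}
// Validation writes no output file, so tracking "unwritten" tombstones can
// become a no-op.
class ValidationTracker extends RegularCompactionTracker
{
    @Override
    public void addUnwrittenTombstone(RangeTombstone rt)
    {
        // nothing to write during validation
    }

    @Override
    public void removeUnwrittenTombstone(RangeTombstone rt)
    {
        // no-op for validation
    }
}
{noformat}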

The {{addRangeTombstone}} semantics also look like a case of leaky
abstraction to me: it adds nothing at all for regular compaction, but serves
as an early exit path for validation.

The good news is that the dtests and unit tests seem to pass with the patch. :)


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-28 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262171#comment-15262171
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

OK, I uploaded a new version of the patch (11349-2.1-v2.patch).

As said above, there are now two Tracker implementations: one for regular
compaction and another for validation compaction.
It solves both cases described here (the one in the ticket plus the one in
Stefan's comment) as well as CASSANDRA-11477.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-22 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254670#comment-15254670
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Sorry for not being responsive lately, I'm rather busy atm...

I'd be more than happy[1] to see this patch in the next release.
I haven't tested it yet, but I can probably find some time next week to test
it on a dev cluster if that helps.
Nevertheless, I won't be able to tell whether it really worked, because there
will still be some mismatches (due to CASSANDRA-11477).

I have started working on a patch which should be able to handle both
CASSANDRA-11477 and the last edge case.

What it basically does (a rough sketch follows the list):
 - Tracker is now an interface
 - there are two implementations: RegularCompactionTracker and
ValidationCompactionTracker
 - the ColumnIndexer.Builder has one more optional parameter: a boolean
indicating whether it is built for validation
 - the RegularCompactionTracker is identical to the existing Tracker, plus
one empty method
 - the ValidationCompactionTracker is similar to the existing Tracker but
retains only opened tombstones (most methods are thus empty)
 - the Reducer changed slightly, but its behaviour is the same for regular
compactions
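
The rough shape (only the class and method names come from this thread; the
signatures are assumed):
{noformat}
import org.apache.cassandra.db.Cell;
import org.apache.cassandra.db.OnDiskAtom;

// Behavior shared by both compaction paths.
interface Tracker
{
    // feed the next atom to the tracker (RTs are normalized here)
    void update(OnDiskAtom atom);

    // is this cell shadowed by a currently opened range tombstone?
    boolean isDeleted(Cell cell);
}

// RegularCompactionTracker: the existing Tracker behavior, which also keeps
// unwritten tombstones so they can be serialized to the output sstable.
// ValidationCompactionTracker: retains only the opened tombstones; all
// output-related methods stay empty.
{noformat}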

I can share it if you're interested (the code compiles, but I still haven't
tested it at all; I plan to do that soon and share it afterwards).

[1] Just to share more information: those issues are important to us because
a few of our clusters are impacted. A few days after filing the bug, we
decided to temporarily stop repairing some of the tables that were heavily
impacted (knowing that we could live with inconsistencies on those tables;
each repair increased disk occupancy by a few percent), and did a major
compaction. This resulted in two to three times less disk occupancy (one
table shrank from 243GB to 79GB). Note that this was not due to tombstones
reclaiming old data: it's been nearly a month now, the big SSTable resulting
from the major compaction is still there, and disk usage has not grown much
since.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-22 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253514#comment-15253514
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


Can we keep the conversation going to get this patch into the next 2.x release?


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-07 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230127#comment-15230127
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


[~frousseau], what makes things more complicated here is that changes to LCR
will affect regular compactions as well. Adding all tombstones as expired, as
in your {{11349-2.1-v2.patch}}, will have unwanted side effects for regular
compactions; e.g. try {{RangeTombstoneMergeTest}} with it.

I've now spent some time trying to make use of the RT.Tracker there, but
without much success. Adding non-expired range tombstones to the tracker from
within LCR would cause corrupted sstables. Even creating a special case just
for validation compaction would not handle all potential tombstone shadowing
scenarios and would probably cause more harm than good (and potential digest
mismatch storms). I'm not even sure it's possible given the current iterative
MergeIterator > LazilyCompactedRow > RT.Tracker interaction.

I'm now at a point where I'd suggest just sticking with {{11349-2.1.patch}},
unless someone else has a better idea how to solve this. I've updated the
[dtest PR|https://github.com/riptano/cassandra-dtest/pull/881] with two of the
described shadowing scenarios that will only work with 3.0+ even after the
patch, if someone wants to give it a try.

Cassci results for {{11349-2.1.patch}}:

||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-05 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227163#comment-15227163
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Using the RangeTombstone.Tracker can help in the situation described just
above.

In fact, the RT should always update the tracker (see CASSANDRA-11477).
The trick here is to always consider it as "expired" in the tracker (even if
it is not), so the tombstones are not accumulated during compaction (if
expired, the tracker keeps only the list of opened RTs; if not, it keeps all
unwritten RTs, i.e. all RTs, because it's a validation compaction...).

Having a look at the update method of the Tracker, it already checks whether
the tombstone is superseded by another one (and doesn't add it as "opened" if
superseded).

Thus, the v2 patch (see the sketch after this list):
 - includes the previous patch
 - always updates the tracker with the RT (considering it as expired even if
it is not, so as not to retain too many of them in memory; because it's a
validation compaction, it's read-only and won't affect anything)
 - tests whether the RT was added to the openedTombstones list, and if that's
not the case, skips it for the digest.
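
A sketch of the resulting digest path (the method names are taken from the
patch description above; the exact signatures are assumed):
{noformat}
// Hypothetical shape of the v2 validation digest path.
void updateDigest(OnDiskAtom atom, RangeTombstone.Tracker tracker, MessageDigest digest)
{
    if (atom instanceof RangeTombstone)
    {
        RangeTombstone rt = (RangeTombstone) atom;
        tracker.update(rt, true);       // always track, treated as expired
        if (!tracker.isLastOpened(rt))  // superseded by another opened RT
            return;                     // -> skip it for the digest
    }
    atom.updateDigest(digest);
}
{noformat}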

I know that the patch may be a bit rough (at least on the "isLastOpened"
method), but it is meant to validate the approach first, and I did not want
the patch to be too invasive (e.g. by modifying the return value of the
update method).

WDYT?

Note: I have not yet tested it against our production data.
Note2: Regarding the read repair, this seems to be a different story; I can't
see anything for now that could explain those differences (I will dig into
this later, as it is less urgent).


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-02 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222902#comment-15222902
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


It makes sense just to modify {{onDiskAtomComparator}}. Given the generic
name, I assumed the comparator is used in other places as well, but since
it's only used in {{LazilyCompactedRow}}, we can just change the patch as
suggested and simply remove the timestamp tie-break behaviour in
{{onDiskAtomComparator}}.

As for regular compactions, I agree with Tyler that this should not affect
compactions the way it does validation compaction. Before the patch,
{{LazilyCompactedRow}} would not reduce both RTs, but instead have
{{ColumnIndex.buildForCompaction()}} iterate over both RTs and add them to
the {{RangeTombstone.Tracker}}. The tracker would merge them the way
{{LCR.Reducer.getReduced}} would after the patch. However, I'm not fully sure
whether there could be other, more complex cases where this would still cause
problems.

Although the patch should fix the described issue, the way we deal with RTs
during validation compaction is still not ideal. The problem is that LCR
lacks some handling of relationships between RTs, compared to
{{RangeTombstone.Tracker}}. If we create digests column by column, we get
wrong results for shadowing tombstones that don't share the same intervals.

{noformat}
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy',
'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};
DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';

ccm node1 flush

DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';

ccm node1 repair test_rt table1
{noformat}


In this case the (c1, c2, c3) RT will always be repaired after it has been
compacted with (c1, c2) on any node.
So I'm wondering whether we shouldn't take a bolder approach here than the
patch does.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222365#comment-15222365
 ] 

Tyler Hobbs commented on CASSANDRA-11349:
-

I think there are a couple of things wrong with the current patch.

First, the comparator needs to continue to compare the full cell name first, 
and only break ties on range tombstones with {{compare(t1.max, t2.max)}}.

Second, I believe we should remove the timestamp tie-breaking behavior from the 
comparator in general, and not just for validation compactions.  In other 
words, I think we're doing the comparison incorrectly for all compactions right 
now.

We want the comparison to return 0 whenever range tombstones have equal names 
and ranges, even if they have different timestamps.  This will result in 
{{LazilyCompactedRow.Reducer.reduce()}} being called in one round with each of 
the tombstones that only differ in timestamp.  The logic in 
{{LCR.Reducer.reduce()}} already handles the case of multiple range tombstones 
with different timestamps by picking the one with the highest timestamp, so 
these will correctly be reduced to a single RT.  It looks like the current 
codebase will keep both range tombstones during a compaction, which isn't 
necessarily harmful, but is suboptimal.  For repair purposes, though, this is 
incorrect as it produces a different digest.

To summarize: I think all we need to do is remove the timestamp tie-breaking 
logic from the existing comparator.
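
To illustrate (a sketch only, not the actual patch): with the tie-breaking 
removed, the {{RangeTombstone}} branch of the comparator would reduce to 
something like:

{code}
if (c1 instanceof RangeTombstone)
{
    if (c2 instanceof RangeTombstone)
    {
        // compare only the covered interval; without the
        // t1.data.compareTo(t2.data) tie-breaker, RTs that differ only in
        // timestamp compare as equal and reach Reducer.reduce() in one round
        return AbstractCellNameType.this.compare(((RangeTombstone) c1).max,
                                                 ((RangeTombstone) c2).max);
    }
    return -1;
}
return c2 instanceof RangeTombstone ? 1 : 0;
{code}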

[~slebresne] should double-check my logic, though.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Richard Low (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222330#comment-15222330
 ] 

Richard Low commented on CASSANDRA-11349:
-

I'm also not sure how this is meant to fix it. Special-casing validation 
compaction may fix repairs, but you'd still get digest mismatches on reads.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222304#comment-15222304
 ] 

Michael Kjellman commented on CASSANDRA-11349:
--

And just for my sanity, and for discussion in the Jira, here is the current 
handling in the comparator:

{code}
if (c1 instanceof RangeTombstone)
{
    if (c2 instanceof RangeTombstone)
    {
        RangeTombstone t1 = (RangeTombstone) c1;
        RangeTombstone t2 = (RangeTombstone) c2;
        int comp2 = AbstractCellNameType.this.compare(t1.max, t2.max);
        return comp2 == 0 ? t1.data.compareTo(t2.data) : comp2;
    }
    else
    {
        return -1;
    }
}
else
{
    return c2 instanceof RangeTombstone ? 1 : 0;
}
{code}



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Michael Kjellman (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1597#comment-1597
 ] 

Michael Kjellman commented on CASSANDRA-11349:
--

[~spo...@gmail.com] [~slebresne] [~frousseau] I'm confused here. Why should 
repair be special-cased over normal compaction? If the timestamps differ, you 
*do* still need to resolve the tombstones by taking the greater timestamp.

It seems to me the crux of the current patch is to "fix" this by special-casing 
the comparator to compare just the max value of the interval during repair 
validation:

{code}
// only compare interval, but not deletion time
return AbstractCellNameType.this.compare(((RangeTombstone)c1).max, ((RangeTombstone)c2).max);
{code}

I just did my best to merge and compare the code between 2.0 and 2.1, and I'm 
still trying to work out how this code differs between the two. We've been 
unable to reproduce this in 2.0 so far, but the bits of code being touched here 
don't seem to differ, so I'm trying to understand why 2.1 would hit this and 
not 2.0.

Could you please explain a bit more why we can ignore the timestamp?




[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221919#comment-15221919
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

I tested against 3.0.4 and it is not affected (I have not tested 3.X, but 
assume it is not affected either).
There is another similar ticket (3.0.4 is not affected): 
https://issues.apache.org/jira/browse/CASSANDRA-11477



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221878#comment-15221878
 ] 

Sylvain Lebresne commented on CASSANDRA-11349:
--

I suspect this doesn't affect 3.x: has someone checked, and if not, can someone 
do so, so we know whether a 3.x fix version is needed for this?



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-04-01 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221854#comment-15221854
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Thanks, the patch is OK.

In fact, the differences were produced by another bug, for which I will create 
a separate ticket.



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-30 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218940#comment-15218940
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

I tested the patch on a dev environment containing production data and there 
were still some differences.

The test procedure was:
 - use ccm & the branch linked to this ticket (I verified that the classpath is 
ok)
 - copy a fresh backup of production data
 - do a first full repair -> it found some differences on all CFs, but this can 
be explained by a small delay when snapshotting all hosts for the backup (this 
cluster receives a few thousand writes per second)
 - do a second full repair -> only one CF had no differences (the one without 
range tombstones); all others had differences (while they should not, because 
there are no reads & writes on this dev environment)

I will continue to investigate and try to isolate those differences...



[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-23 Thread Fabien Rousseau (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208851#comment-15208851
 ] 

Fabien Rousseau commented on CASSANDRA-11349:
-

Nice patch.

I will be able to test it on a dev environment either this week or at the 
beginning of next week.

There is still one case not covered (though I'm not sure it can happen).
Suppose that in SSTable 1 there is a range tombstone covering the columns "a" 
through "g" at time t1, and in SSTable 2 there is a range tombstone covering 
the columns "c" through "d" at time t2.
If those two SSTables are merged (for example on another replica), the result 
is split into 3 range tombstones in one SSTable ("a" -> "b" at t1, "c" -> "d" 
at t2, "e" -> "g" at t1).
Computing the merkle trees for those two hosts will still differ (not the same 
range tombstones).

As said above, I'm not sure this can happen, and anyway, this patch is a good 
improvement and probably covers 99% of cases.
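
Sketched out (illustration only, assuming t2 > t1):

{noformat}
host that merged the sstables:      host that did not:
  RT [a..b]@t1                        RT [a..g]@t1
  RT [c..d]@t2                        RT [c..d]@t2
  RT [e..g]@t1
-> same covered data and timestamps, different RT sets, different digests
{noformat}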





[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-23 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208345#comment-15208345
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:




I gave the patch some more thought and I'm now confident that the proposed 
change is the best way to address the issue.

Basically, what happens during validation compaction is that a scanner is 
created for each sstable. The {{CompactionIterable.Reducer}} will then create a 
{{LazilyCompactedRow}} with an iterable of {{OnDiskAtom}}s for the same key in 
each sstable. The purpose of {{LazilyCompactedRow}} during validation 
compaction is to create a digest of the compacted version of all atoms that 
would represent a single row. This is done cell by cell, where each collection 
of atoms for a single cell name is consumed by {{LazilyCompactedRow.Reducer}}.

The decision on whether {{LazilyCompactedRow.Reducer}} should finish merging a 
cell and move on to the next one is currently made by 
{{AbstractCellNameType.onDiskAtomComparator}}, as evaluated by 
{{MergeIterator.ManyToOne}}. However, the comparator does not compare only by 
name, but also by {{DeletionTime}} in the case of {{RangeTombstone}}s. As a 
consequence, {{MergeIterator.ManyToOne}} will advance when two 
{{RangeTombstone}}s with different deletion times are read, which breaks the 
"_will be called one or more times with cells that share the same column name_" 
contract in {{LazilyCompactedRow.Reducer}}.

The submitted patch introduces a new {{Comparator}} that basically works like 
{{onDiskAtomComparator}} but does not compare deletion times. As simple as 
that.
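
In outline, the new comparator could look something like this (a sketch only, 
modulo the exact names; the real code is in the branches below):

{code}
// Same ordering as onDiskAtomComparator, except that two RangeTombstones
// covering the same interval compare as equal even when their DeletionTimes
// differ, so MergeIterator.ManyToOne hands them to reduce() in one round.
public Comparator<OnDiskAtom> onDiskAtomComparatorIgnoringDeletionTime()
{
    return new Comparator<OnDiskAtom>()
    {
        public int compare(OnDiskAtom c1, OnDiskAtom c2)
        {
            // order by cell name first, exactly like onDiskAtomComparator
            int comp = AbstractCellNameType.this.compare(c1.name(), c2.name());
            if (comp != 0)
                return comp;

            if (c1 instanceof RangeTombstone)
            {
                if (c2 instanceof RangeTombstone)
                    // compare the covered interval only; deliberately no
                    // fallback to RangeTombstone.data (the DeletionTime)
                    return AbstractCellNameType.this.compare(((RangeTombstone) c1).max,
                                                             ((RangeTombstone) c2).max);
                return -1;
            }
            return c2 instanceof RangeTombstone ? 1 : 0;
        }
    };
}
{code}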


||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|


The only other places where {{LazilyCompactedRow}} is used besides validation 
compaction are the cleanup and scrub functions, which shouldn't be affected: 
those work on individual sstables, and I assume there's no case where a single 
sstable can contain multiple identical range tombstones with different 
timestamps.


[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval

2016-03-21 Thread Stefan Podkowinski (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204971#comment-15204971
 ] 

Stefan Podkowinski commented on CASSANDRA-11349:


Looks like the {{MergeIterator.ManyToOne}} logic gets in the way of 
{{LazilyCompactedRow.Reducer}} doing its job. Once two range tombstones with 
different deletion times are about to be merged, the iterator will stop adding 
atoms to the reducer and advance instead.
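
To make that concrete, a hypothetical trace for two RTs over the same interval 
that differ only in timestamp (t1 < t2):

{noformat}
atoms from the sstable scanners:  RT[b..b]@t1, RT[b..b]@t2
onDiskAtomComparator:             names equal, max equal, but
                                  t1.data.compareTo(t2.data) != 0
MergeIterator.ManyToOne:          treats them as distinct -> two reduce() rounds
validation digest:                includes both RTs, while a compacted replica
                                  digests only RT[b..b]@t2 -> tree mismatch
{noformat}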
