[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15792783#comment-15792783 ] Stefan Podkowinski commented on CASSANDRA-11349:

I've looked at some metrics today for one of our clusters that was updated to 2.1.16 a couple of weeks ago. We used to see tens of thousands of SSTables, amounting to many GBs, streamed each night during repairs. With 2.1.16 the number of streamed SSTables went down to almost none. Thanks to everyone involved for fixing this! :)

> MerkleTree mismatch when multiple range tombstones exists for the same
> partition and interval
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-11349
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11349
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Branimir Lambov
>              Labels: repair
>             Fix For: 2.1.16, 2.2.8
>
>         Attachments: 11349-2.1-v2.patch, 11349-2.1-v3.patch, 11349-2.1-v4.patch, 11349-2.1.patch, 11349-2.2-v4.patch
>
>
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync". Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that, if two range tombstones exist for a partition for the same range/interval, they are both included in the Merkle tree computation. But if, for some reason, the two range tombstones were already compacted into a single range tombstone on another node, this results in a Merkle tree difference.
> This is clearly bad because Merkle tree differences then depend on compaction state (and if a partition is deleted and re-created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair, which is not really feasible).
> Below is a list of steps to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE IF NOT EXISTS table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ('a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected
> # between nodes (while none should have been detected)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> The consequences are a costly repair, an accumulation of many small SSTables (up to thousands for a rather short period of time when using VNodes, until compaction absorbs those small files), and also an increased size on disk.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383900#comment-15383900 ] Fabien Rousseau commented on CASSANDRA-11349:

Here is some quick feedback: a patched version of C* 2.1.14 (containing the patch) has been deployed on all of our production clusters for more than a week now, and it has considerably reduced the amount of streaming during repairs. There are still a few differences during repairs, but far fewer than before, and this is "manageable". Thanks to all of you for your help.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361168#comment-15361168 ] Branimir Lambov commented on CASSANDRA-11349:

Tests look OK; all failures are either also failing on the base branch or appeared very recently.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361016#comment-15361016 ] Branimir Lambov commented on CASSANDRA-11349:

Rebased patch here:

|[2.1|https://github.com/blambov/cassandra/tree/11349]|[utests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-dtest/]|
|[2.2|https://github.com/blambov/cassandra/tree/11349-2.2]|[utests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-2.2-testall/]|[dtests|http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-11349-2.2-dtest/]|
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352747#comment-15352747 ] Sylvain Lebresne commented on CASSANDRA-11349:

Had a look here, and I'm more comfortable sticking with [~blambov]'s approach. For 2.1 and 2.2 we're now in "only critical bug fixes" mode, and running things through RTL definitely changes things too much for my comfort. That implies I'm fine with not fixing every possible problem if doing so takes us too far (especially since it's properly fixed in 3.0 and not that many people seem to have reported this), and Branimir's approach seems to make a good enough impact in practice. So [~blambov], could you rebase your patch for 2.1 and 2.2 and run CI? After that, if tests are good, I'm +1 on committing.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333457#comment-15333457 ] Fabien Rousseau commented on CASSANDRA-11349:

Just to let you know that we packaged the patch done by Branimir (as it is the one most likely to be included upstream). We restored one cluster (3 nodes, 100 GB of data per node, affected table is 25 GB) from a snapshot onto new hardware and ran a full repair. So far, so good: only a few differences were found for the affected table (around a hundred vs. a few hundred thousand before), and some differences were expected since repairs had not been run for a few months. We will continue testing by recreating all of our clusters and then deploy it to production (I'll let you know once this is done).
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324477#comment-15324477 ] Stefan Podkowinski commented on CASSANDRA-11349:

You're correct in pointing out that live columns can prevent fully normalizing all RTs using the RTL approach in patch v4. It will still be more accurate than without RTL consolidation, but the question is whether the additional complexity is worth it. If you'd be more comfortable going with the patch you initially suggested, I'm confident that would still be a big improvement.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321446#comment-15321446 ] Branimir Lambov commented on CASSANDRA-11349:

Does this really solve the problem with the test you mentioned? Putting the tombstones through {{RangeTombstoneList}} will normalize them, but they may not be issued in the right position, i.e. the RTL solution only works if the data contains only tombstones. For example, the {{\["b:d:\!","b:\!",1463656272792,"t",1463731877\]}} part from the test above gets issued before a potential token that may come before {{b:d:!}}. The test needs to be extended to include live tokens, for example by adding each of
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'b', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'd', 'a', 1)
{code}
or
{code}
INSERT INTO table1 (c1, c2, c3, c4) VALUES ('b', 'e', 'a', 1)
{code}
after the deletions. The RTL solution will break (in different ways) for at least two of the above. It also has performance implications that I am not really happy to take.

A proper solution is to either fully replicate what RTL does in the tombstone tracker (which may not be worth it so late in the lifespan of 2.1 and 2.2), or make the tombstone tracker wrap around an RTL (which may be inefficient and is still somewhat tricky). If (as Fabien's testing seems to imply) doing the digest update at serialization time solves the majority of the differences and repair pain, I would prefer to stop there.
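(Editor's note: to make the underlying digest problem concrete, here is a minimal, self-contained Java sketch. It is illustrative only: the class, the string encoding of a tombstone and the timestamps are made up, and this is not Cassandra's validation code. It shows that two replicas holding logically equivalent deletions produce different validation hashes as soon as one of them feeds the digest a redundant tombstone.)

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestOrderDemo
{
    // Hypothetical, simplified stand-in for hashing one range tombstone into a validation digest.
    static void update(MessageDigest digest, String start, String end, long markedForDeleteAt)
    {
        digest.update((start + "|" + end + "|" + markedForDeleteAt).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception
    {
        // Replica A: two redundant/overlapping tombstones, as flushed from separate memtables.
        MessageDigest a = MessageDigest.getInstance("MD5");
        update(a, "b", "b", 1463656272791L);
        update(a, "b", "b", 1463656272792L);

        // Replica B: the same logical deletion after compaction collapsed it into one tombstone.
        MessageDigest b = MessageDigest.getInstance("MD5");
        update(b, "b", "b", 1463656272792L);

        // The covered data is identical, yet the digests (and hence the Merkle tree leaves) differ.
        System.out.println(Arrays.equals(a.digest(), b.digest())); // prints: false
    }
}
{code}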
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15312087#comment-15312087 ] Stefan Podkowinski commented on CASSANDRA-11349:

I've now attached a patch for the last mentioned implementation as {{11349-2.1-v4.patch}} and {{11349-2.2-v4.patch}} to the ticket. Test results are as follows (the reported failures cannot be reproduced locally and seem unrelated to me):

||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|

Is anyone willing to take another look and actually commit a patch for this issue? I've pushed my WIP branch [here|https://github.com/spodkowinski/cassandra/commits/WIP2-11349] with individual commits that might help during the review.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15308051#comment-15308051 ] Fabien Rousseau commented on CASSANDRA-11349:

Thanks Stefan. So if I understand correctly, your latest branch does not change how SSTables are serialized on disk (by using two specialized serializers: one for compaction and one for validation) but still solves all cases (or at least all known cases). Any chance that this patch can be included in 2.1.15?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293719#comment-15293719 ] Stefan Podkowinski commented on CASSANDRA-11349:

I've now created a new patch version [here|https://github.com/spodkowinski/cassandra/commit/c8601f8cd3921e754bcbe8c9362cf3d2e7072e1e] that basically combines both of your ideas: doing the digest updates in the serializer and using {{RangeTombstoneList}} to normalize RT intervals. Tests look good; feel free to add your own. [~blambov], can you think of any further cases that would not be covered by this approach?
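(Editor's note: as a rough sketch of the "digest updates in the serializer" part of that idea, using plain JDK classes only rather than Cassandra's actual serializers and atom types, the validation hash can be driven by the bytes that are actually written, so any normalization applied on the write path is automatically reflected in the digest.)

{code}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SerializerDigestSketch
{
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException
    {
        MessageDigest digest = MessageDigest.getInstance("MD5");

        // Every byte written through this stream also updates the digest, so the hash is
        // computed over exactly the representation that gets serialized -- not over whatever
        // intermediate form the merge/tracker happened to produce.
        try (DataOutputStream out = new DataOutputStream(
                new DigestOutputStream(new ByteArrayOutputStream(), digest)))
        {
            // Hypothetical "atoms" standing in for cells and range tombstones.
            out.writeUTF("a:b cell ts=1463656272793");
            out.writeUTF("a:b -> a:b RT ts=1463656272792");
        }

        System.out.println("validation hash: " + new java.math.BigInteger(1, digest.digest()).toString(16));
    }
}
{code}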
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293709#comment-15293709 ] Fabien Rousseau commented on CASSANDRA-11349:

OK, it appears that the initial idea by [~blambov] is sufficient (after some basic testing on our 4th cluster). Nevertheless, I'm surprised that we seem to be the only ones affected by this issue. Maybe that's because it took us some time to realize it and investigate it, and there was no clear sign apart from big streams during repairs plus the data set size increasing too fast. That may explain why not many people have reported it, but there may be others affected out in the wild. That's why it's probably best to try to fix most of it (if it's not possible to fix it entirely), but I also understand that the fewer changes there are, the less risky it is... So I'm good with it either partially fixed or mostly fixed.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15293041#comment-15293041 ] Fabien Rousseau commented on CASSANDRA-11349:

[~blambov] We have 4 clusters impacted by this bug, and for 3 out of 4, what you have in mind works. I still need to verify the 4th one; I'll try to do that today. Regarding 3.0, migrating 60 nodes is not easily done.

[~spo...@gmail.com] Yes, there are 3 RTs on node2 because, in memory, RTs are stored in a RangeTombstoneList (and then serialized). The RangeTombstoneList automatically splits overlapping tombstones.
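(Editor's note: for readers unfamiliar with that class, the following self-contained sketch mimics the splitting behaviour in a much-simplified form — integer bounds and a plain sweep. The real {{RangeTombstoneList}} works on clustering prefixes with inclusion flags, so this only illustrates the idea that overlaps are split and the newest deletion time wins for the overlapping portion.)

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Much-simplified model of RangeTombstoneList normalization: given possibly overlapping
// range tombstones, produce an equivalent set of non-overlapping ranges where each
// elementary interval carries the newest deletion timestamp covering it.
public class NormalizeDemo
{
    record RT(int start, int end, long timestamp) {}   // inclusive start, exclusive end

    static List<RT> normalize(List<RT> input)
    {
        TreeSet<Integer> bounds = new TreeSet<>();
        for (RT rt : input) { bounds.add(rt.start()); bounds.add(rt.end()); }

        List<RT> out = new ArrayList<>();
        Integer prev = null;
        for (int b : bounds)
        {
            if (prev != null)
            {
                long newest = Long.MIN_VALUE;
                for (RT rt : input)
                    if (rt.start() <= prev && b <= rt.end())
                        newest = Math.max(newest, rt.timestamp());
                if (newest != Long.MIN_VALUE)
                    out.add(new RT(prev, b, newest));
            }
            prev = b;
        }
        return out;
    }

    public static void main(String[] args)
    {
        // Two overlapping deletions with different timestamps.
        List<RT> overlapping = List.of(new RT(0, 10, 1463656272792L), new RT(5, 15, 1463656272793L));
        // Prints [0,5)@...792, [5,10)@...793, [10,15)@...793 -- no overlaps, newest timestamp wins.
        normalize(overlapping).forEach(System.out::println);
    }
}
{code}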
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291211#comment-15291211 ] Stefan Podkowinski commented on CASSANDRA-11349:

I've been debugging the last mentioned error case using the following cql/ccm statements and a local two-node cluster.

{code}
create keyspace ks WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
use ks;

CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};

DELETE FROM table1 USING TIMESTAMP 1463656272791 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272792 WHERE c1 = 'a' AND c2 = 'b';
ccm node1 flush
DELETE FROM table1 USING TIMESTAMP 1463656272793 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'd';
ccm node1 flush
{code}

Timestamps have been added for easier tracking of the specific tombstones in the debugger. ColumnIndex.Builder.buildForCompaction() will add tombstones to the tracker in the following order:

*Node1*

{{1463656272792: c1 = 'a' AND c2 = 'b'}}
First RT, added to unwritten + opened tombstones.

{{1463656272791: c1 = 'a' AND c2 = 'b' AND c3 = 'c'}}
Overshadowed by the RT added before, while also being older. Will not be added and is simply ignored.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}}
Overshadowed by the first and only RT added to opened so far, but newer, and will thus be added to unwritten + opened.

We end up with 2 unwritten tombstones (..92 + ..93) passed to the serializer for the message digest.

*Node2*

{{1463656272792: c1 = 'a' AND c2 = 'b'}} (EOC.START)
First RT, added to unwritten + opened tombstones.

{{1463656272793: c1 = 'a' AND c2 = 'b' AND c3 = 'd'}} (EOC.END)
Comparison of the EOC flag (Tracker:251) of the previously added RT will cause it to be removed from the opened list (Tracker:258). Afterwards the current RT will be added to unwritten + opened.

{{1463656272792: c1 = 'a' AND c2 = 'b'}} ({color:red}again!{color})
Gets compared with the previously added RT, which supersedes the current one and thus stays in the list. Will again be added to the unwritten + opened lists.

We end up with 3 unwritten RTs, including 1463656272792 twice. I still haven't been able to pinpoint exactly why the reducer is called twice with the same TS, but since [~blambov] explicitly mentioned that possibility, I guess it's intended behavior (but why? :)).
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15290648#comment-15290648 ] Branimir Lambov commented on CASSANDRA-11349:

There will be cases where this {{RangeTombstoneList}} solution is not sufficient (e.g. inserting {{c1 = 'a' AND c2 = 'b' AND c3 = 'a'}} at the end of the test above). Is it imperative that we fix all scenarios here if 3.0 has the proper solution?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287685#comment-15287685 ] Fabien Rousseau commented on CASSANDRA-11349:

OK, this seems like a better approach (with it, RTs are added to the tracker through the ColumnIndex.add method). I had some time to test it in a dev environment and repair did not find any difference (which is a good thing).

Regarding the case that is not working correctly, I think the solution is to use a RangeTombstoneList before writing RangeTombstones. The current implementation of Tracker.writeUnwrittenTombstones(...) is:
{noformat}
for (RangeTombstone rt : unwrittenTombstones)
{
    size += writeTombstone(rt, out, atomSerializer);
}
{noformat}
and should be replaced by:
{noformat}
RangeTombstoneList rtl = new RangeTombstoneList(comparator, unwrittenTombstones.size());
for (RangeTombstone rt : unwrittenTombstones)
{
    rtl.add(rt);
}
for (RangeTombstone rt : rtl)
{
    size += writeTombstone(rt, out, atomSerializer);
}
{noformat}
I haven't tested this but it should work. The explanation is the following:
- on node1, due to the flushes, each RT is written in its own SSTable
- on node2, because all RTs are kept in memory, they are kept in a RangeTombstoneList. This RangeTombstoneList only keeps non-overlapping RTs.

During repair, on node1, RTs are merged but kept as-is (i.e. some RTs can overlap), while on node2 they can't. By using the RangeTombstoneList before serializing the unwritten RTs, no RT can overlap another RT.

Note: doing the change above will also change the way RTs are serialized during normal compactions...
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280088#comment-15280088 ] Stefan Podkowinski commented on CASSANDRA-11349:

Thanks for the clarification. It's really helpful for understanding how those parts are supposed to work together. The serializer approach seems to be a good way to handle this, but there are still [cases|https://github.com/spodkowinski/cassandra-dtest/blob/b110685bceddbcb63ebc744ba54a25cb268f2478/repair_tests/repair_test.py#L438:L451] \[1\] not handled correctly. I'm going to take a closer look to understand why. I'd also like to do some more testing for potential digest mismatch storms during rolling upgrades, but I wouldn't expect any blockers so far.

\[1\] nosetests repair_tests/repair_test.py:TestRepair.shadowed_range_tombstone_digest_parallel_repair_test
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278032#comment-15278032 ] Branimir Lambov commented on CASSANDRA-11349:

Not precisely. One part of the problem is that we cannot ensure that e.g. the same range tombstone will not come twice from the same sstable (in which case {{MergeIterator}} would issue two separate {{getReduced}} calls) or from two different sstables (in which case {{MergeIterator}} would call {{getReduced}} once with both), or some complex combination of these. Another is that while compaction uses the tracker to identify when a tombstone is redundant and can be omitted, {{getReduced}} does not have that information at the time it processes that tombstone, because the covering tombstone has not arrived yet. The tracker can properly resolve these situations, but it cannot do so without delaying, which is what makes abusing the serializer necessary.

The reducer only adds RTs to the tracker if it would not return them in the output for some reason (e.g. expiration), the point being to always pass on the full stream of RTs; the order should not be affected by it choosing to do that.
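To make the same-source vs. different-sources distinction concrete, here is a toy merge in plain Java (not Cassandra's {{MergeIterator}}; the class name and tombstone strings are made up for illustration). A reduce "round" collects at most one element per source, so two identical tombstones only reach the reducer together when they sit at the heads of different sources:

{code}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Toy model: one "round" = one reduce call, taking at most one element per source.
public class MergeRoundsSketch
{
    static List<List<String>> mergeRounds(List<Deque<String>> sources)
    {
        List<List<String>> rounds = new ArrayList<>();
        while (true)
        {
            String min = null;
            for (Deque<String> s : sources)
                if (!s.isEmpty() && (min == null || s.peek().compareTo(min) < 0))
                    min = s.peek();
            if (min == null)
                break;
            List<String> round = new ArrayList<>();
            for (Deque<String> s : sources)
                if (!s.isEmpty() && s.peek().equals(min))
                    round.add(s.poll()); // at most one element per source per round
            rounds.add(round);
        }
        return rounds;
    }

    public static void main(String[] args)
    {
        // two sstables each holding the same tombstone -> one round containing both copies
        List<Deque<String>> twoSources = Arrays.asList(
                new ArrayDeque<>(Arrays.asList("RT[a:b]")),
                new ArrayDeque<>(Arrays.asList("RT[a:b]")));
        System.out.println(mergeRounds(twoSources)); // [[RT[a:b], RT[a:b]]]

        // one sstable holding the tombstone twice -> two separate rounds
        List<Deque<String>> oneSource = Arrays.asList(
                new ArrayDeque<>(Arrays.asList("RT[a:b]", "RT[a:b]")));
        System.out.println(mergeRounds(oneSource)); // [[RT[a:b]], [RT[a:b]]]
    }
}
{code}

In the second case the duplicates reach the reducer in separate calls, which is why fixing the reducer alone is not enough.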
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277977#comment-15277977 ] Stefan Podkowinski commented on CASSANDRA-11349:

To quickly sum up the current behavior: a {{ColumnIndex.Builder}} is created for each {{LazilyCompactedRow.update()}} call. The builder iterates through all atoms produced by the {{MergeIterator}} and uses a {{RangeTombstone.Tracker}} instance for tombstone normalization. Tombstones are added to the tracker from {{Builder.add()}} and by {{LCR.Reducer.getReduced()}}, which in turn is called once for all atoms for the same column as considered by {{onDiskAtomComparator}}.

[~blambov], so what you're saying is that we can't be sure that the {{MergeIterator}} will always be able to provide deterministically ordered values, as the write order may be different, and we therefore cannot simply iterate through the reducer to create a correct digest.

What I'm a bit concerned about while trying to understand Branimir's approach is that at some point {{getReduced()}} will add the RT to the tracker, while in another scenario the RT will be added later and will cause the serializer to be called differently as well. To put it in other words: if we can't be sure about the reducer returning deterministically ordered values, won't this affect the tracker and digest calculation in the builder as well?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272427#comment-15272427 ] Branimir Lambov commented on CASSANDRA-11349: - I have something like [this|https://github.com/apache/cassandra/compare/trunk...blambov:11349] in mind.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270572#comment-15270572 ] Branimir Lambov commented on CASSANDRA-11349:

As I see it, neither solution will be sufficient. A lot of the visible effects of the problem come as a side effect of CASSANDRA-7953, but there are some underlying issues that are only really solved in 3.0 by the new tombstone handling from CASSANDRA-8099.

Whether we change {{onDiskAtomComparator}} or not, we will still get disordered or multiple equal range tombstones from a single source, as that's how they are written in the sstables. {{MergeIterator}} will not combine equal entries from the same source; even if it did, and everything was written using the {{onDiskAtomComparator}} (which I don't believe to be the case), the data would still be in the wrong order for resolving which tombstones can be deleted without delaying their processing.

In other words, the problem cannot be solved by changing the reducer. We can, however, do it if we change {{update}} to follow closely or, better still, _call_ {{IndexBuilder.buildForCompaction}}, make the builder accept a prepared atom serializer (or some subinterface) instead of an output file, and update the digest in the calls to that serializer.
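A rough sketch of that serializer idea, using hypothetical interface names (the real builder and atom serializer types in the codebase look different), could be a decorating sink that folds every written atom into the validation digest:

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical stand-ins for the real atom and serializer types; only the
// decorating idea is shown here.
interface Atom { String serializedForm(); }
interface AtomSink { void write(Atom atom); }

// Wraps whatever sink the index builder writes to and folds every atom it sees
// into the digest, so the digest reflects exactly the post-tracker,
// normalized stream of atoms.
class DigestingAtomSink implements AtomSink
{
    private final AtomSink delegate;     // a real output file, or a no-op sink for validation
    private final MessageDigest digest;

    DigestingAtomSink(AtomSink delegate, MessageDigest digest)
    {
        this.delegate = delegate;
        this.digest = digest;
    }

    public void write(Atom atom)
    {
        digest.update(atom.serializedForm().getBytes(StandardCharsets.UTF_8));
        delegate.write(atom);
    }
}
{code}

Validation would then pass a no-op delegate instead of an output file, so the digest is computed from exactly the atoms the builder would have written.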
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15268627#comment-15268627 ] Stefan Podkowinski commented on CASSANDRA-11349:

Sounds reasonable, and I agree that the code changes should be as little invasive as possible. We're talking about 2.x, so we should avoid heavy refactoring. The mentioned class design could possibly still be improved, but that depends on where to go from here.

To wrap up the available patch options:
1) {{11349-2.1.patch}} with 2 changed lines would address the issue initially described
2) {{11349-2.1-v3.patch}} introduces a few more changes but will also create correct digests for shadowed range tombstones and cells (see [dtest|https://github.com/spodkowinski/cassandra-dtest/blob/CASSANDRA-11349/repair_tests/repair_test.py#L425])

Any opinions on this apart from Fabien's and mine? It would be good to get some feedback from someone who would actually be willing to commit something like this.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266682#comment-15266682 ] Fabien Rousseau commented on CASSANDRA-11349:

Great. I just created a new patch (11349-2.1-v3.patch) where the 'update' method is empty in the validation tracker (in fact, it was a left-over from previous attempts and should have been empty).

The main difference between, for example, the "update" method of the regular compaction and the addRangeTombstone of the validation compaction is the returned value. In the latter case, it's whether the RT is superseded/shadowed by another previously seen tombstone. To be honest, I did not manage to factor the two of them together without compromising readability, even if they share some similarities.

I'm a bit skeptical about ValidationCompactionTracker extending RegularCompactionTracker, because RegularCompactionTracker has more fields (unwrittenTombstones, atomCount) which would not be used by the ValidationCompactionTracker (and it feels odd to have unused fields). Doing it the other way around, i.e. RegularCompactionTracker extending ValidationCompactionTracker, seemed a better fit (RegularCompactionTracker reuses the comparator and openedTombstones, and adds more fields), but there is not much to win: only the isDeleted method is in common... Thus the interface did not seem a bad choice: implementations are less coupled (and could diverge more in the future if needed). But this can be changed if needed (I just wanted to explain the design choices and am not opposed to inheritance).

I agree that this way of doing it is a "leaky abstraction". Nevertheless, the main idea is to have a patch making minimal architectural changes to the current code base (I did not want to refactor anything) to avoid introducing bugs. Moreover, because 3.X and 3.0.X are not affected, this will stay in the 2.1.X and 2.2.X branches (and won't be technical debt). Anyway, it's more a pragmatic solution than an elegant one (and evidently, I am open to a more elegant solution).
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15266474#comment-15266474 ] Stefan Podkowinski commented on CASSANDRA-11349:

I'm not sure introducing a new tracker interface is the best way to handle this. It took me a while to actually figure out the differences between the {{update}} implementations in both trackers, since for the most part they share the same copied code. It would probably be better to have ValidationTracker subclass RegularCompactionTracker, with {{remove/addUnwrittenTombstone}} implemented as empty methods for validation. The {{addRangeTombstone}} semantics also look like a case of leaky abstraction to me: it adds nothing at all for regular compaction, but serves as an early exit path for validation.

The good news is that the dtests and unit tests seem to pass with the patch. :)
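For illustration only, the suggested subclassing would look roughly like this (hypothetical shapes; the real {{RangeTombstone.Tracker}} API is richer than these stubs):

{code}
import java.util.ArrayList;
import java.util.List;

// Regular compaction tracks which tombstones still have to be written to the output sstable.
class RegularCompactionTracker
{
    protected final List<Object> unwrittenTombstones = new ArrayList<>();

    void addUnwrittenTombstone(Object rt)    { unwrittenTombstones.add(rt); }
    void removeUnwrittenTombstone(Object rt) { unwrittenTombstones.remove(rt); }
}

// Validation compaction produces no output sstable, so the write-side
// bookkeeping simply becomes a no-op instead of a separate interface.
class ValidationTracker extends RegularCompactionTracker
{
    @Override
    void addUnwrittenTombstone(Object rt)    { /* nothing to write */ }

    @Override
    void removeUnwrittenTombstone(Object rt) { /* nothing to write */ }
}
{code}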
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262171#comment-15262171 ] Fabien Rousseau commented on CASSANDRA-11349:

Ok, I uploaded a new version of the patch (11349-2.1-v2.patch). As said above, there are now two Tracker implementations: one for regular compaction and another for validation compaction. It solves both cases described here (the one in the ticket plus the one in Stefan's comment) and CASSANDRA-11477.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15254670#comment-15254670 ] Fabien Rousseau commented on CASSANDRA-11349:

Sorry for not being reactive lately, I'm rather busy atm... I'd be more than happy[1] to see this patch in the next release. I haven't tested it yet and can probably find some time next week to test it on a dev cluster if it can help. Nevertheless, I won't be able to tell if it really worked, because there will still be some mismatches (due to CASSANDRA-11477).

I have started working on a patch which should be able to handle both CASSANDRA-11477 and the last edge case. What it basically does:
- Tracker is now an interface
- there are two implementations: one called RegularCompactionTracker and another called ValidationCompactionTracker
- the ColumnIndexer.Builder has one more optional parameter: a boolean to know if it is built for validation
- the RegularCompactionTracker is identical to the existing Tracker plus one empty method
- the ValidationCompactionTracker is similar to the existing Tracker but retains only opened tombstones (most methods are thus empty)
- the Reducer changed slightly, but its behaviour is the same regarding regular compactions

I can share it if you're interested (the code compiles, but I still haven't tested it at all; I plan to do that soon and share it afterwards).

[1] Just to share more information: those issues are important to us, because a few of our clusters are impacted. A few days after filing the bug, we decided to temporarily stop repairing some tables (knowing that we could live with inconsistencies on those tables) which were heavily impacted by those bugs (each repair increased disk occupancy by a few percent), and did a major compaction. This resulted in two to three times less disk occupancy (one table shrank from 243GB to 79GB). Note that this was not due to tombstones reclaiming old data: it has been nearly a month now, the big SSTable resulting from the major compaction is still there, and disk usage has not grown that much.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253514#comment-15253514 ] Stefan Podkowinski commented on CASSANDRA-11349: Can we keep the conversation going to get this patch into the next 2.x release?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230127#comment-15230127 ] Stefan Podkowinski commented on CASSANDRA-11349:

[~frousseau], what makes things more complicated here is that changes to LCR will affect regular compactions as well. Adding all tombstones as expired in your {{11349-2.1-v2.patch}} will have unwanted side effects for regular compactions, e.g. try {{RangeTombstoneMergeTest}} with it.

I've now spent some time trying to make use of the RT.Tracker there, but without much success. Adding non-expired range tombstones to the tracker from within LCR would cause corrupted sstables. Even special-casing validation compaction would not handle all potential tombstone shadowing scenarios and will probably cause more harm than good (and potential digest mismatch storms). I'm not even sure it's possible given the current iterative MergeIterator > LazilyCompactedRow > RT.Tracker interaction. I'm now at a point where I'd suggest just sticking with {{11349-2.1.patch}}, unless someone else has a better idea how to solve this.

I've updated the [dtest PR|https://github.com/riptano/cassandra-dtest/pull/881] with two of the described shadowing scenarios that will only work with 3.0+ even after the patch, if someone wants to give it a try.

Cassci results for {{11349-2.1.patch}}:
||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227163#comment-15227163 ] Fabien Rousseau commented on CASSANDRA-11349:

Using the RangeTombstone.Tracker can help in the situation described just above. In fact, the RT should always update the tracker (see CASSANDRA-11477). The trick here is to always consider it as "expired" in the tracker (even if it is not), so the tombstones are not accumulated during compaction (if expired, the tracker keeps only the list of opened RTs; if not, it keeps all unwritten RTs, i.e. all RTs, because it's a validation compaction...). Having a look at the update method of the Tracker, it already checks whether the tombstone is superseded by another one (and doesn't add it as "opened" if superseded).

Thus, the v2 patch:
- includes the previous patch
- always updates the tracker with the RT (considering it as expired even if it is not, just to avoid retaining too many of them in memory; because it's for validation, it's read-only and won't affect anything)
- tests whether the RT was added to the openedTombstones list and, if that's not the case, skips it for the digest

I know that the patch may be a bit rough (at least on the "isLastOpened" method), but it is more to validate the approach first, and I did not want the patch to be too invasive (by modifying the returned value of the update method). WDYT?

Note: I have not yet tested it against our production data.
Note 2: Regarding the read-repair, this seems to be a different story and I can't see anything for now that could explain those differences (I will dig into this later, as it is less urgent).
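A very loose model of that idea, with made-up names and a deliberately simplified supersedence check (the real tracker compares intervals and deletion times), is sketched below; only tombstones the tracker actually opens contribute to the digest:

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayDeque;
import java.util.Deque;

public class TrackerGatedDigestSketch
{
    private final Deque<String> openedTombstones = new ArrayDeque<>();

    // Simplified: an RT is "superseded" if an identical one is already open.
    boolean open(String rt)
    {
        if (openedTombstones.contains(rt))
            return false;
        openedTombstones.addLast(rt);
        return true;
    }

    public static void main(String[] args) throws Exception
    {
        TrackerGatedDigestSketch tracker = new TrackerGatedDigestSketch();
        MessageDigest digest = MessageDigest.getInstance("MD5");
        // the second, identical tombstone is skipped, so a node that already
        // compacted the pair into one RT ends up with the same digest
        for (String rt : new String[]{ "RT[a:b, ts=10]", "RT[a:b, ts=10]" })
            if (tracker.open(rt))
                digest.update(rt.getBytes(StandardCharsets.UTF_8));
        System.out.printf("%032x%n", new java.math.BigInteger(1, digest.digest()));
    }
}
{code}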
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222902#comment-15222902 ] Stefan Podkowinski commented on CASSANDRA-11349:

It makes sense just to modify {{onDiskAtomComparator}}. Given the generic name I assumed the comparator is used in other places as well, but since it's only used in {{LazilyCompactedRow}}, we can just change the patch as suggested and simply remove the timestamp tie-break behaviour in {{onDiskAtomComparator}}.

As for regular compactions, I agree with Tyler that this should not affect compactions the way it does validation compaction. Before the patch, {{LazilyCompactedRow}} would not reduce both RTs, but would instead have {{ColumnIndex.buildForCompaction()}} iterate over both RTs and have them added to the {{RangeTombstone.Tracker}}. The tracker would merge them the way {{LCR.Reducer.getReduced}} would after the patch. However, I'm not fully sure whether there could be other, more complex cases where this would still cause problems.

Although the patch should fix the described issue, the way we deal with RTs during validation compaction is still not ideal. The problem is that LCR lacks some handling of relationships between RTs compared to {{RangeTombstone.Tracker}}. If we create digests column by column, we get wrong results for shadowing tombstones not sharing the same intervals.

{noformat}
CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
USE test_rt;
CREATE TABLE IF NOT EXISTS table1 (
    c1 text,
    c2 text,
    c3 text,
    c4 float,
    PRIMARY KEY (c1, c2, c3)
) WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': 'false'};

DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b' AND c3 = 'c';

ccm node1 flush

DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';

ccm node1 repair test_rt table1
{noformat}

In this case the (c1, c2, c3) RT will always be repaired after it has been compacted with (c1, c2) on any node. So I'm wondering if we shouldn't take a bolder approach here than the patch does.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222365#comment-15222365 ] Tyler Hobbs commented on CASSANDRA-11349:

I think there are a couple of things wrong with the current patch. First, the comparator needs to continue to compare the full cell name first, and only break ties on range tombstones with {{compare(t1.max, t2.max)}}.

Second, I believe we should remove the timestamp tie-breaking behavior from the comparator in general, and not just for validation compactions. In other words, I think we're doing the comparison incorrectly for all compactions right now. We want the comparison to return 0 whenever range tombstones have equal names and ranges, even if they have different timestamps. This will result in {{LazilyCompactedRow.Reducer.reduce()}} being called in one round with each of the tombstones that only differ in timestamp. The logic in {{LCR.Reducer.reduce()}} already handles the case of multiple range tombstones with different timestamps by picking the one with the highest timestamp, so these will correctly be reduced to a single RT. It looks like the current codebase will keep both range tombstones during a compaction, which isn't necessarily harmful, but is suboptimal. For repair purposes, though, this is incorrect as it produces a different digest.

To summarize: I think all we need to do is remove the timestamp tie-breaking logic from the existing comparator. [~slebresne] should double-check my logic, though.
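Concretely, the "different digest" effect is easy to see with a toy sketch in plain Java (not the actual hashing code; the tombstone strings are made up): hashing two identical range tombstones cannot give the same result as hashing the single tombstone they compact to.

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class TombstoneDigestSketch
{
    // Toy model: each atom contributes its serialized form to the partition digest.
    static byte[] digestOf(String... atoms) throws Exception
    {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String atom : atoms)
            md.update(atom.getBytes(StandardCharsets.UTF_8));
        return md.digest();
    }

    public static void main(String[] args) throws Exception
    {
        byte[] notYetCompacted = digestOf("RT[a:b, ts=10]", "RT[a:b, ts=10]"); // node A
        byte[] compacted       = digestOf("RT[a:b, ts=10]");                   // node B
        // same logical data, different Merkle leaf hashes -> ranges reported "out of sync"
        System.out.println(MessageDigest.isEqual(notYetCompacted, compacted)); // false
    }
}
{code}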
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222330#comment-15222330 ] Richard Low commented on CASSANDRA-11349: - I'm also not sure how this is meant to fix it. Special casing validation compaction may fix repairs but you'd still get the digest mismatches on reads.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222304#comment-15222304 ] Michael Kjellman commented on CASSANDRA-11349: -- And just for my sanity and for discussion in the Jira, here is the current handling in the comparator:
{code}
if (c1 instanceof RangeTombstone)
{
    if (c2 instanceof RangeTombstone)
    {
        RangeTombstone t1 = (RangeTombstone)c1;
        RangeTombstone t2 = (RangeTombstone)c2;
        int comp2 = AbstractCellNameType.this.compare(t1.max, t2.max);
        return comp2 == 0 ? t1.data.compareTo(t2.data) : comp2;
    }
    else
    {
        return -1;
    }
}
else
{
    return c2 instanceof RangeTombstone ? 1 : 0;
}
{code}
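For discussion, here is a self-contained toy model (these are not Cassandra's real classes; {{Rt}} and its fields are made up for illustration) of what this comparator does when two tombstones cover the same interval but carry different deletion times:
{code}
import java.util.Comparator;

// Toy stand-in for a range tombstone: an interval plus its deletion timestamp.
final class Rt
{
    final String min, max;
    final long markedForDeleteAt;

    Rt(String min, String max, long markedForDeleteAt)
    {
        this.min = min;
        this.max = max;
        this.markedForDeleteAt = markedForDeleteAt;
    }
}

public class ComparatorModel
{
    // Same shape as the comparator quoted above: order by interval end first,
    // then break ties using the deletion time.
    static final Comparator<Rt> likeOnDiskAtomComparator =
        Comparator.comparing((Rt t) -> t.max)
                  .thenComparingLong(t -> t.markedForDeleteAt);

    public static void main(String[] args)
    {
        Rt first  = new Rt("b", "b", 1000L); // tombstone from the first DELETE
        Rt second = new Rt("b", "b", 2000L); // same interval, written again later

        // Non-zero result: the merge iterator treats these as two distinct atoms,
        // so they are hashed separately instead of being merged into one.
        System.out.println(likeOnDiskAtomComparator.compare(first, second)); // < 0
    }
}
{code}
A replica that has already compacted the two tombstones into one hashes a single atom for this interval, so the two replicas end up with different MerkleTree leaves even though they hold equivalent data.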
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1597#comment-1597 ] Michael Kjellman commented on CASSANDRA-11349: -- [~spo...@gmail.com] [~slebresne] [~frousseau] I'm confused here. Why should repair be special-cased over normal compaction in this case? If the times are different then you *do* still need to resolve it, as you need to take the greater time. It seems to me the crux of the current patch is to "fix" this by special-casing the comparator to compare just the max value of the interval during repair validation:
{code}
// only compare interval, but not deletion time
+ return AbstractCellNameType.this.compare(((RangeTombstone)c1).max, ((RangeTombstone)c2).max);
{code}
I just did my best to merge and compare the code between 2.0 and 2.1, and I'm still trying to parse how this code is different in 2.0 vs. 2.1... We've been unable to reproduce this in 2.0 so far, but the bits of the code being touched here don't seem to be different, so I'm trying to understand why 2.1 would hit this and not 2.0. Could you please explain a bit more why we can ignore the timestamp?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221919#comment-15221919 ] Fabien Rousseau commented on CASSANDRA-11349: - I tested against 3.0.4 and it is not affected (I have not tested 3.X, but assume it is not affected either). There is another, similar ticket, and 3.0.4 is not affected there either: https://issues.apache.org/jira/browse/CASSANDRA-11477
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221878#comment-15221878 ] Sylvain Lebresne commented on CASSANDRA-11349: -- I suspect this doesn't affect 3.x: has someone checked, and if not, can someone do so, so that we know whether a 3.x fix version is needed for this?
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221854#comment-15221854 ] Fabien Rousseau commented on CASSANDRA-11349: - Thanks, the patch is OK. In fact, the remaining differences were produced by another bug, which I will file as a separate ticket.
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218940#comment-15218940 ] Fabien Rousseau commented on CASSANDRA-11349: - I tested the patch on a dev environment containing production data and there were still some differences. The test procedure was:
- use ccm & the branch linked to this ticket (I verified that the classpath is ok)
- copy a fresh backup of production data
- do a first full repair -> it had some differences on all CFs, but this can be explained by a small delay when snapshotting all hosts for the backup (this cluster receives a few thousand writes per second)
- do a second full repair -> only one CF had no differences (the one without range tombstones); all others had differences (while they should not, because there are no reads & writes on this dev environment)
I will continue to investigate and try to isolate those differences...
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208851#comment-15208851 ] Fabien Rousseau commented on CASSANDRA-11349: - Nice patch. I will be able to test it on a dev environment either this week or at the beginning of next week. There is still one case not covered (though I'm not sure it can happen). Suppose that in SSTable 1 there is a range tombstone covering the columns "a" through "g" at time t1, and in SSTable 2 there is a range tombstone covering the columns "c" through "d" at time t2. If those two SSTables are merged (for example on another replica), the result is split into three range tombstones in one SSTable (one range tombstone "a" -> "b" at t1, "c" -> "d" at t2, "e" -> "g" at t1). Computing the merkle tree on those two hosts will then still give different results (not the same range tombstones). As said above, I'm not sure this can happen, and in any case the patch is a good improvement and probably covers 99% of cases.
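To make that scenario concrete, here is a hypothetical layout (letters are clustering values, t2 > t1; the exact split boundaries depend on how the tombstones are normalised):
{noformat}
node A (sstables already compacted together):
  RT["a","b"]@t1   RT["c","d"]@t2   RT["e","g"]@t1

node B (sstables still separate, merged only at validation time):
  RT["a","g"]@t1   RT["c","d"]@t2
{noformat}
Both nodes shadow the same cells, but they serialise different sets of range tombstones, so the digests of this partition differ and repair streams data that is not actually out of sync.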
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208345#comment-15208345 ] Stefan Podkowinski commented on CASSANDRA-11349: I gave the patch some more thought and I'm now confident that the proposed change is the best way to address the issue. Basically, what happens during validation compaction is that a scanner is created for each sstable. The {{CompactionIterable.Reducer}} then creates a {{LazilyCompactedRow}} with an iterable of {{OnDiskAtom}} for the same key in each sstable. The purpose of {{LazilyCompactedRow}} during validation compaction is to create a digest of the compacted version of all atoms that represent a single row. This is done cell by cell, where each collection of atoms for a single cell name is consumed by {{LazilyCompactedRow.Reducer}}. The decision on whether {{LazilyCompactedRow.Reducer}} should finish merging a cell and move on to the next one is currently made by {{AbstractCellNameType.onDiskAtomComparator}}, as evaluated by {{MergeIterator.ManyToOne}}. However, that comparator does not only compare by name, but also by {{DeletionTime}} in the case of {{RangeTombstone}}. As a consequence, {{MergeIterator.ManyToOne}} will advance when two {{RangeTombstone}}s with different deletion times are read, which breaks the "_will be called one or more times with cells that share the same column name_" contract of {{LazilyCompactedRow.Reducer}}. The submitted patch introduces a new {{Comparator}} that basically works like {{onDiskAtomComparator}} but does not compare deletion times. As simple as that.
||2.1||2.2||
|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.1]|[branch|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-11349-2.2]|
|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.1-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/spodkowinski/job/spodkowinski-CASSANDRA-11349-2.2-dtest/]|
The only places other than validation compaction where {{LazilyCompactedRow}} is used are the cleanup and scrub functions, which shouldn't be affected, as those work on individual sstables and I assume there's no case where a single sstable can contain multiple identical range tombstones with different timestamps.
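As a rough sketch of the idea behind the patch (a toy model only, not the actual patch code or Cassandra's {{LazilyCompactedRow}}/{{MergeIterator}} classes): if atoms are grouped by interval only, both copies of a tombstone land in the same group and the reducer can collapse them to the newest one before anything is hashed, so replicas digest the same value whether or not their sstables were compacted beforehand:
{code}
import java.util.*;
import java.util.stream.*;

public class ValidationGroupingSketch
{
    // Toy range tombstone: interval plus deletion timestamp (illustrative only).
    record Rt(String min, String max, long markedForDeleteAt) {}

    public static void main(String[] args)
    {
        // Atoms for one partition as produced by merge-sorting two sstable scanners:
        // the same interval deleted twice with different timestamps.
        List<Rt> mergedAtoms = List.of(new Rt("b", "b", 1000L),
                                       new Rt("b", "b", 2000L));

        // Group by interval only (deletion time deliberately ignored), then keep the
        // tombstone with the highest timestamp in each group before digesting it.
        Map<String, Optional<Rt>> reduced = mergedAtoms.stream().collect(
            Collectors.groupingBy(t -> t.min() + "/" + t.max(),
                                  Collectors.maxBy(Comparator.comparingLong(Rt::markedForDeleteAt))));

        // Prints a single tombstone: Rt[min=b, max=b, markedForDeleteAt=2000]
        reduced.values().forEach(rt -> System.out.println(rt.get()));
    }
}
{code}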
[jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
[ https://issues.apache.org/jira/browse/CASSANDRA-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204971#comment-15204971 ] Stefan Podkowinski commented on CASSANDRA-11349: Looks like the {{MergeIterator.ManyToOne}} logic gets in the way of {{LazilyCompactedRow.Reducer}} doing its job. The iterator stops adding atoms to the reducer and advances as soon as two range tombstones with different deletion times are about to be merged.
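To illustrate the effect, here is a toy model of the "many-to-one" grouping (not the real {{MergeIterator}}): the reducer is handed one run of comparator-equal atoms at a time, so a comparator that includes the deletion time splits two tombstones for the same interval into two runs, and each run is digested on its own:
{code}
import java.util.*;
import java.util.function.Consumer;

public class ManyToOneModel
{
    // Feed each run of consecutive, comparator-equal items to the reducer as one group.
    static <T> void reduceRuns(List<T> sorted, Comparator<T> cmp, Consumer<List<T>> reducer)
    {
        List<T> run = new ArrayList<>();
        for (T item : sorted)
        {
            if (!run.isEmpty() && cmp.compare(run.get(0), item) != 0)
            {
                reducer.accept(run); // close the current run and start a new one
                run = new ArrayList<>();
            }
            run.add(item);
        }
        if (!run.isEmpty())
            reducer.accept(run);
    }

    public static void main(String[] args)
    {
        // Two "tombstones" for the same interval 'b', with deletion times 1000 and 2000,
        // encoded here as simple (name, time) pairs purely for illustration.
        List<long[]> atoms = List.of(new long[]{ 'b', 1000 }, new long[]{ 'b', 2000 });

        Comparator<long[]> nameAndTime = Comparator.<long[]>comparingLong(a -> a[0])
                                                   .thenComparingLong(a -> a[1]);
        Comparator<long[]> nameOnly = Comparator.comparingLong(a -> a[0]);

        reduceRuns(atoms, nameAndTime, run -> System.out.println("digest run of " + run.size())); // two runs of 1
        reduceRuns(atoms, nameOnly,    run -> System.out.println("digest run of " + run.size())); // one run of 2
    }
}
{code}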