[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670641#comment-15670641 ] Hai Zhou commented on CASSANDRA-9045:
Is there any new development on this issue? We are running 2.1.13 and have scheduled repairs running too, but we still see some deleted rows coming back.

> Deleted columns are resurrected after repair in wide rows
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9045
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9045
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Roman Tkachenko
>            Assignee: Marcus Eriksson
>            Priority: Critical
>         Attachments: 9045-debug-tracing.txt, another.txt, apache-cassandra-2.0.13-SNAPSHOT.jar, cqlsh.txt, debug.txt, inconsistency.txt
>
> Hey guys,
> After almost a week of researching the issue and trying out multiple things with (almost) no luck, it was suggested (on the user@cass list) that I file a report here.
> h5. Setup
> Cassandra 2.0.13 (we had the issue with 2.0.10 as well and upgraded to see if it goes away).
> Multi-datacenter cluster of 12+6 nodes.
> h5. Schema
> {code}
> cqlsh> describe keyspace blackbook;
> CREATE KEYSPACE blackbook WITH replication = {
>   'class': 'NetworkTopologyStrategy',
>   'IAD': '3',
>   'ORD': '3'
> };
> USE blackbook;
> CREATE TABLE bounces (
>   domainid text,
>   address text,
>   message text,
>   "timestamp" bigint,
>   PRIMARY KEY (domainid, address)
> ) WITH
>   bloom_filter_fp_chance=0.10 AND
>   caching='KEYS_ONLY' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.10 AND
>   gc_grace_seconds=864000 AND
>   index_interval=128 AND
>   read_repair_chance=0.00 AND
>   populate_io_cache_on_flush='false' AND
>   default_time_to_live=0 AND
>   speculative_retry='99.0PERCENTILE' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'LeveledCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
> {code}
> h5. Use case
> Each row (defined by a domainid) can have very many columns (bounce entries), so rows can get pretty wide. In practice most of the rows are not that big, but some of them contain hundreds of thousands and even millions of columns.
> Columns are not TTL'ed but can be deleted using the following CQL3 statement:
> {code}
> delete from bounces where domainid = 'domain.com' and address = 'al...@example.com';
> {code}
> All queries are performed using LOCAL_QUORUM CL.
> h5. Problem
> We weren't very diligent about running repairs on the cluster initially, but shortly after we started doing it we noticed that some of the previously deleted columns (bounce entries) are there again, as if the tombstones had disappeared.
> I have run this test multiple times via cqlsh, on the row of the customer who originally reported the issue:
> * delete an entry
> * verify it's not returned even with CL=ALL
> * run repair on nodes that own this row's key
> * the columns reappear and are returned even with CL=ALL
> I tried the same test on another row with much less data and everything was correctly deleted and didn't reappear after repair.
> h5. Other steps I've taken so far
> Made sure NTP is running on all servers and clocks are synchronized.
> Increased gc_grace_seconds to 100 days, ran full repair (on the affected keyspace) on all nodes, then changed it back to the default 10 days again. Didn't help.
> Performed one more test: updated one of the resurrected columns, then deleted it and ran repair again. This time the updated version of the column reappeared.
> Finally, I noticed these log entries for the row in question:
> {code}
> INFO [ValidationExecutor:77] 2015-03-25 20:27:43,936 CompactionController.java (line 192) Compacting large row blackbook/bounces:4ed558feba8a483733001d6a (279067683 bytes) incrementally
> {code}
> Figuring it may be related, I bumped "in_memory_compaction_limit_in_mb" to 512MB so the row fits into it, deleted the entry and ran repair once again. The log entry for this row was gone and the columns didn't reappear.
> We have a lot of rows much larger than 512MB, so we can't keep increasing this parameter forever, if that is the issue.
> Please let me know if you need more information on the case or if I can run more experiments.
> Thanks!
> Roman
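For readers following along, the reproduction steps in the description translate into roughly the following cqlsh/nodetool session. This is only a sketch reusing the reporter's schema; the domainid/address values are the placeholders from the description (the truncated address is left as-is), and the repair must be run on the replicas that own the key.
{code}
cqlsh> CONSISTENCY ALL;
cqlsh> DELETE FROM blackbook.bounces
   ...   WHERE domainid = 'domain.com' AND address = 'al...@example.com';
cqlsh> SELECT address, "timestamp" FROM blackbook.bounces
   ...   WHERE domainid = 'domain.com' AND address = 'al...@example.com';
-- expected: (0 rows)

-- on each replica that owns this partition key:
$ nodetool repair blackbook bounces

cqlsh> SELECT address, "timestamp" FROM blackbook.bounces
   ...   WHERE domainid = 'domain.com' AND address = 'al...@example.com';
-- reported bug: the deleted entry is returned again, even at CONSISTENCY ALL
{code}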
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071846#comment-15071846 ] Leonid Kogan commented on CASSANDRA-9045:
Hi there! I'm running apache-cassandra-2.1.7 and am experiencing this issue as well, on a production system with a 5-node cluster. I have run repairs on all nodes, then compaction, and repeated that several times. Nothing helps and I'm absolutely desperate. Does anybody have a clue how to solve it?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704767#comment-14704767 ] Marcus Eriksson commented on CASSANDRA-9045:
Ping [~r0mant] - any updates? Is this still happening?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583297#comment-14583297 ] Marcus Eriksson commented on CASSANDRA-9045:
I've been looking at this again today, but I have to say I have no idea what is going on and am not able to reproduce it. Could you post your current schema (describe table bounces;) and the logs between 2015-06-04T11:31:38 and 2015-06-08T08:27:36 for the nodes involved in your last example? Could you also run tools/bin/sstablemetadata over the sstables on one of those nodes, just to check that the timestamps look OK?
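A sketch of the requested sstablemetadata run, assuming the default data directory layout and that the tool is invoked from the Cassandra installation root (the path is illustrative):
{code}
$ tools/bin/sstablemetadata /var/lib/cassandra/data/blackbook/bounces/*-Data.db
# For each sstable, check that the reported minimum/maximum timestamps fall in the
# expected range and are not in the future relative to the deletes in question.
{code}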
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582508#comment-14582508 ] Roman Tkachenko commented on CASSANDRA-9045:
Hey guys. So I implemented writes at EACH_QUORUM several weeks ago and have been monitoring since, but it does not look like it fixed the issue. Check this out: I pulled the logs from both datacenters for one of the reappeared entries and correlated them with our repair schedule. Reads (GETs) are done at LOCAL_QUORUM.
{code}
Time                 DC    RESULT
=================================
2015-06-04T11:31:38  DC2   GET 200     --> record is present in both DCs
2015-06-04T15:25:01  DC1   GET 200
2015-06-04T19:24:06  DC1   DELETE 200  --> deleted in DC1
2015-06-04T19:45:16  DC2   GET 404     --> record disappeared from both DCs...
2015-06-05T07:10:32  DC1   GET 404
2015-06-05T10:16:28  DC2   GET 200     --> ...but somehow appeared back in DC2 (no POST requests happened for this record)
2015-06-07T18:59:57  DC1   GET 404
2AM                  DC2   NODE REPAIR
4AM                  DC1   NODE REPAIR
2015-06-08T08:27:36  DC1   GET 200     --> record is present in both DCs again, looks like DC2 "repaired" DC1
2015-06-09T15:29:50  DC2   GET 200
2015-06-09T16:05:30  DC1   DELETE 200
2015-06-09T16:05:30  DC1   GET 404
2015-06-09T21:08:24  DC2   GET 404
{code}
So the question is how the record managed to appear back in DC2... Do you have any suggestions on how we can investigate this?
Thanks,
Roman
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548611#comment-14548611 ] Roman Tkachenko commented on CASSANDRA-9045:
Yeah, I did run `nodetool scrub` on all nodes but found out today that one of the records deleted last week has appeared again. I'm going to adjust our application to perform writes at EACH_QUORUM CL as opposed to LOCAL_QUORUM and see if that helps.
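For reference, the consistency-level change being described can be exercised from cqlsh along these lines (a sketch only; the application itself would set the equivalent consistency level on its write statements through whatever driver it uses, and the key values below are placeholders):
{code}
cqlsh> CONSISTENCY EACH_QUORUM;
cqlsh> DELETE FROM blackbook.bounces
   ...   WHERE domainid = 'domain.com' AND address = 'al...@example.com';
cqlsh> CONSISTENCY LOCAL_QUORUM;   -- reads continue at LOCAL_QUORUM
{code}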
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545485#comment-14545485 ] Marcus Eriksson commented on CASSANDRA-9045:
[~r0mant] any updates? Have you run scrub and seen the issue again?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511163#comment-14511163 ] Marcus Eriksson commented on CASSANDRA-9045:
[~r0mant] no, we don't have any way to validate bloom filters. You could run scrub or upgradesstables on the suspected nodes; that will rewrite the bloom filters correctly.
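Either operation can be run per node, scoped to the affected table - a sketch, to be run on each suspected node:
{code}
$ nodetool scrub blackbook bounces
# or, to rewrite every sstable (not only ones on an older format):
$ nodetool upgradesstables -a blackbook bounces
{code}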
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509581#comment-14509581 ] Roman Tkachenko commented on CASSANDRA-9045:
I assumed this, thank you for clarifying. This particular column was deleted previously, but I think this is a symptom of the same bug we're trying to reproduce here.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509580#comment-14509580 ] Roman Tkachenko commented on CASSANDRA-9045:
Hi Marcus. I don't have a full list of affected primary keys, but I have several and they do not seem to correlate to a specific subset of nodes. I guess .151 and .76 appear most often in my reports because they own the key I'm running my experiments on most often. As I mentioned though, it seems to affect "wide" rows only. Is there any way to check for corrupt bloom filters? Can you think of any other kind of data corruption (e.g. in sstables) that could have led to this? The cluster survived several pretty severe outages in the past (like all machines in one of the datacenters rebooting, not at the same time but pretty close), but I can't think of anything that could have done such permanent damage.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508830#comment-14508830 ] Marcus Eriksson commented on CASSANDRA-9045:
Again, grasping at straws here: [~r0mant], do you have a list of primary keys this has happened to? Could you correlate them to a (hopefully) subset of machines? In the provided traces, nodes .151 and .76 seem to appear most often. If there are corrupt bloom filters on one of the nodes, for example, there is a possibility that we could drop tombstones while there is actual data in other sstables.
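One way to map an affected partition key to the machines (and on-disk sstables) that hold it, for this kind of correlation, is sketched below; the key value is a placeholder and nodetool is assumed to be on the PATH:
{code}
$ nodetool getendpoints blackbook bounces domain.com
# then, on each replica printed above:
$ nodetool getsstables blackbook bounces domain.com
{code}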
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507483#comment-14507483 ] Tyler Hobbs commented on CASSANDRA-9045:
Internally, we use Integer.MAX_VALUE for localDeletion to signify that there is no deletion. So, basically, there's not actually a tombstone in the memtable. Did you expect this data to be deleted?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507422#comment-14507422 ] Roman Tkachenko commented on CASSANDRA-9045: Ha, that's weird. In the another.txt - localDeletion is 2147483647, i.e. max int (year 2038 problem). While deletion info for some other cells has "proper" local deletion time. Why is that? > Deleted columns are resurrected after repair in wide rows > - > > Key: CASSANDRA-9045 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9045 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Roman Tkachenko >Assignee: Marcus Eriksson >Priority: Critical > Fix For: 2.0.15 > > Attachments: 9045-debug-tracing.txt, another.txt, > apache-cassandra-2.0.13-SNAPSHOT.jar, cqlsh.txt, debug.txt, inconsistency.txt > > > Hey guys, > After almost a week of researching the issue and trying out multiple things > with (almost) no luck I was suggested (on the user@cass list) to file a > report here. > h5. Setup > Cassandra 2.0.13 (we had the issue with 2.0.10 as well and upgraded to see if > it goes away) > Multi datacenter 12+6 nodes cluster. > h5. Schema > {code} > cqlsh> describe keyspace blackbook; > CREATE KEYSPACE blackbook WITH replication = { > 'class': 'NetworkTopologyStrategy', > 'IAD': '3', > 'ORD': '3' > }; > USE blackbook; > CREATE TABLE bounces ( > domainid text, > address text, > message text, > "timestamp" bigint, > PRIMARY KEY (domainid, address) > ) WITH > bloom_filter_fp_chance=0.10 AND > caching='KEYS_ONLY' AND > comment='' AND > dclocal_read_repair_chance=0.10 AND > gc_grace_seconds=864000 AND > index_interval=128 AND > read_repair_chance=0.00 AND > populate_io_cache_on_flush='false' AND > default_time_to_live=0 AND > speculative_retry='99.0PERCENTILE' AND > memtable_flush_period_in_ms=0 AND > compaction={'class': 'LeveledCompactionStrategy'} AND > compression={'sstable_compression': 'LZ4Compressor'}; > {code} > h5. Use case > Each row (defined by a domainid) can have many many columns (bounce entries) > so rows can get pretty wide. In practice, most of the rows are not that big > but some of them contain hundreds of thousands and even millions of columns. > Columns are not TTL'ed but can be deleted using the following CQL3 statement: > {code} > delete from bounces where domainid = 'domain.com' and address = > 'al...@example.com'; > {code} > All queries are performed using LOCAL_QUORUM CL. > h5. Problem > We weren't very diligent about running repairs on the cluster initially, but > shorty after we started doing it we noticed that some of previously deleted > columns (bounce entries) are there again, as if tombstones have disappeared. > I have run this test multiple times via cqlsh, on the row of the customer who > originally reported the issue: > * delete an entry > * verify it's not returned even with CL=ALL > * run repair on nodes that own this row's key > * the columns reappear and are returned even with CL=ALL > I tried the same test on another row with much less data and everything was > correctly deleted and didn't reappear after repair. > h5. Other steps I've taken so far > Made sure NTP is running on all servers and clocks are synchronized. > Increased gc_grace_seconds to 100 days, ran full repair (on the affected > keyspace) on all nodes, then changed it back to the default 10 days again. > Didn't help. > Performed one more test. Updated one of the resurrected columns, then deleted > it and ran repair again. 
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507344#comment-14507344 ] Marcus Eriksson commented on CASSANDRA-9045: Yeah, I think adding a bunch of debug output is the best idea for now. Question is how to log the right thing... I'll look at that tomorrow
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507339#comment-14507339 ] Roman Tkachenko commented on CASSANDRA-9045: I've got another tracing example I cannot explain - attached another.txt. It is with Tyler's additional debugging info. It shows the memtable deletion info, yet still returns the entry.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505219#comment-14505219 ] Tyler Hobbs commented on CASSANDRA-9045: We've still had no luck reproducing the issue. The only idea I have left is to try to add additional logging to compaction around tombstones. Perhaps we could limit it to a particular key to avoid outrageous log spam. What do you think, [~krummas]?
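One possible shape for that, purely as a sketch (the system property, class, and method names below are made up for illustration; nothing like this exists in the codebase): a guard that emits tombstone details during compaction only when the partition key matches a key supplied via a system property, so everything except the problem row stays quiet.
{code}
import java.nio.ByteBuffer;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper -- not an existing Cassandra class.
public final class KeyScopedDebug
{
    private static final Logger logger = LoggerFactory.getLogger(KeyScopedDebug.class);

    // e.g. -Dcassandra.debug.partition=4ed558feba8a483733001d6a (hex of the key)
    private static final String TARGET = System.getProperty("cassandra.debug.partition", "");

    public static boolean matches(ByteBuffer key)
    {
        return !TARGET.isEmpty() && TARGET.equalsIgnoreCase(toHex(key));
    }

    public static void logTombstone(ByteBuffer key, String sstable,
                                    long markedForDeleteAt, int localDeletionTime)
    {
        if (matches(key))
            logger.info("tombstone for {} in {}: markedForDeleteAt={} localDeletionTime={}",
                        toHex(key), sstable, markedForDeleteAt, localDeletionTime);
    }

    private static String toHex(ByteBuffer bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = bytes.position(); i < bytes.limit(); i++)
            sb.append(String.format("%02x", bytes.get(i)));
        return sb.toString();
    }
}
{code}
The compaction/merge code paths could then call logTombstone unconditionally and only the configured key would ever produce output.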
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498237#comment-14498237 ] Roman Tkachenko commented on CASSANDRA-9045: Any news guys? Thanks.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487655#comment-14487655 ] Roman Tkachenko commented on CASSANDRA-9045: Thanks guys, I appreciate it. Let me know if I can help, I'm open to any ideas / suggestions at this point.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487644#comment-14487644 ] Tyler Hobbs commented on CASSANDRA-9045: We're still attempting to reproduce this. So far [~philipthompson] and I haven't had any luck, but I'm trying a few different things.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486239#comment-14486239 ] Roman Tkachenko commented on CASSANDRA-9045: Any luck / further suggestions on how to work around this?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395248#comment-14395248 ] Tyler Hobbs commented on CASSANDRA-9045: Okay, so 44797 _was_ created by a compaction that finished shortly after the first request. Hmm. I think we're going to have to try harder to reproduce this, because adding debug logging around compaction probably isn't feasible here.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395138#comment-14395138 ] Roman Tkachenko commented on CASSANDRA-9045: These are the log lines I found that mention this SSTable: {code} INFO [CompactionExecutor:96] 2015-04-03 00:12:51,256 CompactionTask.java (line 296) Compacted 38 sstables to [/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44691,/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44797,/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44838,/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-44917,/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45038,/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45076,]. 2,024,901,266 bytes to 1,649,830,502 (~81% of original) in 262,455ms = 5.994936MB/s. 11,277 total partitions merged to 10,647. Partition merge counts were {1:10108, 2:557, 3:6, 7:2, 10:1, 13:1, } INFO [CompactionExecutor:153] 2015-04-03 00:26:51,990 CompactionTask.java (line 120) Compacting [SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45038-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45172-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45165-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45181-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45152-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44797-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44838-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-44917-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45164-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-44691-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45169-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45171-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45163-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45161-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45159-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45180-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45156-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45179-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45176-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45160-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45167-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45173-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45170-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45174-Data.db'), 
SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45076-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45157-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45158-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45168-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45162-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45175-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45166-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45177-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45155-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45178-Data.db'), SSTableReader(path='/var/mailgun/sstables1/blackbook/bounces/blackbook-bounces-jb-45182-Data.db'), SSTableReader(path='/var/mailgun/sstables2/blackbook/bounces/blackbook-bounces-jb-45154-Data.db'), SSTableReader(path='/var/mailgun/sstables3/blackbook/bounces/blackbook-bounces-jb-45153-Data.db')] INFO [ValidationExecutor:5] 2015-04-03 00:30:42,861 SSTableReader.java (line 223) Opening /var/mai
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395113#comment-14395113 ] Tyler Hobbs commented on CASSANDRA-9045: Unfortunately, even with the extra tracing details, I'm still not sure what's going wrong. The strangest thing about the traces is the set of sstables that {{173.203.37.77}} uses. Before the repair, it reads from 44653, 43129, and 17876. After the repair, it reads from 44797 and 17876. What's interesting is that 44797 existed _before_ the repair (based on higher-generation bloom-filter skip entries in the first trace). I'm not sure why it didn't read from that sstable before the repair. [~r0mant] can you check the logs on that node to verify that the 44797 sstable existed before the first request? (Check for compaction logs that show it being created after 00:11:09.)
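If the compaction log has already rolled over, the filesystem timestamp on the Data.db component is a rough stand-in for when a generation was written. A small stand-alone sketch (the default directory comes from the paths in the log excerpts on this ticket; pass another directory as the first argument):
{code}
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Print the mtime of each Data.db file so we can see roughly when a
// generation such as jb-44797 first appeared on disk.
public class SSTableAge
{
    public static void main(String[] args) throws IOException
    {
        Path dir = Paths.get(args.length > 0 ? args[0]
                                             : "/var/mailgun/sstables1/blackbook/bounces");
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, "blackbook-bounces-jb-*-Data.db"))
        {
            for (Path p : stream)
                System.out.println(Files.getLastModifiedTime(p) + "  " + p.getFileName());
        }
    }
}
{code}
The mtime is only a rough proxy, so the compaction log entries remain the better evidence where they are still available.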
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393824#comment-14393824 ] Roman Tkachenko commented on CASSANDRA-9045: I ran "nodetool repair -pr blackbook bounces" on a couple of nodes.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393812#comment-14393812 ] Tyler Hobbs commented on CASSANDRA-9045: Thanks! What repair operation did you run in between the queries?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393803#comment-14393803 ] Roman Tkachenko commented on CASSANDRA-9045: I couldn't reproduce the "inconsistency.txt" issue but did reproduce the originally reported one. Take a look at "debug.txt": after repair, the tombstone on 173.203.37.77 is gone.
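That matches the symptom: once the tombstone is missing (or carries an older timestamp) there is nothing left to shadow the old cell, so it becomes readable again and repair streams it back to the replicas that had correctly deleted it. A toy model of the shadowing rule, not Cassandra's actual Column/DeletionInfo code:
{code}
import java.util.Optional;

public final class ReconcileModel
{
    static final class Cell
    {
        final String value;
        final long timestamp;
        Cell(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
        public String toString() { return value + "@" + timestamp; }
    }

    // A cell stays hidden only while a tombstone with timestamp >= the cell's
    // write timestamp is present; lose the tombstone and the cell is visible again.
    static Optional<Cell> visible(Cell cell, Optional<Long> tombstoneTimestamp)
    {
        if (tombstoneTimestamp.isPresent() && tombstoneTimestamp.get() >= cell.timestamp)
            return Optional.empty();   // deletion shadows the cell (ties go to the tombstone)
        return Optional.of(cell);
    }

    public static void main(String[] args)
    {
        Cell bounce = new Cell("bounce entry for al...@example.com", 1000L);
        System.out.println(visible(bounce, Optional.of(2000L))); // Optional.empty: deleted
        System.out.println(visible(bounce, Optional.empty()));   // cell is back: "resurrected"
    }
}
{code}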
> Finally, I noticed these log entries for the row in question: > {code} > INFO [ValidationExecutor:77] 2015-03-25 20:27:43,936 > CompactionController.java (line 192) Compacting large row > blackbook/bounces:4ed558feba8a483733001d6a (279067683 bytes) incrementally > {code} > Figuring it may be related I bumped "in_memory_compaction_limit_in_mb" to > 512MB so the row fits into it, deleted the entry and ran repair once again. > The log entry for this row was gone and the columns didn't reappear. > We have a lot of rows much larger than 512MB so can't increase this > parameters forever, if that is the issue. > Please let me know if you need more information on the case or if I can run > more experiments. > Thanks! > Roman -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393627#comment-14393627 ] Tyler Hobbs commented on CASSANDRA-9045: bq. Would you be able to provide a binary that I could use as a drop-in replacement? I can do that if you need me to, although I can also provide simple instructions for building the jar. bq. Also, will I need to replace it on all nodes in the cluster? One node would be enough if you can reproduce the behavior like {{inconsistency.txt}} on a key that it's a replica for.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393596#comment-14393596 ] Roman Tkachenko commented on CASSANDRA-9045: We can try that. Would you be able to provide a binary that I could use as a drop-in replacement? Also, will I need to replace it on all nodes in the cluster?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393588#comment-14393588 ] Tyler Hobbs commented on CASSANDRA-9045: The digest mismatch indicates that the replicas returned different data to the coordinator. In your case (the cqlsh.txt traces), one replica returned a tombstone and the rest returned live data.
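For context, a "digest read" means the coordinator asks one replica for the full data and the remaining replicas only for a hash of their response; if any hash differs, the coordinator falls back to full data reads and reconciles. A minimal sketch of that comparison logic, purely illustrative (this is not Cassandra's actual code; all class and method names here are invented):
{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Illustration of why a "digest mismatch" shows up in traces: the coordinator
// compares hashes of the replica responses and, if any differ (e.g. one replica
// still has the tombstone while the others return live data), it falls back to
// requesting full data and reconciling via read repair.
public class DigestReadSketch {

    static byte[] digestOf(String replicaResponse) throws Exception {
        // Cassandra hashes the serialized response; here we simply hash a string.
        return MessageDigest.getInstance("MD5")
                .digest(replicaResponse.getBytes(StandardCharsets.UTF_8));
    }

    static boolean digestsMatch(List<String> replicaResponses) throws Exception {
        byte[] first = digestOf(replicaResponses.get(0));
        for (String response : replicaResponses.subList(1, replicaResponses.size())) {
            if (!Arrays.equals(first, digestOf(response))) {
                return false; // mismatch -> full data read + read repair
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        // One replica answers with a tombstone, two answer with live data.
        List<String> responses = Arrays.asList("tombstone@ts=100", "live@ts=90", "live@ts=90");
        System.out.println("digests match: " + digestsMatch(responses)); // false -> "Digest mismatch"
    }
}
{code}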
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393568#comment-14393568 ] Roman Tkachenko commented on CASSANDRA-9045: Also, what is the "digest mismatch" that I'm getting in some tracing query logs? Could it be the reason for this weird behavior?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393564#comment-14393564 ] Roman Tkachenko commented on CASSANDRA-9045: Yeah, I understand "1 vs 3". What I'm confused about is "live vs tombstoned" because there were no deletes for this record within this 20-second interval.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393552#comment-14393552 ] Tyler Hobbs commented on CASSANDRA-9045: [~r0mant] since we're having no luck reproducing the issue, would you be willing to deploy a patched version of 2.0.13 with additional tracing entries if we create a patch?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393501#comment-14393501 ] Tyler Hobbs commented on CASSANDRA-9045: The difference in the cell counts in the trace (1 live vs 3 tombstoned) is due to the way that we count the cells internally. (Specifically, in {{ColumnCounter.GroupByPrefix.count()}}, we always increment {{ignored}} when we see a tombstoned cell without bothering to group by prefix.) Basically, there are three cells in the row, and when they're all live, they get counted as a single "cell" (in terms of the tracing message). So the trace messages are consistent with the entire row being live and then the entire row being deleted. The same sstables are being read for both queries, so the only thing that could have changed is the memtable contents. (This is different from the traces in cqlsh.txt, where all of the nodes had some kind of flush and compaction in between the reads, resulting in a change to the newest sstable that gets read.) Is it possible that there were deletes to that partition in between the two reads in inconsistency.txt?
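Put differently, live cells are grouped by their CQL-row prefix while tombstoned cells are all counted individually, so the same three physical cells can legitimately report as "1 live" before the delete and "3 tombstoned" after it. A rough paraphrase of that counting behaviour (illustration only, not the real {{ColumnCounter.GroupByPrefix}} implementation):
{code}
import java.util.HashSet;
import java.util.Set;

// Paraphrase of the counting behaviour described above: live cells are grouped
// by clustering prefix (so the three cells of one CQL row count as a single
// "live" unit), while every tombstoned cell is added to the "ignored" count
// without any grouping.
public class CellCounterSketch {
    int live;     // number of distinct live CQL rows seen
    int ignored;  // number of tombstoned cells seen (not grouped)

    private final Set<String> seenPrefixes = new HashSet<>();

    void count(String clusteringPrefix, boolean isLive) {
        if (!isLive) {
            ignored++;                 // every tombstoned cell is counted
            return;
        }
        if (seenPrefixes.add(clusteringPrefix)) {
            live++;                    // only the first live cell of a row counts
        }
    }

    public static void main(String[] args) {
        CellCounterSketch before = new CellCounterSketch();
        for (int i = 0; i < 3; i++) before.count("al...@example.com", true);
        System.out.println(before.live + " live, " + before.ignored + " tombstoned"); // 1 live, 0 tombstoned

        CellCounterSketch after = new CellCounterSketch();
        for (int i = 0; i < 3; i++) after.count("al...@example.com", false);
        System.out.println(after.live + " live, " + after.ignored + " tombstoned");   // 0 live, 3 tombstoned
    }
}
{code}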
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393337#comment-14393337 ] Roman Tkachenko commented on CASSANDRA-9045: Okay, thanks for letting me know! Nope, there were no compactions around that time. I ran repair on this node earlier this morning but the queries were performed some time after it was done.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393277#comment-14393277 ] Tyler Hobbs commented on CASSANDRA-9045: [~r0mant] we're currently working on reproducing the issue. Thanks for the additional info! That's pretty odd. I presume that there were no compactions for that table on 173.203.37.151 around the time of those queries?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393231#comment-14393231 ] Roman Tkachenko commented on CASSANDRA-9045: Also, check out the attached "inconsistency.txt" file. A request was issued twice, several seconds apart, and the same server says "Read 1 live and 0 tombstoned cells" on the first run and "Read 0 live and 3 tombstoned cells" on the second run. How could that happen? I'm also getting inconsistent results (for this record and some others) even with LOCAL_QUORUM when using our production app: one query returns the record, the next one does not.
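As an aside, the per-replica "Read N live and M tombstoned cells" and "Digest mismatch" trace messages can also be captured programmatically by enabling tracing on the statement rather than using cqlsh. A minimal sketch with the DataStax Java driver (3.x-style API; the keyspace, table and key are the ones from this ticket, the contact point is a placeholder):
{code}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class TraceRead {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("blackbook")) {

            SimpleStatement stmt = new SimpleStatement(
                    "SELECT * FROM bounces WHERE domainid = 'domain.com' AND address = 'al...@example.com'");
            stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            stmt.enableTracing();

            ResultSet rs = session.execute(stmt);
            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            for (QueryTrace.Event event : trace.getEvents()) {
                // Look for the "Read N live and M tombstoned cells" and
                // "Digest mismatch" entries, keyed by the replica address.
                System.out.println(event.getSource() + ": " + event.getDescription());
            }
        }
    }
}
{code}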
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393075#comment-14393075 ] Roman Tkachenko commented on CASSANDRA-9045: Hey guys, any luck figuring out what might be happening here? We're getting hammered by our own customers about it... Thanks.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388845#comment-14388845 ] Philip Thompson commented on CASSANDRA-9045: Nope, no branch. I'll turn my scripts into a real test, and add the compaction step and more deletes to see what I can get to happen.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388840#comment-14388840 ] Tyler Hobbs commented on CASSANDRA-9045: Do you have a branch with the test somewhere? I believe you need to run a compaction after the "delete" step. You should perhaps also randomly delete a lot of rows (say, 10k) instead of just one (if that's what you're doing). I don't believe the repair step should be necessary to repro, if it's failing in the way I think it is.
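For anyone scripting that, a minimal sketch of the suggested delete-then-compact sequence against the schema from the report (keyspace/table names come from the report; the addresses, the localhost cqlsh connection, and treating this as a repro recipe are all assumptions):

{code}
# delete a batch of entries (cqlsh connects to the local node by default)
cqlsh <<'EOF'
CONSISTENCY ALL;
DELETE FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'user1@example.com';
DELETE FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'user2@example.com';
EOF

# push the tombstones to disk and force a compaction of the table
nodetool flush blackbook bounces
nodetool compact blackbook bounces

# read back at CL=ALL; any resurrected entries would show up here
cqlsh <<'EOF'
CONSISTENCY ALL;
SELECT address, "timestamp" FROM blackbook.bounces WHERE domainid = 'domain.com' LIMIT 20;
EOF
{code}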
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387265#comment-14387265 ] Philip Thompson commented on CASSANDRA-9045: No, I absolutely didn't mean columns. I'm using CQL terminology here, not thrift.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387262#comment-14387262 ] Roman Tkachenko commented on CASSANDRA-9045: When you say "rows" do you actually mean "columns"? :) Did you guys see my comment about this affecting only certain columns and the "digest mismatch" thing? Can it be related?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386955#comment-14386955 ] Philip Thompson commented on CASSANDRA-9045: I am still unable to reproduce with the following workflow:
* Create a multi-DC cluster with in_memory_compaction_limit_in_mb = 1
* Set up a keyspace using NTS
* Create the same table as the reporter
* Write 5M rows to a single partition
* Flush
* Select a row from the partition
* Delete the row
* Repair on all nodes
* Select the row
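A rough scripted version of that workflow, assuming a cluster with the reporter's keyspace and table is already up and cqlsh/nodetool run against a local node (the row count here is reduced for illustration, and the specific addresses are placeholders):

{code}
# write many rows into a single partition (5M in the real test; fewer here)
for i in $(seq 1 100000); do
  echo "INSERT INTO blackbook.bounces (domainid, address, message, \"timestamp\") VALUES ('domain.com', 'user${i}@example.com', 'bounce', ${i});"
done | cqlsh

nodetool flush blackbook bounces

# delete one row from the partition
echo "DELETE FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'user42@example.com';" | cqlsh

# repair (repeat on every node owning the partition), then check the row again
nodetool repair blackbook bounces
cqlsh <<'EOF'
CONSISTENCY ALL;
SELECT * FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'user42@example.com';
EOF
{code}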
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386805#comment-14386805 ] Philip Thompson commented on CASSANDRA-9045: This patch doesn't apply to 2.0
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386271#comment-14386271 ] Marcus Eriksson commented on CASSANDRA-9045: [~philipthompson] yes, do that, with this patch you can set it to 0 even: http://aep.appspot.com/display/wSaOmJhJ6IGh0NYSe8-gY0sM4Yg/
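For reference, the limit being discussed is a per-node setting in cassandra.yaml; a sketch of checking and lowering it is below (the file path is the common package location and may differ per install, and a value of 0 is only meaningful with the patch linked above):

{code}
# current value on a node (the 2.0.x yaml ships with 64 by default)
grep in_memory_compaction_limit_in_mb /etc/cassandra/conf/cassandra.yaml

# lower it for the repro attempt, then restart the node
sudo sed -i 's/^in_memory_compaction_limit_in_mb:.*/in_memory_compaction_limit_in_mb: 1/' /etc/cassandra/conf/cassandra.yaml
{code}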
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384582#comment-14384582 ] Philip Thompson commented on CASSANDRA-9045: I wasn't able to reproduce with a partition containing 5M rows. Should I lower the in memory compaction limit and try again?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384118#comment-14384118 ] Roman Tkachenko commented on CASSANDRA-9045: And it also does not explain why those original "zombie" columns were finally purged when I increased the in-memory compaction limit setting.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384107#comment-14384107 ] Roman Tkachenko commented on CASSANDRA-9045: Yep. I triple-checked that clocks are synchronized. I also checked multiple times that the schema version is the same on all nodes in the cluster. Were you able to reproduce the issue? FWIW I did some more research and it *seems* like it affects only certain columns in the row. For example, in yesterday's test (the one I attached cqlsh output from) I also removed one more column and it did not reappear after repair. In the tracing logs for it I did not see the "digest mismatch" thing, unlike the other one that did reappear. Not sure if it's related at all.
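A quick way to sanity-check both points on each node (assuming ntpd is the time daemon in use):

{code}
# all nodes should report a single schema version here
nodetool describecluster

# NTP peer offsets should be within a few milliseconds
ntpq -p
{code}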
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14384030#comment-14384030 ] Marcus Eriksson commented on CASSANDRA-9045: I'm grasping at straws here, but have you made sure that the clocks are synced on all the nodes? Are the nodes agreeing on schemas?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383926#comment-14383926 ] Roman Tkachenko commented on CASSANDRA-9045: I did: INFO [ValidationExecutor:8] 2015-03-26 18:53:41,404 CompactionController.java (line 192) Compacting large row blackbook/bounces:4ed558feba8a483733001d6a (279555898 bytes) incrementally
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383541#comment-14383541 ] Marcus Eriksson commented on CASSANDRA-9045: [~r0mant] did you see the "Compacting large row" message for the row you deleted in cqlsh.txt between 18:07 and 19:39?
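One way to check is to grep the node's system log for that row key within the window (the default package log location is assumed; the awk filter compares the timestamp field lexically, so the bound 19:40 covers the whole 19:39 minute):

{code}
grep 'Compacting large row blackbook/bounces:4ed558feba8a483733001d6a' /var/log/cassandra/system.log \
  | awk '$4 >= "18:07" && $4 <= "19:40"'
{code}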
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382532#comment-14382532 ] Philip Thompson commented on CASSANDRA-9045: [~thobbs], this will be most meaningful to you. The digest mismatch seems interesting to me; how could that happen at CL=ALL for all operations?
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382534#comment-14382534 ] Roman Tkachenko commented on CASSANDRA-9045: Forgot to mention that before the test I restored the original "in memory compaction limit" to the default 64MB so the row does not fit into this limit.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382531#comment-14382531 ] Roman Tkachenko commented on CASSANDRA-9045: I have attached an excerpt from a cqlsh session showing "select -> delete -> select -> repair -> select" with tracing on. The very last select was issued after repair was done.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382365#comment-14382365 ] Tyler Hobbs commented on CASSANDRA-9045: It sounds to me like the incremental compaction is not processing range tombstones correctly, and it's purging the tombstone without purging the shadowed data. It also sounds like the range tombstone is being dropped before gc_grace has passed, so something is going pretty wrong. It seems like we should be able to reproduce this with a similar schema and similar deletes on a row that's above the in-memory compaction threshold.
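One possible way to drive that reproduction end to end (a rough sketch only: it assumes a disposable test cluster with the blackbook schema already created, the loader script name is hypothetical and stands in for whatever writes enough entries under one domainid to exceed in_memory_compaction_limit_in_mb, the domain/address values are placeholders, and CONSISTENCY is a cqlsh shell command, so if a given cqlsh version ignores it on stdin the same statements can be run interactively):

{code}
# 1. Build one partition that is larger than the in-memory compaction limit.
./load_wide_partition.sh big.example.com        # hypothetical helper, not a real tool

# 2. Delete one entry and confirm it is gone at CL=ALL.
cqlsh <<'CQL'
CONSISTENCY ALL;
DELETE FROM blackbook.bounces WHERE domainid = 'big.example.com' AND address = 'victim@example.com';
SELECT * FROM blackbook.bounces WHERE domainid = 'big.example.com' AND address = 'victim@example.com';
CQL

# 3. Repair the replicas that own the partition, then re-check at CL=ALL.
nodetool repair blackbook bounces
cqlsh <<'CQL'
CONSISTENCY ALL;
SELECT * FROM blackbook.bounces WHERE domainid = 'big.example.com' AND address = 'victim@example.com';
CQL
{code}

If the hypothesis is right, the last select should return the deleted entry again only when the partition is over the in-memory limit.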
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382359#comment-14382359 ] Philip Thompson commented on CASSANDRA-9045: After discussion with [~thobbs], this seems like a problem with incremental compaction. Assigning to [~krummas].
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382330#comment-14382330 ] Roman Tkachenko commented on CASSANDRA-9045: I'll run the test and try to get them to you. Not so sure about the logs though. I've enabled DEBUG and the node hasn't finished starting yet but has already produced ~1GB of logs. If you know how to enable debug mode just for repair/compaction components, let me know.
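For what it's worth, 2.0.x configures logging through conf/log4j-server.properties, so DEBUG can usually be scoped to individual packages instead of the root logger. A sketch only: the config path assumes a packaged install and the package names follow the 2.0 source layout, so adjust both to your build, and restart the node so log4j re-reads the file:

{code}
# Turn DEBUG on only for the compaction and repair code paths,
# leaving the root logger at INFO.
cat >> /etc/cassandra/conf/log4j-server.properties <<'EOF'
log4j.logger.org.apache.cassandra.db.compaction=DEBUG
log4j.logger.org.apache.cassandra.repair=DEBUG
EOF
{code}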
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382304#comment-14382304 ] Philip Thompson commented on CASSANDRA-9045: I'm very interested in the cqlsh traces for the delete and select queries. It doesn't seem like a repair issue, so I'm unassigning Yuki.
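In case it saves a round trip, the traces can also be captured non-interactively (a minimal sketch: the key and address values are placeholders, and if a particular cqlsh version does not honour TRACING/CONSISTENCY on stdin, running the same statements at an interactive cqlsh prompt produces the same output):

{code}
# Trace the delete and the follow-up select at CL=ALL and keep a copy of the output.
cqlsh <<'CQL' | tee delete_select_trace.txt
TRACING ON;
CONSISTENCY ALL;
DELETE FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'someone@example.com';
SELECT * FROM blackbook.bounces WHERE domainid = 'domain.com' AND address = 'someone@example.com';
CQL
{code}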
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382301#comment-14382301 ] Roman Tkachenko commented on CASSANDRA-9045: Repairs are definitely within gc_grace, which is 10 days. A repair of a single node (nodetool repair blackbook bounce) takes about 1.5 hours.
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382298#comment-14382298 ] Roman Tkachenko commented on CASSANDRA-9045: Hi Philip - thanks for the quick response. Yes, normally the delete is LOCAL_QUORUM, but in my tests I was using ALL as well, with the same results. Let me see if I can enable DEBUG logging and run repair again. That's gonna be a lot of logs, I imagine...
[jira] [Commented] (CASSANDRA-9045) Deleted columns are resurrected after repair in wide rows
[ https://issues.apache.org/jira/browse/CASSANDRA-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382278#comment-14382278 ] Philip Thompson commented on CASSANDRA-9045: What CL are you deleting at? Can you attach a system log of a node undergoing the repair? Possibly at DEBUG?