[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346911#comment-15346911 ]

Philip Thompson edited comment on CASSANDRA-7019 at 6/23/16 6:22 PM:
---------------------------------------------------------------------

Here is the relevant portion of the stress outputs:

CONTROL:
{code}
Results:
Op rate                   : 21,050 op/s  [columndelete: 1,403 op/s, delete: 702 op/s, insert: 10,524 op/s, read: 7,017 op/s, rowdelete: 1,404 op/s]
Partition rate            : 17,540 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,524 pk/s, read: 7,016 pk/s, rowdelete: 0 pk/s]
Row rate                  : 43,872 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 26,324 row/s, read: 17,548 row/s, rowdelete: 0 row/s]
Latency mean              : 2.4 ms [columndelete: 2.2 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.7 ms, rowdelete: 2.1 ms]
Latency median            : 2.0 ms [columndelete: 1.8 ms, delete: 1.8 ms, insert: 1.9 ms, read: 2.3 ms, rowdelete: 1.8 ms]
Latency 95th percentile   : 3.6 ms [columndelete: 3.3 ms, delete: 3.3 ms, insert: 3.4 ms, read: 4.1 ms, rowdelete: 3.3 ms]
Latency 99th percentile   : 5.5 ms [columndelete: 4.8 ms, delete: 4.8 ms, insert: 4.9 ms, read: 7.2 ms, rowdelete: 4.7 ms]
Latency 99.9th percentile : 75.2 ms [columndelete: 69.9 ms, delete: 69.9 ms, insert: 72.3 ms, read: 78.6 ms, rowdelete: 69.3 ms]
Latency max               : 1032.3 ms [columndelete: 1,004.5 ms, delete: 1,004.5 ms, insert: 1,031.8 ms, read: 1,032.3 ms, rowdelete: 1,003.5 ms]
Total partitions          : 378,840,394 [columndelete: 0, delete: 0, insert: 227,304,928, read: 151,535,466, rowdelete: 0]
Total errors              : 0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
Total GC count            : 13,090
Total GC memory           : 15900.717 GiB
Total GC time             : 998.3 seconds
Avg GC time               : 76.3 ms
StdDev GC time            : 16.1 ms
Total operation time      : 05:59:58
{code}

NONE:
{code}
Results:
Op rate                   : 20,729 op/s  [columndelete: 1,382 op/s, delete: 690 op/s, insert: 10,366 op/s, read: 6,909 op/s, rowdelete: 1,382 op/s]
Partition rate            : 17,274 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,366 pk/s, read: 6,908 pk/s, rowdelete: 0 pk/s]
Row rate                  : 43,206 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 25,929 row/s, read: 17,277 row/s, rowdelete: 0 row/s]
Latency mean              : 2.4 ms [columndelete: 2.1 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.9 ms, rowdelete: 2.1 ms]
Latency median            : 1.9 ms [columndelete: 1.7 ms, delete: 1.7 ms, insert: 1.8 ms, read: 2.2 ms, rowdelete: 1.7 ms]
Latency 95th percentile   : 3.3 ms [columndelete: 2.9 ms, delete: 2.9 ms, insert: 3.0 ms, read: 3.7 ms, rowdelete: 2.9 ms]
Latency 99th percentile   : 4.7 ms [columndelete: 4.1 ms, delete: 4.1 ms, insert: 4.2 ms, read: 6.0 ms, rowdelete: 4.1 ms]
Latency 99.9th percentile : 47.6 ms [columndelete: 12.0 ms, delete: 13.0 ms, insert: 14.7 ms, read: 67.9 ms, rowdelete: 13.4 ms]
Latency max               : 1055.6 ms [columndelete: 1,006.0 ms, delete: 1,004.4 ms, insert: 1,055.6 ms, read: 1,055.6 ms, rowdelete: 1,031.9 ms]
Total partitions          : 373,111,699 [columndelete: 0, delete: 0, insert: 223,905,059, read: 149,206,640, rowdelete: 0]
Total errors              : 0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
Total GC count            : 14,082
Total GC memory           : 17120.316 GiB
Total GC time             : 1,005.7 seconds
Avg GC time               : 71.4 ms
StdDev GC time            : 13.5 ms
Total operation time      : 06:00:00
{code}

ROW:
{code}
Results:
Op rate                   : 16,121 op/s  [columndelete: 1,075 op/s, delete: 538 op/s, insert: 8,061 op/s, read: 5,372 op/s, rowdelete: 1,074 op/s]
Partition rate            : 13,432 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 8,061 pk/s, read: 5,371 pk/s, rowdelete: 0 pk/s]
Row rate                  : 33,597 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 20,165 row/s, read: 13,433 row/s, rowdelete: 0 row/s]
Latency mean              : 3.1 ms [columndelete: 2.3 ms, delete: 2.3 ms, insert: 2.4 ms, read: 4.5 ms, rowdelete: 2.3 ms]
Latency median            : 2.3 ms [columndelete: 1.8 ms, delete: 1.7 ms, insert: 1.8 ms, read: 3.5 ms, rowdelete: 1.8 ms]
Latency 95th percentile   : 5.3 ms [columndelete: 3.8 ms, delete: 3.8 ms, insert: 3.8 ms, read: 6.8 ms, rowdelete: 3.8 ms]
Latency 99th percentile   : 8.4 ms [columndelete: 6.1 ms, delete: 6.2 ms, insert: 6.2 ms, read: 10.9 ms, rowdelete: 6.3 ms]
Latency 99.9th percentile : 51.4 ms [columndelete: 16.4 ms, delete: 15.7 ms, insert: 14.7 ms, read: 58.7 ms, rowdelete: 41.4 ms]
Latency max               : 1053.8 ms [columndelete: 1,003.5 ms, delete: 1,053.7 ms, insert
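As a quick side-by-side reading of the three runs, the overall "Op rate" figures reported above (CONTROL 21,050; NONE 20,729; ROW 16,121 op/s) work out to the relative throughput changes below. This is just arithmetic on the reported numbers, not part of the original comment:

```python
# Relative op-rate change of each run vs. the CONTROL run, using the
# overall "Op rate" figures from the cassandra-stress summaries above.
rates = {"CONTROL": 21050, "NONE": 20729, "ROW": 16121}

def pct_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100

for name in ("NONE", "ROW"):
    print(f"{name}: {pct_change(rates['CONTROL'], rates[name]):+.1f}% op/s vs CONTROL")
# NONE is roughly -1.5%, ROW roughly -23.4%
```

In other words, the NONE configuration is within noise of CONTROL, while the ROW run loses roughly a quarter of its throughput.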
[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346919#comment-15346919 ]

Philip Thompson edited comment on CASSANDRA-7019 at 6/23/16 7:01 PM:
---------------------------------------------------------------------

It does appear this needs to be rebased onto trunk, especially to run dtest, as CCM expects 3.8+ to have CDC. I've attempted to run dtest on this branch with an older CCM to compensate:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-dtest/7/

Also linking unit tests:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-testall/lastCompletedBuild/testReport/

was (Author: philipthompson):
It does appear this needs to be rebased onto trunk, especially to run dtest, as CCM expects 3.8+ to have CDC. I've attempted to run dtest on this branch with an older CCM to compensate:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-dtest/6/

Also linking unit tests:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-testall/lastCompletedBuild/testReport/

> Improve tombstone compactions
> -----------------------------
>
>                 Key: CASSANDRA-7019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Marcus Eriksson
>            Assignee: Branimir Lambov
>              Labels: compaction, fallout
>             Fix For: 3.x
>
>         Attachments: 7019-2-system.log, 7019-debug.log, cell.tar.gz, control.tar.gz, none.tar.gz, row.tar.gz, temp-plot.html
>
>
> When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those compactions to be able to actually drop the tombstones. Might only be doable with LCS (with STCS we would probably end up including all sstables)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184862#comment-15184862 ]

Marcus Eriksson edited comment on CASSANDRA-7019 at 3/8/16 12:47 PM:
---------------------------------------------------------------------

* the comment on Rows#removeShadowedCells needs updating
* in bde4c02cc858a900d43028b9a930e805ab232c27 there seem to be a few unrelated fixes (like the AbstractRow hashCode fix, for example); should we break them out into a separate ticket? (so that we get them in 3.0 as well)
* why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

I pushed a few small fixes [here|https://github.com/krummas/cassandra/commits/blambov/7019-with-nodetool-command] as well.

And I think we need to test these scenarios:
* how does nodetool garbagecollect work if there are 1000+ sstables?
* run a repair on a vnode cluster with 100+ GB (that usually creates a lot of sstables)

was (Author: krummas):
* comment on Rows#removeShadowedCells needs updating
* in bde4c02cc858a900d43028b9a930e805ab232c27 there seems to be a few unrelated fixes (like the AbstractRow hashCode fix for example), should we break them out in a separate ticket? (so that we get them in 3.0 as well)
* why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

I pushed a few small fixes [https://github.com/krummas/cassandra/commits/blambov/7019-with-nodetool-command|here] as well

And I think we need to test these scenarios;
* how does nodetool garbagecollect work if there are 1000+ sstables?
* run a repair on a vnode cluster with 100+GB (that usually creates a lot of sstables)

> Improve tombstone compactions
> -----------------------------
>
>                 Key: CASSANDRA-7019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Marcus Eriksson
>            Assignee: Branimir Lambov
>              Labels: compaction
>             Fix For: 3.x
>
>
> When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those compactions to be able to actually drop the tombstones. Might only be doable with LCS (with STCS we would probably end up including all sstables)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
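The trigger described in the quoted ticket summary (start a single-sstable compaction when an sstable's droppable-tombstone ratio crosses a threshold) can be sketched roughly as below. This is an illustrative model, not Cassandra's actual code; the class and field names are hypothetical stand-ins, and in real Cassandra the knob is the compaction subproperty `tombstone_threshold` with the ratio estimated from sstable metadata.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class SSTable:
    # Hypothetical stand-in for an sstable carrying an estimated ratio of
    # tombstones that are already droppable (i.e. past gc_grace_seconds).
    name: str
    droppable_tombstone_ratio: float

def pick_tombstone_compaction(sstables: Sequence[SSTable],
                              threshold: float = 0.2) -> Optional[SSTable]:
    """When no other compactions are pending, pick the sstable most worth a
    single-sstable tombstone compaction, or None if nothing crosses the
    threshold."""
    candidates = [s for s in sstables if s.droppable_tombstone_ratio > threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.droppable_tombstone_ratio)
```

The point of the ticket is that compacting that one sstable alone often cannot actually drop the tombstones, because the data they shadow lives in overlapping sstables; the proposed improvement is to pull those overlapping sstables into the same compaction.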
[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269022#comment-14269022 ]

Marcus Eriksson edited comment on CASSANDRA-7019 at 1/8/15 10:23 AM:
---------------------------------------------------------------------

Updated titles and reopened 7272 - this ticket is about improving the single-sstable tombstone compactions while 7272 is adding major compaction to LCS

was (Author: krummas):
Updated titles and reopened 7272 - this ticket is about improving the single-sstable tombstone compactions while 7019 is adding major compaction to LCS

> Improve tombstone compactions
> -----------------------------
>
>                 Key: CASSANDRA-7019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>              Labels: compaction
>             Fix For: 3.0
>
>
> When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those compactions to be able to actually drop the tombstones. Might only be doable with LCS (with STCS we would probably end up including all sstables)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions
[ https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307963#comment-14307963 ]

Björn Hegerfors edited comment on CASSANDRA-7019 at 2/5/15 8:58 PM:
--------------------------------------------------------------------

I posted a related ticket some time ago, CASSANDRA-8359. In particular, the side note at the end is essentially this ticket exactly, for DTCS. A solution to this ticket may or may not solve the main issue in that ticket, but that's a matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables should always target the _n_ oldest ones. The oldest one alone is practically guaranteed to overlap with any other SSTable, in terms of tokens. So picking the right SSTables for multi-tombstone compaction should be as easy as sorting by age (min timestamp), taking the oldest one, and including the newer ones in succession, checking at which point the tombstone ratio is the highest. Or something close to that, anyway. Then we might as well write them back as a single SSTable; I don't see why not.

EDIT: moved all of the below to CASSANDRA-7272, where it belongs.

-As for the STCS case, I don't understand why major compaction for STCS isn't already optimal. I do see why one might want to compact some but not all SSTables in a multi-tombstone compaction (though DTCS should be a better fit for anyone wanting this). But if every single SSTable is being rewritten to disk, why not write them into one file? As far as I understand, the ultimate goal of STCS is to be one SSTable. STCS only gets there, the natural way, once in a blue moon. But that's the most optimal state that it can be in. Am I wrong?-

-The only explanation I can see for splitting the result of compacting all SSTables into fragments, is if those fragments are:-
-1. Partitioned smartly. For example into separate token ranges (à la LCS), timestamp ranges (à la DTCS) or clustering column ranges (which would be interesting). Or a combination of these.-
-2. The structure upheld by the resulting fragments is not subsequently demolished by the running compaction strategy going on with its usual business.-

was (Author: bj0rn):
I posted a related ticket some time ago, CASSANDRA-8359. In particular, the side note at the end is essentially this ticket exactly, for DTCS. A solution to this ticket may or may not solve the main issue in that ticket, but that's a matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables should always target the _n_ oldest ones. The oldest one alone is practically guaranteed to overlap with any other SSTable, in terms of tokens. So picking the right SSTables for multi-tombstone compaction should be as easy as sorting by age (min timestamp), taking the oldest one, and including the newer ones in succession, checking at which point the tombstone ratio is the highest. Or something close to that, anyway. Then we might as well write them back as a single SSTable; I don't see why not.

As for the STCS case, I don't understand why major compaction for STCS isn't already optimal. I do see why one might want to compact some but not all SSTables in a multi-tombstone compaction (though DTCS should be a better fit for anyone wanting this). But if every single SSTable is being rewritten to disk, why not write them into one file? As far as I understand, the ultimate goal of STCS is to be one SSTable. STCS only gets there, the natural way, once in a blue moon. But that's the most optimal state that it can be in. Am I wrong?

The only explanation I can see for splitting the result of compacting all SSTables into fragments, is if those fragments are:
1. Partitioned smartly. For example into separate token ranges (à la LCS), timestamp ranges (à la DTCS) or clustering column ranges (which would be interesting). Or a combination of these.
2. The structure upheld by the resulting fragments is not subsequently demolished by the running compaction strategy going on with its usual business.

> Improve tombstone compactions
> -----------------------------
>
>                 Key: CASSANDRA-7019
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Assignee: Branimir Lambov
>              Labels: compaction
>             Fix For: 3.0
>
>
> When there are no other compactions to do, we trigger a single-sstable compaction if there is more than X% droppable tombstones in the sstab
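The selection heuristic Björn describes (sort by min timestamp, start from the oldest sstable, and extend the candidate set to the point where the combined tombstone ratio peaks) might look roughly like this. The sstable fields and the ratio bookkeeping are hypothetical stand-ins, not DTCS internals:

```python
from types import SimpleNamespace

def pick_oldest_prefix(sstables):
    """Sort sstables oldest-first by min timestamp, then keep the prefix of
    that ordering whose combined droppable-tombstone ratio is highest."""
    by_age = sorted(sstables, key=lambda s: s.min_timestamp)
    best, best_ratio = [], 0.0
    tombstones = cells = 0
    prefix = []
    for s in by_age:
        prefix.append(s)
        tombstones += s.droppable_tombstones
        cells += s.cells
        ratio = tombstones / cells if cells else 0.0
        if ratio > best_ratio:
            best, best_ratio = list(prefix), ratio
    return best, best_ratio

# Hypothetical DTCS-style windows: the oldest window is mostly tombstones.
old = SimpleNamespace(min_timestamp=1, droppable_tombstones=80, cells=100)
mid = SimpleNamespace(min_timestamp=2, droppable_tombstones=30, cells=100)
new = SimpleNamespace(min_timestamp=3, droppable_tombstones=0, cells=100)
chosen, ratio = pick_oldest_prefix([new, old, mid])  # chooses just [old], ratio 0.8
```

Because the candidate set always grows from the oldest sstable, this matches the comment's observation that the oldest sstable is almost guaranteed to token-overlap everything newer, so the chosen prefix can be rewritten as a single sstable.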