[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2016-06-23 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346911#comment-15346911
 ] 

Philip Thompson edited comment on CASSANDRA-7019 at 6/23/16 6:22 PM:
-

Here is the relevant portion of the stress outputs:

CONTROL:
{code}
Results:
Op rate                   :   21,050 op/s  [columndelete: 1,403 op/s, delete: 702 op/s, insert: 10,524 op/s, read: 7,017 op/s, rowdelete: 1,404 op/s]
Partition rate            :   17,540 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,524 pk/s, read: 7,016 pk/s, rowdelete: 0 pk/s]
Row rate                  :   43,872 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 26,324 row/s, read: 17,548 row/s, rowdelete: 0 row/s]
Latency mean              :    2.4 ms [columndelete: 2.2 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.7 ms, rowdelete: 2.1 ms]
Latency median            :    2.0 ms [columndelete: 1.8 ms, delete: 1.8 ms, insert: 1.9 ms, read: 2.3 ms, rowdelete: 1.8 ms]
Latency 95th percentile   :    3.6 ms [columndelete: 3.3 ms, delete: 3.3 ms, insert: 3.4 ms, read: 4.1 ms, rowdelete: 3.3 ms]
Latency 99th percentile   :    5.5 ms [columndelete: 4.8 ms, delete: 4.8 ms, insert: 4.9 ms, read: 7.2 ms, rowdelete: 4.7 ms]
Latency 99.9th percentile :   75.2 ms [columndelete: 69.9 ms, delete: 69.9 ms, insert: 72.3 ms, read: 78.6 ms, rowdelete: 69.3 ms]
Latency max               : 1032.3 ms [columndelete: 1,004.5 ms, delete: 1,004.5 ms, insert: 1,031.8 ms, read: 1,032.3 ms, rowdelete: 1,003.5 ms]
Total partitions          : 378,840,394 [columndelete: 0, delete: 0, insert: 227,304,928, read: 151,535,466, rowdelete: 0]
Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
Total GC count            : 13,090
Total GC memory           : 15900.717 GiB
Total GC time             :  998.3 seconds
Avg GC time               :   76.3 ms
StdDev GC time            :   16.1 ms
Total operation time      : 05:59:58
{code}

NONE:
{code}
Results:
Op rate                   :   20,729 op/s  [columndelete: 1,382 op/s, delete: 690 op/s, insert: 10,366 op/s, read: 6,909 op/s, rowdelete: 1,382 op/s]
Partition rate            :   17,274 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 10,366 pk/s, read: 6,908 pk/s, rowdelete: 0 pk/s]
Row rate                  :   43,206 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 25,929 row/s, read: 17,277 row/s, rowdelete: 0 row/s]
Latency mean              :    2.4 ms [columndelete: 2.1 ms, delete: 2.1 ms, insert: 2.2 ms, read: 2.9 ms, rowdelete: 2.1 ms]
Latency median            :    1.9 ms [columndelete: 1.7 ms, delete: 1.7 ms, insert: 1.8 ms, read: 2.2 ms, rowdelete: 1.7 ms]
Latency 95th percentile   :    3.3 ms [columndelete: 2.9 ms, delete: 2.9 ms, insert: 3.0 ms, read: 3.7 ms, rowdelete: 2.9 ms]
Latency 99th percentile   :    4.7 ms [columndelete: 4.1 ms, delete: 4.1 ms, insert: 4.2 ms, read: 6.0 ms, rowdelete: 4.1 ms]
Latency 99.9th percentile :   47.6 ms [columndelete: 12.0 ms, delete: 13.0 ms, insert: 14.7 ms, read: 67.9 ms, rowdelete: 13.4 ms]
Latency max               : 1055.6 ms [columndelete: 1,006.0 ms, delete: 1,004.4 ms, insert: 1,055.6 ms, read: 1,055.6 ms, rowdelete: 1,031.9 ms]
Total partitions          : 373,111,699 [columndelete: 0, delete: 0, insert: 223,905,059, read: 149,206,640, rowdelete: 0]
Total errors              :          0 [columndelete: 0, delete: 0, insert: 0, read: 0, rowdelete: 0]
Total GC count            : 14,082
Total GC memory           : 17120.316 GiB
Total GC time             : 1,005.7 seconds
Avg GC time               :   71.4 ms
StdDev GC time            :   13.5 ms
Total operation time      : 06:00:00
{code}

ROW:
{code}
Results:
Op rate                   :   16,121 op/s  [columndelete: 1,075 op/s, delete: 538 op/s, insert: 8,061 op/s, read: 5,372 op/s, rowdelete: 1,074 op/s]
Partition rate            :   13,432 pk/s  [columndelete: 0 pk/s, delete: 0 pk/s, insert: 8,061 pk/s, read: 5,371 pk/s, rowdelete: 0 pk/s]
Row rate                  :   33,597 row/s [columndelete: 0 row/s, delete: 0 row/s, insert: 20,165 row/s, read: 13,433 row/s, rowdelete: 0 row/s]
Latency mean              :    3.1 ms [columndelete: 2.3 ms, delete: 2.3 ms, insert: 2.4 ms, read: 4.5 ms, rowdelete: 2.3 ms]
Latency median            :    2.3 ms [columndelete: 1.8 ms, delete: 1.7 ms, insert: 1.8 ms, read: 3.5 ms, rowdelete: 1.8 ms]
Latency 95th percentile   :    5.3 ms [columndelete: 3.8 ms, delete: 3.8 ms, insert: 3.8 ms, read: 6.8 ms, rowdelete: 3.8 ms]
Latency 99th percentile   :    8.4 ms [columndelete: 6.1 ms, delete: 6.2 ms, insert: 6.2 ms, read: 10.9 ms, rowdelete: 6.3 ms]
Latency 99.9th percentile :   51.4 ms [columndelete: 16.4 ms, delete: 15.7 ms, insert: 14.7 ms, read: 58.7 ms, rowdelete: 41.4 ms]
Latency max               : 1053.8 ms [columndelete: 1,003.5 ms, delete: 1,053.7 ms, insert


[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2016-06-23 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346919#comment-15346919
 ] 

Philip Thompson edited comment on CASSANDRA-7019 at 6/23/16 7:01 PM:
-

It does appear this needs to be rebased onto trunk, especially to run dtest, as 
CCM expects 3.8+ to have CDC. I've attempted to run dtest on this branch with 
an older CCM to compensate:

http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-dtest/7/

Also linking unit tests:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-testall/lastCompletedBuild/testReport/


was (Author: philipthompson):
It does appear this needs to be rebased onto trunk, especially to run dtest, as 
CCM expects 3.8+ to have CDC. I've attempted to run dtest on this branch with 
an older CCM to compensate:

http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-dtest/6/

Also linking unit tests:
http://cassci.datastax.com/view/Dev/view/blambov/job/blambov-7019-rebased-testall/lastCompletedBuild/testReport/

> Improve tombstone compactions
> -
>
> Key: CASSANDRA-7019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Marcus Eriksson
>Assignee: Branimir Lambov
>  Labels: compaction, fallout
> Fix For: 3.x
>
> Attachments: 7019-2-system.log, 7019-debug.log, cell.tar.gz, 
> control.tar.gz, none.tar.gz, row.tar.gz, temp-plot.html
>
>
> When there are no other compactions to do, we trigger a single-sstable 
> compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those 
> compactions to be able to actually drop the tombstones. Might only be doable 
> with LCS (with STCS we would probably end up including all sstables)
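
As an illustrative aside, the mechanism in the quoted description is easy to sketch. The 
following is a minimal, self-contained sketch and not Cassandra code: SSTableInfo, 
droppableTombstoneRatio and DROPPABLE_RATIO_THRESHOLD are hypothetical stand-ins for the 
existing single-sstable trigger ("more than X% droppable tombstones") plus this ticket's 
goal of pulling in the overlapping sstables so the tombstones can actually be dropped.

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an sstable's metadata (not a Cassandra class).
class SSTableInfo
{
    final String name;
    final long minToken, maxToken;        // token range covered by the sstable
    final double droppableTombstoneRatio; // estimated share of droppable tombstones

    SSTableInfo(String name, long minToken, long maxToken, double ratio)
    {
        this.name = name;
        this.minToken = minToken;
        this.maxToken = maxToken;
        this.droppableTombstoneRatio = ratio;
    }

    boolean overlaps(SSTableInfo other)
    {
        return minToken <= other.maxToken && other.minToken <= maxToken;
    }
}

public class TombstoneCompactionSketch
{
    // The "X%" threshold from the ticket description (value chosen arbitrarily here).
    static final double DROPPABLE_RATIO_THRESHOLD = 0.2;

    /**
     * If the candidate exceeds the droppable-tombstone threshold, return it together with
     * every sstable whose token range overlaps it, so that the tombstones and the data they
     * shadow end up in the same compaction and can actually be purged.
     */
    static List<SSTableInfo> tombstoneCompactionCandidates(SSTableInfo candidate, List<SSTableInfo> all)
    {
        List<SSTableInfo> selected = new ArrayList<>();
        if (candidate.droppableTombstoneRatio <= DROPPABLE_RATIO_THRESHOLD)
            return selected; // not tombstone-heavy enough to bother

        selected.add(candidate);
        for (SSTableInfo other : all)
            if (other != candidate && candidate.overlaps(other))
                selected.add(other);
        return selected;
    }

    public static void main(String[] args)
    {
        List<SSTableInfo> sstables = List.of(new SSTableInfo("a", 0, 100, 0.45),
                                             new SSTableInfo("b", 50, 150, 0.05),
                                             new SSTableInfo("c", 200, 300, 0.01));
        // "a" exceeds the threshold and overlaps "b", so both are selected; "c" is left alone.
        for (SSTableInfo s : tombstoneCompactionCandidates(sstables.get(0), sstables))
            System.out.println(s.name);
    }
}
{code}

Under STCS nearly every sstable overlaps every other one, so a selection like this tends 
to pull in the whole table, which is why the description suggests it may only be practical 
with LCS.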



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2016-03-08 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184862#comment-15184862
 ] 

Marcus Eriksson edited comment on CASSANDRA-7019 at 3/8/16 12:47 PM:
-

* the comment on Rows#removeShadowedCells needs updating
* in bde4c02cc858a900d43028b9a930e805ab232c27 there seem to be a few unrelated 
fixes (the AbstractRow hashCode fix, for example); should we break them out 
into a separate ticket? (so that we get them in 3.0 as well)
* why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

I pushed a few small fixes 
[here|https://github.com/krummas/cassandra/commits/blambov/7019-with-nodetool-command]
 as well

And I think we need to test these scenarios:
* how does nodetool garbagecollect work if there are 1000+ sstables?
* run a repair on a vnode cluster with 100+GB (that usually creates a lot of 
sstables)


was (Author: krummas):
* the comment on Rows#removeShadowedCells needs updating
* in bde4c02cc858a900d43028b9a930e805ab232c27 there seem to be a few unrelated 
fixes (the AbstractRow hashCode fix, for example); should we break them out 
into a separate ticket? (so that we get them in 3.0 as well)
* why the FBUtilities.closeAll change? (going from Iterable<..> to List<..>)

I pushed a few small fixes 
[https://github.com/krummas/cassandra/commits/blambov/7019-with-nodetool-command|here]
 as well

And I think we need to test these scenarios:
* how does nodetool garbagecollect work if there are 1000+ sstables?
* run a repair on a vnode cluster with 100+GB (that usually creates a lot of 
sstables)

> Improve tombstone compactions
> -
>
> Key: CASSANDRA-7019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Compaction
>Reporter: Marcus Eriksson
>Assignee: Branimir Lambov
>  Labels: compaction
> Fix For: 3.x
>
>
> When there are no other compactions to do, we trigger a single-sstable 
> compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those 
> compactions to be able to actually drop the tombstones. Might only be doable 
> with LCS (with STCS we would probably end up including all sstables)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2015-01-08 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269022#comment-14269022
 ] 

Marcus Eriksson edited comment on CASSANDRA-7019 at 1/8/15 10:23 AM:
-

Updated titles and reopened 7272 - this ticket is about improving the 
single-sstable tombstone compactions while 7272 is adding major compaction to 
LCS


was (Author: krummas):
Updated titles and reopened 7272 - this ticket is about improving the 
single-sstable tombstone compactions while 7019 is adding major compaction to 
LCS

> Improve tombstone compactions
> -
>
> Key: CASSANDRA-7019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>  Labels: compaction
> Fix For: 3.0
>
>
> When there are no other compactions to do, we trigger a single-sstable 
> compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those 
> compactions to be able to actually drop the tombstones. Might only be doable 
> with LCS (with STCS we would probably end up including all sstables)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2015-02-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307963#comment-14307963
 ] 

Björn Hegerfors edited comment on CASSANDRA-7019 at 2/5/15 8:58 PM:


I posted a related ticket some time ago, CASSANDRA-8359. In particular, the 
side note at the end is essentially this ticket exactly, for DTCS. A solution 
to this ticket may or may not solve the main issue in that ticket, but that's a 
matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have 
the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it 
seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables 
should always target the _n_ oldest ones. The oldest one alone is practically 
guaranteed to overlap with any other SSTable, in terms of tokens. So picking 
the right SSTables for multi-tombstone compaction should be as easy as sorting 
by age (min timestamp), taking the oldest one, and including the newer ones in 
succession, checking at which point the tombstone ratio is the highest. Or 
something close to that, anyway. Then we might as well write them back as a 
single SSTable, I don't see why not.
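
As an aside, that selection rule is simple enough to sketch directly. Below is a minimal, 
self-contained illustration, not DTCS or Cassandra internals (the SSTableStats class and 
its fields are hypothetical): sort by min timestamp, start from the oldest sstable, grow 
the candidate set one newer sstable at a time, and keep the prefix whose combined 
droppable-tombstone ratio is highest.

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sstable metadata used only for this sketch.
class SSTableStats
{
    final String name;
    final long minTimestamp;        // proxy for the sstable's age
    final long droppableTombstones; // estimated droppable tombstones
    final long liveCells;           // remaining live cells

    SSTableStats(String name, long minTimestamp, long droppableTombstones, long liveCells)
    {
        this.name = name;
        this.minTimestamp = minTimestamp;
        this.droppableTombstones = droppableTombstones;
        this.liveCells = liveCells;
    }
}

public class OldestFirstSelection
{
    /**
     * Sort by age (min timestamp), start with the oldest sstable, and include newer ones
     * in succession, keeping the prefix whose combined tombstone ratio is the highest.
     */
    static List<SSTableStats> selectForTombstoneCompaction(List<SSTableStats> sstables)
    {
        List<SSTableStats> byAge = new ArrayList<>(sstables);
        byAge.sort((a, b) -> Long.compare(a.minTimestamp, b.minTimestamp)); // oldest first

        long tombstones = 0, totalCells = 0;
        double bestRatio = -1;
        int bestLength = 0;
        for (int i = 0; i < byAge.size(); i++)
        {
            SSTableStats s = byAge.get(i);
            tombstones += s.droppableTombstones;
            totalCells += s.droppableTombstones + s.liveCells;
            double ratio = totalCells == 0 ? 0 : (double) tombstones / totalCells;
            if (ratio > bestRatio)
            {
                bestRatio = ratio;
                bestLength = i + 1;
            }
        }
        return byAge.subList(0, bestLength);
    }

    public static void main(String[] args)
    {
        List<SSTableStats> sstables = List.of(new SSTableStats("old",    1000, 900, 100),
                                              new SSTableStats("middle", 2000, 400, 600),
                                              new SSTableStats("new",    3000,  10, 990));
        // Picks the oldest-first prefix with the highest tombstone ratio: just "old" here.
        for (SSTableStats s : selectForTombstoneCompaction(sstables))
            System.out.println(s.name);
    }
}
{code}

Compacting the selected prefix in one pass and writing it back out is then exactly the 
"write them back as a single SSTable" step suggested above.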

EDIT: moved all of the below to CASSANDRA-7272, where it belongs.

-As for the STCS case, I don't understand why major compaction for STCS isn't 
already optimal. I do see why one might want to compact some but not all 
SSTables in a multi-tombstone compaction (though DTCS should be a better fit 
for anyone wanting this). But if every single SSTable is being rewritten to 
disk, why not write them into one file? As far as I understand, the ultimate 
goal of STCS is to be one SSTable. STCS only gets there, the natural way, once 
in a blue moon. But that's the most optimal state that it can be in. Am I 
wrong?-

-The only explanation I can see for splitting the result of compacting all 
SSTables into fragments, is if those fragments are:-
-1. Partitioned smartly. For example into separate token ranges (à la LCS), 
timestamp ranges (à la DTCS) or clustering column ranges (which would be 
interesting). Or a combination of these.-
-2. The structure upheld by the resulting fragments is not subsequently 
demolished by the running compaction strategy going on with its usual business.-


was (Author: bj0rn):
I posted a related ticket some time ago, CASSANDRA-8359. In particular, the 
side note at the end is essentially this ticket exactly, for DTCS. A solution 
to this ticket may or may not solve the main issue in that ticket, but that's a 
matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have 
the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it 
seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables 
should always target the _n_ oldest ones. The oldest one alone is practically 
guaranteed to overlap with any other SSTable, in terms of tokens. So picking 
the right SSTables for multi-tombstone compaction should be as easy as sorting 
by age (min timestamp), taking the oldest one, and including the newer ones in 
succession, checking at which point the tombstone ratio is the highest. Or 
something close to that, anyway. Then we might as well write them back as a 
single SSTable, I don't see why not.

As for the STCS case, I don't understand why major compaction for STCS isn't 
already optimal. I do see why one might want to compact some but not all 
SSTables in a multi-tombstone compaction (though DTCS should be a better fit 
for anyone wanting this). But if every single SSTable is being rewritten to 
disk, why not write them into one file? As far as I understand, the ultimate 
goal of STCS is to be one SSTable. STCS only gets there, the natural way, once 
in a blue moon. But that's the most optimal state that it can be in. Am I wrong?

The only explanation I can see for splitting the result of compacting all 
SSTables into fragments, is if those fragments are:
1. Partitioned smartly. For example into separate token ranges (à la LCS), 
timestamp ranges (à la DTCS) or clustering column ranges (which would be 
interesting). Or a combination of these.
2. The structure upheld by the resulting fragments is not subsequently 
demolished by the running compaction strategy going on with its usual business.

> Improve tombstone compactions
> -
>
> Key: CASSANDRA-7019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Branimir Lambov
>  Labels: compaction
> Fix For: 3.0
>
>
> When there are no other compactions to do, we trigger a single-sstable 
> compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those 
> compactions to be able to actually drop the tombstones. Might only be doable 
> with LCS (with STCS we would probably end up including all sstables)

[jira] [Comment Edited] (CASSANDRA-7019) Improve tombstone compactions

2015-02-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14307963#comment-14307963
 ] 

Björn Hegerfors edited comment on CASSANDRA-7019 at 2/5/15 8:59 PM:


I posted a related ticket some time ago, CASSANDRA-8359. In particular, the 
side note at the end is essentially this ticket exactly, for DTCS. A solution 
to this ticket may or may not solve the main issue in that ticket, but that's a 
matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have 
the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it 
seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables 
should always target the _n_ oldest ones. The oldest one alone is practically 
guaranteed to overlap with any other SSTable, in terms of tokens. So picking 
the right SSTables for multi-tombstone compaction should be as easy as sorting 
by age (min timestamp), taking the oldest one, and including the newer ones in 
succession, checking at which point the tombstone ratio is the highest. Or 
something close to that, anyway. Then we might as well write them back as a 
single SSTable, I don't see why not.

EDIT: moved the following to CASSANDRA-7272, where it belongs.

-As for the STCS case, I don't understand why major compaction for STCS isn't 
already optimal. I do see why one might want to compact some but not all 
SSTables in a multi-tombstone compaction (though DTCS should be a better fit 
for anyone wanting this). But if every single SSTable is being rewritten to 
disk, why not write them into one file? As far as I understand, the ultimate 
goal of STCS is to be one SSTable. STCS only gets there, the natural way, once 
in a blue moon. But that's the most optimal state that it can be in. Am I 
wrong?-

-The only explanation I can see for splitting the result of compacting all 
SSTables into fragments, is if those fragments are:-
-1. Partitioned smartly. For example into separate token ranges (à la LCS), 
timestamp ranges (à la DTCS) or clustering column ranges (which would be 
interesting). Or a combination of these.-
-2. The structure upheld by the resulting fragments is not subsequently 
demolished by the running compaction strategy going on with its usual business.-


was (Author: bj0rn):
I posted a related ticket some time ago, CASSANDRA-8359. In particular, the 
side note at the end is essentially this ticket exactly, for DTCS. A solution 
to this ticket may or may not solve the main issue in that ticket, but that's a 
matter for that ticket.

Since DTCS SSTables are (supposed to be) separated into time windows, we have 
the concept of an _oldest_ SSTable in a way that we don't with STCS. To me it 
seems pretty clear that a multi-SSTable tombstone compaction on _n_ SSTables 
should always target the _n_ oldest ones. The oldest one alone is practically 
guaranteed to overlap with any other SSTable, in terms of tokens. So picking 
the right SSTables for multi-tombstone compaction should be as easy as sorting 
by age (min timestamp), taking the oldest one, and including the newer ones in 
succession, checking at which point the tombstone ratio is the highest. Or 
something close to that, anyway. Then we might as well write them back as a 
single SSTable, I don't see why not.

EDIT: moved all of the below to CASSANDRA-7272, where it belongs.

-As for the STCS case, I don't understand why major compaction for STCS isn't 
already optimal. I do see why one might want to compact some but not all 
SSTables in a multi-tombstone compaction (though DTCS should be a better fit 
for anyone wanting this). But if every single SSTable is being rewritten to 
disk, why not write them into one file? As far as I understand, the ultimate 
goal of STCS is to be one SSTable. STCS only gets there, the natural way, once 
in a blue moon. But that's the most optimal state that it can be in. Am I 
wrong?-

-The only explanation I can see for splitting the result of compacting all 
SSTables into fragments, is if those fragments are:-
-1. Partitioned smartly. For example into separate token ranges (à la LCS), 
timestamp ranges (à la DTCS) or clustering column ranges (which would be 
interesting). Or a combination of these.-
-2. The structure upheld by the resulting fragments is not subsequently 
demolished by the running compaction strategy going on with its usual business.-

> Improve tombstone compactions
> -
>
> Key: CASSANDRA-7019
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7019
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Marcus Eriksson
>Assignee: Branimir Lambov
>  Labels: compaction
> Fix For: 3.0
>
>
> When there are no other compactions to do, we trigger a single-sstable 
> compaction if there is more than X% droppable tombstones in the sstable.
> In this ticket we should try to include overlapping sstables in those 
> compactions to be able to actually drop the tombstones. Might only be doable 
> with LCS (with STCS we would probably end up including all sstables)