[ https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493423#comment-16493423 ]
Alex Petrov edited comment on CASSANDRA-12526 at 5/29/18 11:37 AM:
-------------------------------------------------------------------

Thank you for the great patch, and sorry for taking so long to review. I mostly have small nits/comments, as the patch looks solid and all testing so far has yielded good results.

* In [metadataChanged|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-c24601ca8b77db9628351c9c8ac83979R299] and the [SSTableMetadataChanged notification|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-6cf05413c8d72c15fbbd512ce21ddca0R28] it might make sense to pass the new metadata along with the old metadata (see the first sketch after this list).
* Should we add a (possibly smaller/shorter) negative test? E.g. make sure that under the same conditions as in {{compactionTest}}, but with {{single_sstable_uplevel: false}}, we get a "normal" compaction task instead.
* There are shortcuts for {{getDefaultCFS().disableAutoCompaction()}} and {{getDefaultCFS().forceBlockingFlush()}} in {{CQLTester}}. There are multiple places, for example [here|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-39f5a435ed2a85b43174405802edcdbaR75], where we could use those.
* [Here|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-403d518f40817cabdab5449071a41b50R165] we could re-order the conditionals and short-circuit on {{singleSSTableUplevel}}, since if this feature isn't on we won't ever get to the second clause (see the second sketch after this list).
* It might be good to add a short comment, or indicate in the name, that {{SingleSSTableLCSTask}} is kind of a no-op (it doesn't perform a real compaction).
* We've also discussed offline that [mutateLevel|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-d6d1a843ef8c25484d740d82a4746644R75] is safe here, since the sstables are marked as compacting and the file rename is atomic.

Let me know what you think.
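For the first point, a minimal sketch of what the notification could look like if the new stats travel alongside the old ones; the field layout below is an assumption for illustration, not the code from the patch:

{code}
public class SSTableMetadataChanged implements INotification
{
    public final SSTableReader sstable;
    // carrying both values lets subscribers see exactly what changed
    public final StatsMetadata oldMetadata;
    public final StatsMetadata newMetadata;

    public SSTableMetadataChanged(SSTableReader sstable, StatsMetadata oldMetadata, StatsMetadata newMetadata)
    {
        this.sstable = sstable;
        this.oldMetadata = oldMetadata;
        this.newMetadata = newMetadata;
    }
}
{code}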
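For the re-ordering point, a rough illustration of the idea (the helper names here are placeholders, not the actual patch code): checking the cheap {{singleSSTableUplevel}} flag first means the per-candidate clause is never evaluated when the feature is disabled.

{code}
// illustrative only: test the option flag before the more specific condition,
// since the second clause can never apply when single-sstable up-level is off
if (singleSSTableUplevel && candidates.size() == 1)
    return createSingleSSTableUplevelTask(candidates);    // metadata-only "compaction"
else
    return createNormalLeveledCompactionTask(candidates); // regular rewrite
{code}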
> For LCS, single SSTable up-level is handled inefficiently
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-12526
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Wei Deng
>            Assignee: Marcus Eriksson
>            Priority: Major
>              Labels: compaction, lcs, performance
>             Fix For: 4.x
>
>
> I'm using the latest trunk (as of August 2016, which probably is going to be 3.10) to run some experiments on LeveledCompactionStrategy and noticed this inefficiency.
> The test data is generated using cassandra-stress default parameters (keyspace1.standard1), so as you can imagine, it consists of a ton of newly inserted partitions that will never merge in compactions, which is probably the worst kind of workload for LCS (however, I'll detail later why this scenario should not be ignored as a corner case; for now, let's just assume we still want to handle this scenario efficiently).
> After the compaction test is done, I scrubbed debug.log for patterns that match the "Compacted" summary so that I can see how long each individual compaction took and how many bytes they processed. The search pattern is like the following:
> {noformat}
> grep 'Compacted.*standard1' debug.log
> {noformat}
> Interestingly, I noticed a lot of the finished compactions are marked as having *only one* SSTable involved. With the workload mentioned above, the "single SSTable" compactions actually make up the majority of all compactions (as shown below), so their efficiency can affect the overall compaction throughput quite a bit.
> {noformat}
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | wc -l
> 243
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | grep ") 1 sstable" | wc -l
> 218
> {noformat}
> By looking at the code, it appears that there's a way to directly edit the level of a particular SSTable like the following:
> {code}
> sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
> sstable.reloadSSTableMetadata();
> {code}
> To be exact, I summed up the time spent on these single-SSTable compactions (the total data size is 60GB) and found that if each compaction only needed to spend 100ms on the metadata change alone (instead of the 10+ seconds they take now), it would already achieve a 22.75% saving on total compaction time.
> Compared to what we have now (reading the whole single SSTable from the old level and writing out the same single SSTable at the new level), the only difference I could think of with this approach is that the new SSTable will have the same file name (sequence number) as the old one, which could break some assumptions in other parts of the code. However, not having to go through the full read/write IO, and not having to bear the overhead of cleaning up the old file, creating the new file, and creating more churn in the heap and file buffers, it seems the benefits outweigh the inconvenience. So I'd argue this JIRA belongs to LHF and should be made available in 3.0.x as well.
> As mentioned in the 2nd paragraph, I'm also going to address why this kind of all-new-partition workload should not be ignored as a corner case. Basically, for the main use case of LCS, where you need to frequently merge partitions to optimize reads and eliminate tombstones and expired data sooner, LCS can be perfectly happy and efficiently perform the partition merging and tombstone elimination for a long time. However, as soon as the node becomes a bit unhealthy for various reasons (could be a bad disk, so it's missing a whole bunch of mutations and needs repair; could be the user choosing to ingest way more data than the node usually takes, exceeding its capacity; or, god forbid, some DBA choosing to run the offline sstablelevelreset), you will have to handle this kind of "all-new-partition with a lot of SSTables in L0" scenario, and once all L0 SSTables finally get up-leveled to L1, you will likely see a lot of such single-SSTable compactions, which is the situation this JIRA is intended to address.
> Actually, when I think more about this, making this kind of single-SSTable up-level more efficient will not only help the all-new-partition scenario, but also help in general any time there is a big backlog of L0 SSTables due to too many flushes or excessive repair streaming with vnodes. In those situations, by default STCS_in_L0 will be triggered, and you will end up with a bunch of much bigger L0 SSTables after STCS is done. When it's time to up-level those much bigger L0 SSTables, most likely they will overlap among themselves, and you will add them all into your compaction session (along with all overlapping L1 SSTables). Those much bigger L0 SSTables have already gone through a few rounds of STCS compactions, so whatever partition merging was needed (because fragments of the same partition were dispersed across the smaller L0 SSTables earlier) has largely been done in those STCS rounds; what you end up with in the bigger L0 SSTables (generated by STCS) offers little further opportunity for partition merging, so we're in a scenario very similar to the L0 data that "consists of a ton of newly inserted partitions that will never merge in compactions" mentioned earlier.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)