[ https://issues.apache.org/jira/browse/CASSANDRA-12526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493423#comment-16493423 ]
Alex Petrov edited comment on CASSANDRA-12526 at 5/29/18 11:37 AM:
-------------------------------------------------------------------

Thank you for the great patch, and sorry for taking so long to review. I mostly have small nits/comments, as the patch looks solid and all testing so far has yielded good results.

* In [metadataChanged|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-c24601ca8b77db9628351c9c8ac83979R299] and the [SSTableMetadataChanged notification|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-6cf05413c8d72c15fbbd512ce21ddca0R28] it might make sense to pass the new metadata along with the old metadata (see the first sketch after this list).
* Should we add a (possibly smaller/shorter) negative test? E.g. make sure that under the same conditions as in {{compactionTest}}, but with {{single_sstable_uplevel: false}}, we get a "normal" compaction task instead.
* There are shortcuts for {{getDefaultCFS().disableAutoCompaction()}} and {{getDefaultCFS().forceBlockingFlush()}} in {{CQLTester}}. There are multiple places, for example [here|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-39f5a435ed2a85b43174405802edcdbaR75], where we could use those.
* [Here|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-403d518f40817cabdab5449071a41b50R165] we could re-order the conditionals and short-circuit on {{singleSSTableUplevel}}, since if this feature isn't on we won't ever get to the second clause (see the second sketch after this list).
* It might be good to add a short comment, or indicate in the name, that {{SingleSSTableLCSTask}} is kind of a no-op (it doesn't perform a real compaction).
* We've also discussed offline that [mutateLevel|https://github.com/apache/cassandra/compare/trunk...krummas:marcuse/12526#diff-d6d1a843ef8c25484d740d82a4746644R75] is safe here, since the sstables are marked as compacting and the file rename is atomic.

Let me know what you think.
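For the first point, a minimal sketch of what the notification could look like if the new stats travel alongside the old ones; the field layout below is an assumption for illustration, not the code from the patch:

{code}
public class SSTableMetadataChanged implements INotification
{
    public final SSTableReader sstable;
    // carrying both values lets subscribers see exactly what changed
    public final StatsMetadata oldMetadata;
    public final StatsMetadata newMetadata;

    public SSTableMetadataChanged(SSTableReader sstable, StatsMetadata oldMetadata, StatsMetadata newMetadata)
    {
        this.sstable = sstable;
        this.oldMetadata = oldMetadata;
        this.newMetadata = newMetadata;
    }
}
{code}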
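For the re-ordering point, a rough illustration of the idea (the helper names here are placeholders, not the actual patch code): checking the cheap {{singleSSTableUplevel}} flag first means the per-candidate clause is never evaluated when the feature is disabled.

{code}
// illustrative only: test the option flag before the more specific condition,
// since the second clause can never apply when single-sstable up-level is off
if (singleSSTableUplevel && candidates.size() == 1)
    return createSingleSSTableUplevelTask(candidates);    // metadata-only "compaction"
else
    return createNormalLeveledCompactionTask(candidates); // regular rewrite
{code}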
> For LCS, single SSTable up-level is handled inefficiently
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-12526
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12526
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Wei Deng
>            Assignee: Marcus Eriksson
>            Priority: Major
>              Labels: compaction, lcs, performance
>             Fix For: 4.x
>
>
> I'm using the latest trunk (as of August 2016, which probably is going to be 3.10) to run some experiments on LeveledCompactionStrategy and noticed this inefficiency.
> The test data is generated using cassandra-stress default parameters (keyspace1.standard1), so as you can imagine, it consists of a ton of newly inserted partitions that will never merge in compactions, which is probably the worst kind of workload for LCS (however, I'll detail later why this scenario should not be ignored as a corner case; for now, let's just assume we still want to handle this scenario efficiently).
> After the compaction test is done, I scrubbed debug.log for patterns that match the "Compacted" summary so that I can see how long each individual compaction took and how many bytes they processed. The search pattern is like the following:
> {noformat}
> grep 'Compacted.*standard1' debug.log
> {noformat}
> Interestingly, I noticed a lot of the finished compactions are marked as having *only one* SSTable involved. With the workload mentioned above, the "single SSTable" compactions actually make up the majority of all compactions (as shown below), so their efficiency can affect the overall compaction throughput quite a bit.
> {noformat}
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | wc -l
> 243
> automaton@0ce59d338-1:~/cassandra-trunk/logs$ grep 'Compacted.*standard1' debug.log-test1 | grep ") 1 sstable" | wc -l
> 218
> {noformat}
> By looking at the code, it appears that there's a way to directly edit the level of a particular SSTable like the following:
> {code}
> sstable.descriptor.getMetadataSerializer().mutateLevel(sstable.descriptor, targetLevel);
> sstable.reloadSSTableMetadata();
> {code}
> To be exact, I summed up the time spent on these single-SSTable compactions (the total data size is 60GB) and found that if each compaction only needed to spend 100ms on the metadata change alone (instead of the 10+ seconds they take now), it would already achieve a 22.75% saving on total compaction time.
> Compared to what we have now (reading the whole single SSTable from the old level and writing out the same single SSTable at the new level), the only difference I could think of with this approach is that the new SSTable will have the same file name (sequence number) as the old one, which could break some assumptions in other parts of the code. However, not having to go through the full read/write IO, and not having to bear the overhead of cleaning up the old file, creating the new file, and creating more churn in the heap and file buffers, it seems the benefits outweigh the inconvenience. So I'd argue this JIRA belongs to LHF and should be made available in 3.0.x as well.
> As mentioned in the 2nd paragraph, I'm also going to address why this kind of all-new-partition workload should not be ignored as a corner case. Basically, for the main use case of LCS, where you need to frequently merge partitions to optimize reads and eliminate tombstones and expired data sooner, LCS can be perfectly happy and efficiently perform the partition merging and tombstone elimination for a long time. However, as soon as the node becomes a bit unhealthy for various reasons (could be a bad disk, so it's missing a whole bunch of mutations and needs repair; could be the user choosing to ingest way more data than the node usually takes, exceeding its capacity; or, god forbid, some DBA choosing to run the offline sstablelevelreset), you will have to handle this kind of "all-new-partition with a lot of SSTables in L0" scenario, and once all L0 SSTables finally get up-leveled to L1, you will likely see a lot of such single-SSTable compactions, which is the situation this JIRA is intended to address.
> Actually, when I think more about this, making this kind of single-SSTable up-level more efficient will not only help the all-new-partition scenario, but also help in general any time there is a big backlog of L0 SSTables due to too many flushes or excessive repair streaming with vnodes. In those situations, by default STCS_in_L0 will be triggered, and you will end up with a bunch of much bigger L0 SSTables after STCS is done. When it's time to up-level those much bigger L0 SSTables, most likely they will overlap among themselves, and you will add them all into your compaction session (along with all overlapping L1 SSTables). Those much bigger L0 SSTables have already gone through a few rounds of STCS compactions, so whatever partition merging was needed (because fragments of the same partition were dispersed across the smaller L0 SSTables earlier) has largely been done in those STCS rounds; what you end up with in the bigger L0 SSTables (generated by STCS) offers little further opportunity for partition merging, so we're in a scenario very similar to the L0 data that "consists of a ton of newly inserted partitions that will never merge in compactions" mentioned earlier.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)