[ https://issues.apache.org/jira/browse/CASSANDRA-18945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782692#comment-17782692 ]
Branimir Lambov commented on CASSANDRA-18945:
---------------------------------------------

{quote}+ assert !(count < 0); // Must be positive, 0 or NaN, which should translate to baseShardCount
Review Comment: @ethan-brown2022 `count >= 0` is more natural to me
{quote}
I can't find this to reply to it directly. The comment at the end of the line says that {{count}} can be {{NaN}}, which will fail {{count >= 0}} but pass {{!(count < 0)}}. Perhaps we should change the bit after NaN to "(which would fail {{count >= 0}}, but is acceptable and should translate to baseShardCount)" or something similar?

> Unified Compaction Strategy is creating too many sstables
> ---------------------------------------------------------
>
> Key: CASSANDRA-18945
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18945
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Compaction
> Reporter: Branimir Lambov
> Assignee: Ethan Brown
> Priority: Normal
> Fix For: 5.0-beta
>
> Attachments: file_ucs_shenandoah.html, file_ucs_shenandoah_3.html,
> file_ucs_shenandoah_off_heap_memtable.html,
> file_ucs_shenandoah_on_heap_memtable_2.html,
> file_ucs_shenandoah_on_heap_memtable_3.html, key-value-oss.html
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> The unified compaction strategy currently aims to create sstables with close
> to the same size, defaulting to 1 GiB. Unfortunately tests show that
> Cassandra starts to have performance problems when the number of sstables
> grows to the order of a thousand, and in particular that even 1 TiB of data
> with the default configuration is creating too many sstables for efficient
> processing. This matters even more for SAI, where the number of sstables in
> the system can have a proportional effect on the complexity of operations.
> It is quite easy to create a configuration option that allows sstables to
> take some part of the data growth by adding a multiplier to [the shard count
> calculation|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/UnifiedCompactionStrategy.md#sharding]
> formula, replacing
> {{2 ^ round(log2(d / (t * b))) * b}}
> with
> {{2 ^ round((1 - 𝜆) * log2(d / (t * b))) * b}},
> where 𝜆 is a parameter whose value is between 0 and 1.
> With this, a 𝜆 of 0.5 would mean that shard count and sstable size both grow
> with the square root of the data size. A 𝜆 of 0 would result in no sstable
> size growth (the formula reduces to the current one), and a 𝜆 of 1 in always
> using the same number of shards.
> It may also be valuable to introduce a threshold for engaging the base shard
> count to avoid splitting lowest-level sstables into fragments that are too
> small.
> Once both of these are in place, we can set defaults that better suit all
> node densities, including 10 TiB and beyond, for example:
> - target size of 1 GiB
> - 𝜆 of 1/3
> - base shard count of 4
> - minimum size 100 MiB
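
To make the modified shard count formula and the NaN point from the review comment concrete, here is a minimal Java sketch. It is not the code under review: the class name {{ShardCountSketch}}, the method {{shardCount}} and its parameters are hypothetical, the fallback to the base shard count for ratios at or below 1 is an assumption, and the minimum-size threshold from the description is left out.

```java
// Hypothetical sketch of the formula from the description:
//   2 ^ round((1 - lambda) * log2(d / (t * b))) * b
// Names and the fallback rule are assumptions, not the actual Cassandra code.
final class ShardCountSketch {
    private ShardCountSketch() {}

    /**
     * @param density        data size d, in the same unit as targetSize
     * @param targetSize     target sstable size t
     * @param baseShardCount base shard count b
     * @param lambda         growth parameter between 0 and 1
     */
    static long shardCount(double density, double targetSize, int baseShardCount, double lambda) {
        double count = density / (targetSize * baseShardCount);

        // Every comparison with NaN is false, so NaN is not rejected here.
        // A check written as `count >= 0` would reject it.
        if (count < 0)
            throw new IllegalArgumentException("negative density ratio: " + count);

        // Assumption: ratios of at most 1, and NaN, use the base shard count.
        // `!(count > 1)` is true for NaN, again because NaN comparisons are false.
        if (!(count > 1))
            return baseShardCount;

        double log2 = Math.log(count) / Math.log(2);
        long exponent = Math.round((1 - lambda) * log2);
        // Cap the shift so the sketch cannot overflow a long for extreme inputs.
        exponent = Math.min(exponent, 40);
        return (1L << exponent) * baseShardCount;
    }
}
```

With sizes expressed in GiB, 10 TiB of data, a 1 GiB target and a base shard count of 4 give a ratio of 2560. Under this sketch a 𝜆 of 0 yields 8192 shards, a 𝜆 of 1/3 yields 1024, and a 𝜆 of 1 yields the base count of 4.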