[ https://issues.apache.org/jira/browse/CASSANDRA-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456278#comment-15456278 ]
Wei Deng edited comment on CASSANDRA-12591 at 9/2/16 10:32 PM:
---------------------------------------------------------------

I ran some quick initial tests on the latest trunk (i.e. C* 3.10) to check whether this is a worthwhile effort. The hardware is still not what I would consider adequate for a production Cassandra deployment (GCE n1-standard-4 with 4 vCPUs, 15GB RAM and a single 1TB spindle-based persistent disk), but I'm already seeing a positive sign that a bigger max_sstable_size helps compaction throughput.

Test setup: at each max_sstable_size I did three runs from scratch. For all runs I set compaction threads to 4. compaction-stress enforces no throttling, so this is equivalent to setting compaction_throughput_mb_per_sec to 0. The initial SSTable files generated by `compaction-stress write` use the default 128MB size, which is in line with the typical flush size I saw on this kind of hardware with the default cassandra.yaml configuration parameters.

Using 10GB of stress data generated by the blogpost data model [here|https://gist.githubusercontent.com/tjake/8995058fed11d9921e31/raw/a9334d1090017bf546d003e271747351a40692ea/blogpost.yaml], the overall compaction times with 1280MB max_sstable_size are: 7m16.456s, 7m7.225s, 7m9.102s; the overall compaction times with 160MB max_sstable_size are: 9m16.715s, 9m28.146s, 9m7.192s. Given these numbers, the average time to finish compaction is about 430.9 seconds with 1280MB max_sstable_size and about 557.4 seconds with 160MB max_sstable_size, which is already a 23% reduction in compaction time.

The above tests were conducted using the default compaction-stress parameters, which generate unique partitions for all writes, so they reflect the worst kind of workload for LCS.
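The averages and the relative improvement can be rechecked directly from the raw run times listed above. The following is only a small sketch for that arithmetic; the helper names (`to_seconds`, `mean_seconds`, `reduction_pct`) are made up for illustration and are not part of compaction-stress or any Cassandra tool.

```python
import re

# Raw wall-clock times from the three runs at each max_sstable_size
# (10GB of data, all-unique partitions), copied from the comment.
RUNS_1280MB = ["7m16.456s", "7m7.225s", "7m9.102s"]
RUNS_160MB = ["9m16.715s", "9m28.146s", "9m7.192s"]


def to_seconds(text):
    """Convert a time such as '7m16.456s' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_seconds(runs):
    """Arithmetic mean of a list of 'XmY.ZZZs' strings, in seconds."""
    return sum(to_seconds(r) for r in runs) / len(runs)


def reduction_pct(baseline, candidate):
    """Percentage by which `candidate` is shorter than `baseline`."""
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    avg_1280 = mean_seconds(RUNS_1280MB)
    avg_160 = mean_seconds(RUNS_160MB)
    print(f"1280MB average: {avg_1280:.2f}s")
    print(f"160MB average:  {avg_160:.2f}s")
    print(f"reduction:      {reduction_pct(avg_160, avg_1280):.1f}%")
```

The same helpers can be pointed at any other set of run times by swapping in different lists.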
With that in mind, I ran a second set of tests with "--partition-count=1000" to force compaction-stress to generate many overwrites of the same partitions. Keeping everything else the same and adding only this parameter, the overall compaction times with 1280MB max_sstable_size are: 4m59.307s, 4m52.002s, 5m0.967s; the overall compaction times with 160MB max_sstable_size are: 6m11.533s, 6m21.200s, 6m10.904s. As expected, these runs are faster than the "all unique partitions" scenario in the first set of tests, and on average 1280MB max_sstable_size finishes compaction in about 21% less time than 160MB max_sstable_size.

I realize 10GB of data is barely enough to test a 1280MB sstable size, as the data will only go from L0->L1, so for the next run I'm going to use 100GB of data on this hardware (keeping everything else the same) and see how the numbers compare.

> Re-evaluate the default 160MB sstable_size_in_mb choice in LCS
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-12591
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12591
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Compaction
>            Reporter: Wei Deng
>              Labels: lcs
>
> There has been some effort from CASSANDRA-5727 in benchmarking and evaluating
> the best max_sstable_size used by LeveledCompactionStrategy, and the
> conclusion derived from that effort was to use 160MB as the optimal size
> for both throughput (i.e.
> the time spent on compaction, the smaller the better) and the amount of
> bytes compacted (to avoid write amplification, the less the better).
> However, when I read more closely into that test report (the short
> [comment|https://issues.apache.org/jira/browse/CASSANDRA-5727?focusedCommentId=13722571&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13722571]
> describing the tests), I realized it was conducted on hardware with the
> following configuration: "a single rackspace node with 2GB of ram." I'm not
> sure whether this was an acceptable hardware configuration for a production
> Cassandra deployment at that time (mid-2013), but it is definitely far below
> today's hardware standards.
> Given that we now have compaction-stress, which can generate SSTables
> based on a user-defined stress profile with a user-defined table schema and
> compaction parameters (compatible with cassandra-stress), it would be
> worthwhile to re-examine this number using a more realistic hardware
> configuration and see whether 160MB is still the optimal choice. It might
> also change our perceived "practical" node density for LCS nodes if it turns
> out that a bigger max_sstable_size actually works better, as it would allow
> fewer SSTables (and hence fewer levels and less write amplification) per
> node at higher density.
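The relationship between max_sstable_size, SSTable count and level count mentioned in the description can be illustrated with a rough back-of-the-envelope model. This sketch assumes the default LCS fanout of 10 (level N holds roughly 10^N times the sstable size), assumes levels fill completely from L1 upward, and ignores L0, compression and overlap, so it is an approximation and not how Cassandra itself plans compactions. All function names are hypothetical.

```python
import math

FANOUT = 10  # assumed default LCS fanout between adjacent levels


def level_capacity_mb(level, sstable_size_mb, fanout=FANOUT):
    """Approximate capacity of level >= 1 as fanout**level * sstable size."""
    if level < 1:
        raise ValueError("level must be >= 1")
    return fanout ** level * sstable_size_mb


def levels_needed(data_mb, sstable_size_mb, fanout=FANOUT):
    """Smallest level L such that L1..L together can hold data_mb."""
    level = 1
    capacity = level_capacity_mb(level, sstable_size_mb, fanout)
    while capacity < data_mb:
        level += 1
        capacity += level_capacity_mb(level, sstable_size_mb, fanout)
    return level


def sstable_count(data_mb, sstable_size_mb):
    """Approximate number of SSTables if every file is full-sized."""
    return math.ceil(data_mb / sstable_size_mb)


if __name__ == "__main__":
    # Data sizes and sstable sizes taken from the comment thread.
    for data_gb in (10, 100):
        for size_mb in (160, 1280):
            data_mb = data_gb * 1024
            print(
                f"{data_gb:>3}GB data, {size_mb:>4}MB sstables: "
                f"~{sstable_count(data_mb, size_mb)} SSTables, "
                f"highest level L{levels_needed(data_mb, size_mb)}"
            )
```

Under these assumptions, 10GB of data fits entirely within L1 at 1280MB per SSTable, which matches the remark in the comment that a 10GB run only exercises L0->L1 at that size.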