[ https://issues.apache.org/jira/browse/CASSANDRA-7871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14125284#comment-14125284 ]
Marcus Eriksson edited comment on CASSANDRA-7871 at 9/8/14 7:50 AM:
--------------------------------------------------------------------

I have not had time to properly verify your calculations yet, but I have a few first comments/thoughts:

* https://issues.apache.org/jira/browse/CASSANDRA-5727 has some background on why we picked 160MB as the default sstable size (yes, we should redo the same benchmarks with different level multipliers)
* With level multiplier=4 the data size calculation is a bit off - the top level would contain less than 75% of the data (but maybe that is still good enough?)
* Having more levels means hitting more sstables when doing reads, and I guess this is why 10 was chosen in the first place, as a tradeoff between compaction IO and the number of sstables to touch during reads

I'll get back with a more thorough review soon.


Reduce compaction IO in LCS
---------------------------

Key: CASSANDRA-7871
URL: https://issues.apache.org/jira/browse/CASSANDRA-7871
Project: Cassandra
Issue Type: New Feature
Components: Core
Reporter: Dan Hendry
Assignee: Dan Hendry
Fix For: 2.0.11
Attachments: LeveledCompactionImprovement-2.0.10.patch, experiment.png, levelmultiplier.png, sstablesize.png

I have found LCS to be superior to STCS in almost every way - except for the fact that it requires significantly more IO (a well-advertised property). In leveled compaction, L~n+1~ is 10 times larger than L~n~, so generally 1+10 sstables need to be compacted to promote one sstable into the next level. For certain workloads, this means only 1/(10+1) ≈ 9% of the IO, specifically write IO, is doing ‘useful’ work.

But why is each level 10 times larger? Why 10? It’s a pretty-looking number, but that’s not a very good reason to choose it. If we chose 5 or even 2, we could reduce the ‘wasted’ IO required to promote an sstable to the next level - of course at the expense of requiring more levels. I have not been able to find justification for this choice in either Cassandra or LevelDB itself. I would like to introduce a new parameter, the leveling multiplier, which controls the desired size difference between L~n~ and L~n+1~.
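As a rough illustration of the ‘useful work’ fraction, here is a minimal Python sketch (illustrative only, not part of the attached patch; it assumes each promotion compacts one sstable together with exactly (leveling multiplier) overlapping sstables in the next level):

{noformat}
# Sketch: fraction of promotion write IO that is 'useful' under LCS.
# Assumption: promoting one sstable rewrites it together with exactly
# `multiplier` overlapping sstables in the next level.
def useful_fraction(multiplier):
    return 1.0 / (1 + multiplier)

for m in (2, 4, 5, 10):
    print("multiplier %2d -> %5.1f%% useful" % (m, 100 * useful_fraction(m)))
# multiplier 10 -> 9.1% useful; multiplier 4 -> 20.0% useful
{noformat}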
First and foremost, a little math. Let’s assume we have a CF of a fixed size that is receiving continuous new data (i.e., data is expiring due to TTLs or is being overwritten). I believe the number of levels required is approximately (see note 1):

{noformat}
data size = (sstable size) * (leveling multiplier)^(level count)
{noformat}

Which, when solved for the level count, becomes:

{noformat}
level count = log((data size)/(sstable size)) / log(leveling multiplier)
{noformat}

The amount of compaction write IO required over the lifetime of a particular piece of data (excluding compactions in L0) is:

{noformat}
write IO = (flush IO) + (promotion IO)*(level count)
write IO = 1 + (1 + (leveling multiplier)) * log((data size)/(sstable size)) / log(leveling multiplier)
{noformat}

So ultimately, the relationship between write IO and the level multiplier is f(x) = (1 + x)/log(x), which is minimized at x ≈ 3.59, or 4 if we round to the nearest integer. Also note that write IO is proportional to log((data size)/(sstable size)), which suggests that using larger sstables would also reduce disk IO.

As one final analytical step, we can add the following term to approximate STCS in L0 (which is not actually how it is implemented, but should be close enough for moderate sstable sizes):

{noformat}
L0 write IO = max(0, floor(log((sstable size)/(flush size)) / log(4)))
{noformat}

The following two graphs illustrate the predicted compaction requirements as a function of the leveling multiplier and the sstable size:

!levelmultiplier.png! !sstablesize.png!

To verify the expected results empirically, I set up three Cassandra nodes: node A with a leveling multiplier of 10 and an sstable size of 160 MB (the current Cassandra defaults), node B with multiplier 4 and size 160 MB, and node C with multiplier 4 and size 1024 MB. I used a simple write-only workload which inserted data having a TTL of 2 days at 1 MB/second (see note 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes had dedicated data disks, and IO measurements were for the data disks only.

!experiment.png!

||Measure||Node A (10, 160 MB)||Node B (4, 160 MB)||Node C (4, 1024 MB)||
|Predicted IO Rate|34.4 MB/s|26.2 MB/s|20.5 MB/s|
|Predicted Improvement|n/a|23.8%|40.4%|
|Predicted Number of Levels (Expected Dataset of 169 GB)|3.0|5.0|3.7|
|Experimental IO Rate|32.0 MB/s|28.0 MB/s|20.4 MB/s|
|Experimental Improvement|n/a|12.4%|*36.3%*|
|Experimental Number of Levels|~4.1|~6.1|~4.8|
|Final Dataset Size (After 88 hours)|301 GB|261 GB|258 GB|
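For concreteness, the predicted rows of the table above can be reproduced directly from the formulas. The Python sketch below does so; it is illustrative only, and the 160 MB flush size used for the L0 term is an assumption, not a measured value:

{noformat}
import math

def level_count(data_mb, sstable_mb, multiplier):
    # level count = log((data size)/(sstable size)) / log(leveling multiplier)
    return math.log(data_mb / sstable_mb) / math.log(multiplier)

def write_io(data_mb, sstable_mb, multiplier, flush_mb=160.0):
    # write IO = 1 (flush) + (1 + multiplier)*(level count) + L0 term
    # flush_mb=160.0 is an assumed memtable flush size, not a measured value
    l0 = max(0, math.floor(math.log(sstable_mb / flush_mb) / math.log(4)))
    return 1 + (1 + multiplier) * level_count(data_mb, sstable_mb, multiplier) + l0

data_mb = 169 * 1024.0  # expected dataset of 169 GB
for node, m, s in (("A", 10, 160), ("B", 4, 160), ("C", 4, 1024)):
    print("node %s: %.1f levels, %.1f MB/s write IO at 1 MB/s ingest"
          % (node, level_count(data_mb, s, m), write_io(data_mb, s, m)))
# node A: 3.0 levels, 34.4 MB/s; node B: 5.0 levels, 26.2 MB/s;
# node C: 3.7 levels, 20.5 MB/s

# The per-promotion cost factor f(x) = (1 + x)/log(x) is minimized near 3.59:
print(min((x / 100.0 for x in range(200, 600)),
          key=lambda x: (1 + x) / math.log(x)))
{noformat}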
These results indicate that node A performed better than expected; I suspect this was because the data insertion rate was a little too high and compaction periodically got backlogged, meaning the promotion from L0 to L1 was more efficient. Also note that the actual dataset size is larger than that used in the analytical model - which is expected, as expired data will not get purged immediately. The size difference between node A and the others, however, seems suspicious to me.

In summary, these results, both theoretical and experimental, clearly indicate that reducing the level multiplier from 10 to 4 and increasing the sstable size reduce compaction IO. The experimental results, using an sstable size of 1024 MB and a level multiplier of 4, demonstrated a 36% reduction in write IO without a significant increase in the number of levels. I have not run benchmarks for an update-heavy workload, but I suspect it would benefit significantly since more data can be ‘updated’ per compaction. I have also not benchmarked read performance, but I would not expect noticeable performance degradation provided an sstable size is chosen which keeps the number of levels roughly equal.

The patch I have attached is against 2.0.10 and does not change the defaults. Long term, however, it would make sense to use more optimal defaults unless there is compelling counter-evidence to the performance gains observed.

One final observation: in current leveled compaction, the number of levels is determined by the amount of data and the user-specified sstable size. A compaction strategy where the user instead selected the desired number of levels, and the strategy adjusted the sstable size based on the amount of data, would have a number of benefits. The strategy would behave more consistently across a much wider range of dataset sizes. Compaction IO overhead (as a function of write rate) and worst-case read performance (number of sstables per read) would both be largely independent of dataset size.

Note 1: This equation only calculates the amount of data able to fit in the largest level. It would be more accurate to take into account the data in smaller levels (i.e., using the geometric series equation), but this is a close enough approximation. There is also the fact that redundant data might be spread across the various levels.

Note 2: This represents the entropy introduction rate and does not account for any Cassandra overhead; compression was also enabled. The row key was a long, each row had 512 columns, the column name was a UUID, and the column value was a 64-byte blob.
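As a quick check of note 1 (and of the comment above about the top level holding less of the data with a smaller multiplier), here is a small sketch of the geometric-series refinement; it is illustrative only and assumes every level is exactly at its target size:

{noformat}
# Sketch: fraction of total data sitting in the top level when level i
# holds exactly (sstable size) * multiplier^i, for i = 1..levels.
# Assumption: all levels are full; a partially filled top level lowers
# the fraction further.
def top_level_fraction(multiplier, levels):
    sizes = [multiplier ** i for i in range(1, levels + 1)]
    return float(sizes[-1]) / sum(sizes)

print(top_level_fraction(10, 3))  # ~0.90: note 1 is a good fit at multiplier 10
print(top_level_fraction(4, 5))   # ~0.75: the approximation is looser at multiplier 4
{noformat}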