[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams updated CASSANDRA-16634:
-----------------------------------------
    Test and Documentation Plan: test included
                         Status: Patch Available  (was: Open)

> Garbagecollect should not output all tables to L0 with
> LeveledCompactionStrategy
> --------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16634
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16634
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> nodetool garbagecollect always outputs to L0 with LeveledCompactionStrategy.
> This is awful. On a large LCS table, this means that at the end of the
> garbagecollect process, all data is in L0.
>
> This results in an awful sequence of useless temporary space usage and write
> amplification:
> # L0 is repeatedly size-tiered compacted until it no longer has too many
> SSTables. If the original LCS table had 2000 tables, this takes a long time.
> # L0 is compacted to L1 in one or a few very large compactions.
> # L1 is compacted to L2, L2 to L3, L3 to L4, etc. Write amplification galore.
> Due to the above, 'nodetool garbagecollect' is close to worthless for large
> LCS tables. A full compaction always has less write amplification and requires
> similar temporary disk space. The only exception is if you run 'nodetool
> garbagecollect' part-way and then use 'nodetool stop' to cancel it before L0
> grows too large. In that case, if you are lucky and the order in which it chose
> to process SSTables coincides with the tables that have the most disk space to
> clear, you might free up enough disk space to achieve your original goal.
>
> However, from what I can tell, there is no good reason to move the output to
> L0.
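The cost difference described above can be sketched with a back-of-the-envelope model. This is not Cassandra code, and the level sizes below are hypothetical (LCS levels grow by a fanout of roughly 10); it only illustrates that routing everything through L0 rewrites the data once per level it must descend through, while an in-place rewrite touches each byte once.

```python
# Toy model (not Cassandra code) of write amplification when
# garbagecollect sends all output to L0 versus rewriting in place.

def rewrite_cost_via_l0(level_bytes):
    """Approximate bytes rewritten if all data lands in L0 and must
    be pushed back down through every level (L0->L1, L1->L2, ...):
    roughly total data size times the number of levels."""
    total = sum(level_bytes.values())
    return total * len(level_bytes)

def rewrite_cost_in_place(level_bytes):
    """Bytes rewritten if each SSTable is rewritten in its own level:
    each byte is touched exactly once."""
    return sum(level_bytes.values())

# Hypothetical per-level sizes in MiB, fanout ~10.
levels = {1: 1_600, 2: 16_000, 3: 160_000, 4: 1_600_000}
print(rewrite_cost_in_place(levels))  # each byte rewritten once
print(rewrite_cost_via_l0(levels))    # ~num_levels times more writes
```

Under this model the L0 route costs roughly a factor of the level count more writes, which matches the "write amplification galore" sequence above.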
> Leaving the output table in the same SSTable level as the source table
> does not violate any of the LeveledCompactionStrategy placement rules, as the
> output by definition has a token range equal to or smaller than the source.
> The only drawback is if the size of the output files is significantly smaller
> than the source, in which case the source level would be under-sized. But
> that seems like a problem that LCS has to handle, not garbagecollect.
> LCS could have a "pull up" operation where it does something like the
> following. Assume a table has L4 as the max level, and L3 and L4 are both
> under-sized. L3 can attempt to 'pull up' any tables from L4 that do not
> overlap with the token ranges of the L3 tables. After that, it can choose to
> do some compactions that mix L3 and L4 to pull data up into L3 if it is still
> significantly under-sized.
> From what I can tell, garbagecollect should just re-write tables in place
> and leave the compaction strategy to deal with any consequences.
> Moving to L0 is a bad idea. In addition to the extra write amplification and
> the extreme increase in temporary disk space required, I observed the following:
> A 'nodetool garbagecollect' run was placing a lot of pressure on L0 of a node.
> We stopped it about 20% of the way through, and the node managed to compact down
> the top couple of levels. So we tried to run 'garbagecollect' again, but the
> first tables it chose to operate on were in L1, not the 'leaves' in L5! This
> was because the order in which SSTables are chosen currently does not consider
> the level, and instead looks purely at the max timestamp in the file. But
> because we had moved _very old_ data from L5 into L0 as a result of the prior
> garbagecollect, many tables in L1 and L2 now had very wide ranges between their
> min and max timestamps, essentially some of the oldest and newest data all
> in one table. This breaks the usual structure of an LCS table, where the
> oldest data is in the high levels.
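The proposed "pull up" operation can be sketched as a simple eligibility check. This is an illustrative model, not Cassandra code: token ranges are modeled as inclusive `(start, end)` tuples, and the invariant enforced is the LCS rule that tables within a level (above L0) must not overlap.

```python
# Illustrative sketch (not Cassandra code) of the "pull up" idea:
# a table in level N+1 may be promoted into an under-sized level N
# if and only if its token range overlaps no table already in level N
# (including tables promoted earlier in the same pass).

def overlaps(a, b):
    """True if inclusive token ranges a and b intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def pull_up_candidates(upper_level, lower_level):
    """Return tables from upper_level (e.g. L4) that can be promoted
    into lower_level (e.g. L3) without violating the LCS invariant
    that tables within a level are non-overlapping."""
    promoted = []
    for table in upper_level:
        if all(not overlaps(table, t) for t in lower_level + promoted):
            promoted.append(table)
    return promoted

l3 = [(0, 10), (20, 30)]
l4 = [(11, 19), (25, 40), (50, 60)]
print(pull_up_candidates(l4, l3))  # [(11, 19), (50, 60)]
```

Here (25, 40) stays in L4 because it overlaps (20, 30) in L3; the other two tables fill gaps in L3's token coverage and can be pulled up without any rewrite. Tables that cannot be pulled up directly would need the mixed L3/L4 compactions described above.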
>
> I hope that others agree that this is a bug, and deserving of a fix.
> I have a very simple patch for this that I will be creating a PR for soon: 3
> lines for the code change, 70 lines for a new unit test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)