[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-16634:
-----------------------------------------
          Bug Category: Parent values: Correctness(12982)
            Complexity: Normal
           Component/s: Local/Compaction
         Discovered By: User Report
         Fix Version/s: 4.0.x, 3.11.x
              Severity: Normal
                Status: Open  (was: Triage Needed)

> Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16634
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16634
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Compaction
>            Reporter: Scott Carey
>            Assignee: Scott Carey
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> nodetool garbagecollect always outputs to L0 with LeveledCompactionStrategy. This is awful. On a large LCS table, it means that at the end of the garbagecollect process all data is in L0.
>
> This results in an awful sequence of useless temporary space usage and write amplification:
> # L0 is repeatedly size-tiered compacted until it no longer has too many SSTables. If the original LCS table had 2000 tables, this takes a long time.
> # L0 is compacted to L1 in one to a few very large compactions.
> # L1 is compacted to L2, L2 to L3, L3 to L4, etc. Write amplification galore.
>
> Due to the above, 'nodetool garbagecollect' is close to worthless for large LCS tables. A full compaction always has less write amplification and requires similar temporary disk space. The only exception is if you run 'nodetool garbagecollect' part-way and then use 'nodetool stop' to cancel it before L0 grows too large. In that case, if you are lucky and the order in which it chose to process SSTables coincides with the tables that have the most reclaimable disk space, you might free enough space to achieve your original goal.
>
> However, from what I can tell, there is no good reason to move the output to L0. Leaving the output table at the same SSTable level as the source table does not violate any of the LeveledCompactionStrategy placement rules, as the output by definition covers a token range equal to or smaller than the source. The only drawback is if the output files are significantly smaller than the source, in which case the source level would be under-sized. But that seems like a problem LCS has to handle, not garbagecollect.
>
> LCS could have a "pull up" operation that works roughly as follows. Assume a table has L4 as its max level, and L3 and L4 are both under-sized. L3 can attempt to 'pull up' any tables from L4 that do not overlap the token ranges of the L3 tables. After that, it can choose to do some compactions that mix L3 and L4 to pull data up into L3 if it is still significantly under-sized.
>
> From what I can tell, garbagecollect should just re-write tables in place and leave the compaction strategy to deal with any consequences.
>
> Moving to L0 is a bad idea. In addition to the extra write amplification and the extreme increase in temporary disk space required, I observed the following: a 'nodetool garbagecollect' was placing a lot of pressure on L0 of a node. We stopped it about 20% of the way through, and the node managed to compact down the top couple of levels. So we ran 'garbagecollect' again, but the first tables it chose to operate on were in L1, not the 'leaves' in L5! This was because the order in which SSTables are chosen currently does not consider the level; it looks purely at the max timestamp in the file. But because the prior garbagecollect had moved _very old_ data from L5 into L0, many tables in L1 and L2 now had very wide ranges between their min and max timestamps – essentially some of the oldest and newest data all in one table. This breaks the usual structure of an LCS table, where the oldest data is in the highest levels.
>
> I hope that others agree that this is a bug and deserving of a fix. I have a very simple patch for this that I will be creating a PR for soon: 3 lines for the code change, 70 lines for a new unit test.
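To make the placement argument concrete, here is a small standalone sketch. It is plain Java with hypothetical stand-in types (TokenRange, SSTable, outputLevelProposed), not Cassandra's actual SSTableReader or compaction classes, and it is not the patch itself: it only illustrates that a garbage-collected rewrite drops data, so its token range is a subset of the source's, and keeping the source's level cannot break the LCS non-overlap invariant, whereas the current behaviour always returns level 0.

{code:java}
import java.util.List;

// Standalone illustration only: these types are hypothetical stand-ins,
// not Cassandra's real SSTable or LeveledManifest classes.
public class GarbageCollectLevelSketch
{
    // Token range covered by an SSTable (simplified to longs).
    record TokenRange(long min, long max)
    {
        boolean overlaps(TokenRange other)
        {
            return min <= other.max && other.min <= max;
        }
    }

    record SSTable(TokenRange range, int level) {}

    // Current behaviour: garbagecollect output always lands in L0.
    static int outputLevelCurrent(SSTable source)
    {
        return 0;
    }

    // Proposed behaviour: keep the source's level. The rewrite only drops
    // data, so the output's token range is a subset of the source's range
    // and it cannot overlap any other SSTable in that level.
    static int outputLevelProposed(SSTable source)
    {
        return source.level();
    }

    // LCS invariant: within a level above L0, token ranges must not overlap.
    static boolean invariantHolds(List<SSTable> level, SSTable candidate)
    {
        return level.stream().noneMatch(s -> s.range().overlaps(candidate.range()));
    }

    public static void main(String[] args)
    {
        SSTable source = new SSTable(new TokenRange(100, 200), 4);
        List<SSTable> restOfL4 = List.of(
                new SSTable(new TokenRange(0, 99), 4),
                new SSTable(new TokenRange(201, 300), 4));

        // Garbage collection only removes shadowed or expired data, so the
        // output covers at most the source's token range.
        SSTable output = new SSTable(new TokenRange(120, 190), outputLevelProposed(source));

        System.out.println("current behaviour would place output at L" + outputLevelCurrent(source));
        System.out.println("proposed behaviour places output at L" + output.level());
        System.out.println("L4 non-overlap invariant still holds: " + invariantHolds(restOfL4, output));
    }
}
{code}

Running the sketch shows the output staying at L4 with the level's non-overlap invariant intact, which is the property the report relies on when arguing that rewriting in place is safe.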
[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-16634:
-----------------------------------------
    Test and Documentation Plan: test included
                         Status: Patch Available  (was: Open)
[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-16634:
----------------------------------------
    Reviewers: Marcus Eriksson
[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-16634:
----------------------------------------
    Reviewers: Marcus Eriksson
       Status: Review In Progress  (was: Patch Available)
[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-16634:
----------------------------------------
    Status: Ready to Commit  (was: Review In Progress)
[jira] [Updated] (CASSANDRA-16634) Garbagecollect should not output all tables to L0 with LeveledCompactionStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-16634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson updated CASSANDRA-16634:
----------------------------------------
          Fix Version/s: 4.0, 4.0-rc2, 3.11.11  (was: 4.0.x, 3.11.x)
          Since Version: 3.10
    Source Control Link: https://github.com/apache/cassandra/commit/a68a7c5181930053f9b513672391b45088e590c4
             Resolution: Fixed
                 Status: Resolved  (was: Ready to Commit)

committed, test failures look unrelated