[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173508#comment-14173508 ] Marcus Eriksson commented on CASSANDRA-6602: https://github.com/krummas/cassandra/commits/bj0rn/6602-2.0 Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173898#comment-14173898 ] Björn Hegerfors commented on CASSANDRA-6602: That's nice! Let's just hope nobody wants to use timestamps that are not any of the common time units. I guess 'base time' is vague, but it shouldn't be hard to understand what it means. Just a single sentence should be able to describe what it's about. I'm pretty happy with it. I don't know if anyone has a better idea. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172990#comment-14172990 ] Björn Hegerfors commented on CASSANDRA-6602: [~krummas] Yep, I also think of seconds/minutes/etc. when I hear 'time unit'. I agree time_unit is not the best name for what it ended up being. I guess I've tested time_unit with values like 60,000,000 (minute) and 3,600,000,000 (hour), and then timeUnit made sense in the code, when setting the initial target size to 1 of whatever the 'time unit' is (e.g. 1 hour). But if you use something like 300,000,000 (5 minutes), calling that a 'time unit' is a bit iffy. It feels like 'minute' is the time unit, and 5 some multiplier on how big the initial target is (so 5 minute units, rather than one 5-minute unit). Then you noted the other problem, that can cause even more confusion. The potentially differing timestamp formats is probably a harder nut to crack, put in relation with the naming. But the solution to it probably also affects the number of options and their names. You could even go with three options replacing time_unit: timestamp_resolution, time_unit, base_time. Then, for 5 minutes with microsecond timestamps, you would specify (with longs or strings, I don't know) timestamp_resolution=1,000,000 (microseconds), time_unit=60 (minutes), base_time=5. The time_unit that my code uses is simply the product of these options. What remains, as far as I can see, is a 2-option solution: remove the middle option (time_unit) and default to minutes or seconds (I personally think seconds are nicer). Or even microseconds, but then base_time (or base_time_microseconds) has to be a big number and the product should be timestamp_resolution * base_time / 1,000,000. The latter choice is only there if someone believes that sub-second targets could be useful. Regardless, I think it's important to make it clear to all users that they have to make sure that the 'timestamp resolution' is correct. One way would be to default to microseconds and simply put a visible warning in the documentation about DTCS that it expects microseconds, so if you use something else, you need to change the timestamp_resolution option. I don't know if Cassandra somehow prefers microseconds or if it likes to stay neutral and not estrange those who use something else. But CQL has that default, doesn't it? Another way to prevent bugs caused by non microsecond timestamps would be to simply require a timestamp_resolution to be defined. No defaults. I don't know if I've helped narrowing anything down here, but these are all the alternatives that I can think of. Without knowing what conventions there might be to things like this, my preferred choice right now is probably these options: long timestamp_resolution (default: 1,000,000), long base_time_seconds (default: 3600? 300?). That and the warning in the documentation, somewhere visible. My time to ask: wdyt? Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173399#comment-14173399 ] Marcus Eriksson commented on CASSANDRA-6602: how about: timestamp_resolution = 'MICROSECONDS' (and use TimeUnit.valueOf(...) to convert) base_time_seconds = 3600 max_sstable_age_days = 365 Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169137#comment-14169137 ] Marcus Eriksson commented on CASSANDRA-6602: [~Bj0rn] i think the resolution in time_unit is a bit to high, wdyt about making it in minutes instead? And, perhaps renaming it to something along the lines of 'base_time_minutes'? (or do you have a better suggestion?). A 'time unit' to me is usually second/minute/etc. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169173#comment-14169173 ] Marcus Eriksson commented on CASSANDRA-6602: bq. wdyt about making it in minutes instead? or maybe not, since people can use non microsecond timestamps... Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 2.0.11 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14165082#comment-14165082 ] Jonathan Ellis commented on CASSANDRA-6602: --- I would have gone even further and said that since it's just a new strategy I'd be fine with letting people experiment even in 2.0. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160131#comment-14160131 ] Marcus Eriksson commented on CASSANDRA-6602: (epic writeup, hope you can reuse most of it in your thesis report) bq. if you run DTCS from the start, it won't matter what timestamp you chose as the representative timestamp of an SSTable. I think we should have a way for people to go from LCS/STCS to DTCS, probably using CASSANDRA-7019 - ie change to DTCS, run a 7019-major compaction and you should have your data date tiered. bq. The current implementation favors compacting and recompacting them as much as possible for the first hour (but not if they're less than min_compaction_threshold). I think this is the way to go, I imagine one of the most common queries to be give me the last hour of data, keeping a limit to the amount of sstables there should be beneficial. We should also note that currently the newest data will always be size tiered due to the way incremental repairs are handled, I hope to change this in CASSANDRA-8004, then we will run 2 instances of the compaction strategy and move sstables between them instead of running DTCS only for repaired data (and STCS for unrepaired data, which should be the newest set of sstables). I'd say we should put this in 2.1 since it doesn't touch anything outside of the compaction strategy, WDYT [~jbellis] ? Maybe label is as 'experimental' if we want to hint to users that they should test a bit more carefully etc. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158131#comment-14158131 ] Björn Hegerfors commented on CASSANDRA-6602: OK, so a writeup on some implementation details of this strategy might be in place. Maybe there's a better place to put something as long as this. But first, thanks for the patch Marcus! Everything looks good. Now, Marcus mentioned worried about sliding windows as far as SSTable candidate selection goes. Or at least that's what it sounded like. That was indeed one of my main concerns when I started implementing DTCS. After all, time is constantly moving, so how do we avoid this situation where some SSTables compact and then immediately slide over into some SSTables older than x time window and need to compact again. Or maybe before the first compaction, a few of the oldest of those SSTables managed to slip over already, causing the window for bigger SSTables to reach more than min_compaction_threshold and triggering a compaction with mixed size SSTables. Or something else like that. It can easily get messy, so I aimed for an algorithm with as little of a sliding windows effect as possible. What I came up with is really all summed up by the Target class. A target represents a time span between Unix epoch* and now. An SSTable is on target if its bmin/b timestamp is within a given target's time span. This is the first decision that I would like some input on. My thinking was that if there's a big SSTable with timestamps ranging from old to new, picking bmax/b timestamp would've made this data participate in too frequent compactions, because SSTables considered new compact more often than those considered old. STCS creates these kinds of SSTables, if you run DTCS from the start, it won't matter what timestamp you chose as the representative timestamp of an SSTable. So choosing bmin/b timestamp was for improved compatibility with STCS, essentially. How these targets are chosen is the most interesting part. I tend to think targets backwards in time. The first target is the one covering most recent timestamps. The most naive idea would be to let the first target be the last hour, the next target the hour before, and so on for 24 hours. Then target #25 would be the day from 24 hours ago to 48 hours ago. 6 of those would get us a week back. Then the same thing for months and years, etc. This approach has a number of issues. First, it needs a concept of a calendar that I would rather not hard code as everybody won't want it that way. Plus, it would compact by very varying factors like 24, 7 and 4 (but a month is not exactly 4 weeks, except Fabruary, sometimes... ugh!). So I quickly ditched the calendar for a configurable factor, called base, and an initial target size that I call time_unit. You could set base to 3 and timeUnit to an hour for example, which should be expressed in microseconds (if microseconds is what your timestamps use. I recommend sticking with microseconds, which appears to be a CQL standard. Timestamp inconsistencies mess with DTCS, which I've involuntarily experienced in my tests). Then the calendar-less version of the above would pick last hour as the initial target and 2 more for the hours before that. Then before that, 2 3-hour targets, then 2 9-hour targets, etc. I then noticed that base mostly duplicated the role of min_compaction_threshold, so I removed it in favor of that. Any input on that choice? The above solution is similar to my final algorithm, but if you noticed my wording, you'll see that the above suffers from the sliding windows issue. It's most probably a bad idea to use something like last hour or between 18 hours ago and 9 hours ago as targets, since those would be moving targets. Some integer arithmetic fit perfectly as a way around this. Let now be the current time (I get it by taking the greatest max timestamp across all SSTables, any input on that?). Let's keep timeUnit at an hour and base (min_compaction_threshold) at 3. Using integer division, now/timeUnit will yield the same value for an hour. Now we can use that as our first target. Unless we recently passed an hour threshold, (now - 20 seconds)/timeUnit will have the same result. We can't simply use the old approach of making 2 single-hour targets before that, etc. because that will just lead to the same sliding windows, but with considerably lower resolution (certainly better than before, though). Rather, consider what happens if you integer divide this initial target with base. Then we get a new target representing the 3 hour time span, perfectly aligned on 3 hour borders, that contains the current hour. The current hour could be either the 1st, 2nd or 3rd one of those hours (nothing in-between). You can continue like that, dividing by base again to get the perfectly aligned 9 hour time
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158177#comment-14158177 ] Jonathan Ellis commented on CASSANDRA-6602: --- Still reading, but thanks for the fantastic writeup! Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158316#comment-14158316 ] Jeremiah Jordan commented on CASSANDRA-6602: bq A side effect of this is that if min_compaction_threshold is, say 4, you might have 1 big and 2 small SSTables from the last hour, the moment time (more specifically, the now variable) crosses over to a new hour, then it will stay uncompacted like that until those SSTables enter a bigger target. At that point, what you would have expected to be 4 similarly sized SSTables in that new target would rather be 3 similarly sized ones, 1 a tad smaller, and 2 small ones (actually the same thing is likely to have happened during other hours too). But after that compaction happens, everything is like it should be**. I don't believe that it's a big deal. There are a couple ways around it. The (now/timeUnit) - 1 initial target approach fixes this. Another way would be to ignore min_compaction_threshold for anything beyond how I use it as a base, or keeping base and min_compaction_threshold separate, letting you set min_compaction_threshold to 2. Does anyone have any ideas about this? I think using the min_compaction_threshold in the current area is good, so you basically do STCS until its been long enough to bucket the sstable. Then once something is in a time bucket, treat it as min_compaction_threshold=2 so that each bucket only has one sstable in it. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp,
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158394#comment-14158394 ] Björn Hegerfors commented on CASSANDRA-6602: Yes, forcing all buckets but the current one to always contain a single SSTable (unless compactions fall behind) sounds good. It sort of raises the question of how multiple SSTables can end up in the same non-current target anyway. Like repairs and hinted handoff. Should we still always compact? We wouldn't want to compact a huge SSTable with some small new chunk too often. I suppose that's unlikely to happen (maybe never?) as the biggest SSTables are the oldest, and not much will be going on with them. Then you raise an idea that I haven't thought about before. You mention doing STCS in the current bucket. That's not what DTCS does now. Rather, it just compacts as soon as min_compaction_threshold SSTables are in there, regardless of any other criteria. In a sense, the time_unit option mirrors the behavior of the min_sstable_size option in STCS, where anything smaller than that goes into the same bucket regardless of actual size similarity. Of course, this means that time_unit has to be chosen properly. I'd say you can't really go wrong with too small, but too large might be bad. I believe that you should pick time_unit by the same criteria that you would pick min_sstable_size. If chosen properly, you won't really have a reason to do STCS in the current bucket. It's hard to set a default for time_unit, since it depends very much on how fast data comes in. Essentially, you can view time_unit as the age below which SSTables get compacted eagerly. Maybe there's a better name for it, like the obvious min_sstable_age? Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14123107#comment-14123107 ] Dan Hendry commented on CASSANDRA-6602: --- A somewhat different approach to optimizing time series data: CASSANDRA-7890 Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14118034#comment-14118034 ] Marcus Eriksson commented on CASSANDRA-6602: pushed an update version here: https://github.com/krummas/cassandra/commits/bj0rn/6602-trunk with a few small updates * rebased to trunk (need to handle repaired/unrepaired data) * fixed a few issues (getting compaction candidates needs to be synchronized) wdyt? I'd like to do a bit more testing before committing it, I'll try to get that started this week Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, TimestampViewer.java, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14098880#comment-14098880 ] Robert Coli commented on CASSANDRA-6602: For historical background, Kelvin Kakugawa did a basic version of the DateTieredCompationStrategy, at Digg, for a timeline product, in 2012 or so. It was successful in the limited use it saw there, especially when it dropped entire SSTables on the floor when they no longer contained data we cared about. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1405#comment-1405 ] Eric Evans commented on CASSANDRA-6602: --- Has there been any further discussion about where this should land (2.0, 2.1, 3.0)? 3.0.0 seems too conservative for a compaction strategy implementation. [~Bj0rn]: do you have any results from your production testing you can share? Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049237#comment-14049237 ] Eric Evans commented on CASSANDRA-6602: --- I like this; Let me know if I can be of help (review, testing, etc). Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any fragmentation as long as all deliveries come in no more than 12 hours
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14029075#comment-14029075 ] Björn Hegerfors commented on CASSANDRA-6602: No, not yet. I plan to have a first draft of the report done by the end of this month. Right now, we are testing the strategy in production using write survey mode. We'll let it run for another week. That is going to produce the most real-world data for me to analyze. Also, I uploaded yet another version now, because when I said that I fixed a silly thing with default options, the silly part was making that change. Defaults are back to microseconds, as they obviously should be. It was my own tests confusing me by writing custom timestamps in milliseconds. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14028687#comment-14028687 ] Jonathan Ellis commented on CASSANDRA-6602: --- [~Bj0rn] is your thesis available as well? Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any fragmentation as long as all deliveries come in no more than 12 hours late (in this example, presumably tunable). Anything that triggers compactions
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14027378#comment-14027378 ] Jonathan Ellis commented on CASSANDRA-6602: --- Is this on your list, [~krummas]? Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By delaying flushing the ahead one until we are ready to roll writes over to a third one, we are able to avoid any fragmentation as long as all deliveries come in no more than 12 hours late (in this example, presumably tunable). Anything that triggers compactions will have to
[jira] [Commented] (CASSANDRA-6602) Compaction improvements to optimize time series data
[ https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1401#comment-1401 ] Tupshin Harper commented on CASSANDRA-6602: --- Two comments: 1) Promising solution that I'd love to see validated and backported to at least 2.1, and if at all possible, all the way to 2.0.x 2) I don't want to end up closing the issue and losing track of the approaches benedict and I were talking about, so one or the other should become a new ticket. Compaction improvements to optimize time series data Key: CASSANDRA-6602 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602 Project: Cassandra Issue Type: New Feature Components: Core Reporter: Tupshin Harper Assignee: Björn Hegerfors Labels: compaction, performance Fix For: 3.0 Attachments: cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt There are some unique characteristics of many/most time series use cases that both provide challenges, as well as provide unique opportunities for optimizations. One of the major challenges is in compaction. The existing compaction strategies will tend to re-compact data on disk at least a few times over the lifespan of each data point, greatly increasing the cpu and IO costs of that write. Compaction exists to 1) ensure that there aren't too many files on disk 2) ensure that data that should be contiguous (part of the same partition) is laid out contiguously 3) deleting data due to ttls or tombstones The special characteristics of time series data allow us to optimize away all three. Time series data 1) tends to be delivered in time order, with relatively constrained exceptions 2) often has a pre-determined and fixed expiration date 3) Never gets deleted prior to TTL 4) Has relatively predictable ingestion rates Note that I filed CASSANDRA-5561 and this ticket potentially replaces or lowers the need for it. In that ticket, jbellis reasonably asks, how that compaction strategy is better than disabling compaction. Taking that to heart, here is a compaction-strategy-less approach that could be extremely efficient for time-series use cases that follow the above pattern. (For context, I'm thinking of an example use case involving lots of streams of time-series data with a 5GB per day ingestion rate, and a 1000 day retention with TTL, resulting in an eventual steady state of 5TB per node) 1) You have an extremely large memtable (preferably off heap, if/when doable) for the table, and that memtable is sized to be able to hold a lengthy window of time. A typical period might be one day. At the end of that period, you flush the contents of the memtable to an sstable and move to the next one. This is basically identical to current behaviour, but with thresholds adjusted so that you can ensure flushing at predictable intervals. (Open question is whether predictable intervals is actually necessary, or whether just waiting until the huge memtable is nearly full is sufficient) 2) Combine the behaviour with CASSANDRA-5228 so that sstables will be efficiently dropped once all of the columns have. (Another side note, it might be valuable to have a modified version of CASSANDRA-3974 that doesn't bother storing per-column TTL since it is required that all columns have the same TTL) 3) Be able to mark column families as read/write only (no explicit deletes), so no tombstones. 4) Optionally add back an additional type of delete that would delete all data earlier than a particular timestamp, resulting in immediate dropping of obsoleted sstables. The result is that for in-order delivered data, Every cell will be laid out optimally on disk on the first pass, and over the course of 1000 days and 5TB of data, there will only be 1000 5GB sstables, so the number of filehandles will be reasonable. For exceptions (out-of-order delivery), most cases will be caught by the extended (24 hour+) memtable flush times and merged correctly automatically. For those that were slightly askew at flush time, or were delivered so far out of order that they go in the wrong sstable, there is relatively low overhead to reading from two sstables for a time slice, instead of one, and that overhead would be incurred relatively rarely unless out-of-order delivery was the common case, in which case, this strategy should not be used. Another possible optimization to address out-of-order would be to maintain more than one time-centric memtables in memory at a time (e.g. two 12 hour ones), and then you always insert into whichever one of the two owns the appropriate range of time. By