[ 
https://issues.apache.org/jira/browse/CASSANDRA-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14158316#comment-14158316
 ] 

Jeremiah Jordan commented on CASSANDRA-6602:
--------------------------------------------

bq. A side effect of this is that if min_compaction_threshold is, say, 4, you
might have 1 big and 2 small SSTables from the last hour the moment time (more
specifically, the "now" variable) crosses over to a new hour, and they will
stay uncompacted like that until those SSTables enter a bigger target. At that
point, what you would have expected to be 4 similarly sized SSTables in that
new target would instead be 3 similarly sized ones, 1 a tad smaller, and 2
small ones (the same thing is likely to have happened during other hours too).
But after that compaction happens, everything is as it should be. I don't
believe it's a big deal, and there are a couple of ways around it. The
(now/timeUnit) - 1 initial target approach fixes this. Another way would be to
ignore min_compaction_threshold for anything beyond how I use it as a "base",
or to keep base and min_compaction_threshold separate, letting you set
min_compaction_threshold to 2. Does anyone have any ideas about this?

I think using the min_compaction_threshold in the "current" area is good, so 
you basically do STCS until it's been long enough to "bucket" the sstable.  
Then once something is in a time bucket, treat it as min_compaction_threshold=2 
so that each bucket ends up with only one sstable in it.
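To make that concrete, here is a minimal sketch of that selection rule (made-up
class and field names, not the attached patch): sstables are grouped into time
buckets by their max timestamp, the "current" bucket waits for
min_compaction_threshold like STCS would, and any settled bucket compacts as
soon as it holds 2 sstables.

{code}
import java.util.*;

// Sketch of the suggested selection rule; not the attached patch.
class TimeBucketedCandidates
{
    // Hypothetical per-sstable summary holding just what this sketch needs.
    static class SSTableInfo
    {
        final String name;
        final long maxTimestampMicros;

        SSTableInfo(String name, long maxTimestampMicros)
        {
            this.name = name;
            this.maxTimestampMicros = maxTimestampMicros;
        }
    }

    // Returns one bucket of sstables worth compacting together, or an empty list.
    static List<SSTableInfo> pickBucket(Collection<SSTableInfo> sstables,
                                        long nowMicros,
                                        long bucketSizeMicros,
                                        int minCompactionThreshold)
    {
        long currentBucket = nowMicros / bucketSizeMicros;
        Map<Long, List<SSTableInfo>> buckets = new TreeMap<>();
        for (SSTableInfo s : sstables)
            buckets.computeIfAbsent(s.maxTimestampMicros / bucketSizeMicros,
                                    k -> new ArrayList<>()).add(s);

        for (Map.Entry<Long, List<SSTableInfo>> e : buckets.entrySet())
        {
            // The "current" bucket behaves like STCS: wait for min_compaction_threshold.
            // Older buckets are treated as threshold 2, so each collapses toward
            // a single sstable.
            int threshold = e.getKey() == currentBucket ? minCompactionThreshold : 2;
            if (e.getValue().size() >= threshold)
                return e.getValue();
        }
        return Collections.emptyList();
    }
}
{code}

The TreeMap just makes iteration oldest-bucket-first; the real strategy can
prefer whatever ordering it wants.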

> Compaction improvements to optimize time series data
> ----------------------------------------------------
>
>                 Key: CASSANDRA-6602
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Björn Hegerfors
>              Labels: compaction, performance
>             Fix For: 3.0
>
>         Attachments: 1 week.txt, 8 weeks.txt, STCS 16 hours.txt, 
> TimestampViewer.java, 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy.txt, 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v2.txt, 
> cassandra-2.0-CASSANDRA-6602-DateTieredCompactionStrategy_v3.txt
>
>
> There are some unique characteristics of many/most time series use cases 
> that both pose challenges and provide unique opportunities for 
> optimization.
> One of the major challenges is compaction. The existing compaction 
> strategies tend to re-compact data on disk at least a few times over the 
> lifespan of each data point, greatly increasing the CPU and I/O cost of 
> each write.
> Compaction exists to
> 1) ensure that there aren't too many files on disk
> 2) ensure that data that should be contiguous (part of the same partition) is 
> laid out contiguously
> 3) delete data due to TTLs or tombstones
> The special characteristics of time series data allow us to optimize away all 
> three.
> Time series data
> 1) tends to be delivered in time order, with relatively constrained exceptions
> 2) often has a pre-determined and fixed expiration date
> 3) never gets deleted prior to its TTL
> 4) has relatively predictable ingestion rates
> Note that I filed CASSANDRA-5561, and this ticket potentially replaces or 
> lowers the need for it. In that ticket, jbellis reasonably asks how that 
> compaction strategy is better than disabling compaction.
> Taking that to heart, here is a compaction-strategy-less approach that could 
> be extremely efficient for time-series use cases that follow the above 
> pattern.
> (For context, I'm thinking of an example use case involving lots of streams 
> of time-series data with a 5GB per day ingestion rate, and a 1000 day 
> retention with TTL, resulting in an eventual steady state of 5TB per node)
> 1) You have an extremely large memtable (preferably off heap, if/when doable) 
> for the table, and that memtable is sized to be able to hold a lengthy window 
> of time. A typical period might be one day. At the end of that period, you 
> flush the contents of the memtable to an sstable and move to the next one. 
> This is basically identical to current behaviour, but with thresholds 
> adjusted so that you can ensure flushing at predictable intervals. (An open 
> question is whether predictable intervals are actually necessary, or whether 
> just waiting until the huge memtable is nearly full is sufficient.)
> 2) Combine that behaviour with CASSANDRA-5228 so that sstables will be 
> efficiently dropped once all of their columns have expired. (Another side 
> note: it might be valuable to have a modified version of CASSANDRA-3974 that 
> doesn't bother storing per-column TTLs, since all columns are required to 
> have the same TTL.)
> 3) Be able to mark column families as read/write only (no explicit deletes), 
> so no tombstones.
> 4) Optionally add back an additional type of delete that would delete all 
> data earlier than a particular timestamp, resulting in immediate dropping of 
> obsoleted sstables.
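Point 2 above (dropping sstables wholesale once everything in them has expired)
becomes a very cheap check when every column shares one fixed TTL. A minimal
sketch, with made-up names rather than Cassandra's internal API:

{code}
import java.util.concurrent.TimeUnit;

class ExpiredSSTableCheck
{
    // With a single fixed TTL for all columns, an sstable can be dropped
    // wholesale once even its newest cell has passed its expiration time.
    // maxTimestampMicros is the newest write timestamp in the sstable.
    static boolean fullyExpired(long maxTimestampMicros, int ttlSeconds, long nowMicros)
    {
        long expiresAtMicros = maxTimestampMicros + TimeUnit.SECONDS.toMicros(ttlSeconds);
        return expiresAtMicros < nowMicros;
    }
}
{code}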
> The result is that for in-order delivered data, every cell will be laid out 
> optimally on disk on the first pass, and over the course of 1000 days and 5TB 
> of data, there will "only" be 1000 5GB sstables, so the number of filehandles 
> will be reasonable.
> For exceptions (out-of-order delivery), most cases will be caught by the 
> extended (24 hour+) memtable flush times and merged correctly automatically. 
> For those that were slightly askew at flush time, or were delivered so far 
> out of order that they end up in the wrong sstable, there is relatively low 
> overhead to reading from two sstables for a time slice instead of one, and 
> that overhead would be incurred relatively rarely unless out-of-order 
> delivery was the common case, in which case this strategy should not be used.
> Another possible optimization to address out-of-order delivery would be to 
> maintain more than one time-centric memtable in memory at a time (e.g. two 
> 12-hour ones), and always insert into whichever of the two "owns" the 
> appropriate range of time. By delaying the flush of the older one until we 
> are ready to roll writes over to a third one, we can avoid any fragmentation 
> as long as all deliveries come in no more than 12 hours late (in this 
> example; presumably tunable).
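A rough sketch of that two-memtable routing (hypothetical names; the real
memtable lifecycle is obviously more involved): writes are routed by timestamp
to whichever window owns them, and the older window is flushed only when a
third window opens.

{code}
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch only: two memtables, each owning a 12-hour window of write timestamps.
class WindowedMemtables
{
    static final long WINDOW_MICROS = 12L * 60 * 60 * 1_000_000; // 12 hours

    // Each "memtable" here is just a sorted map from write timestamp to value.
    private ConcurrentSkipListMap<Long, byte[]> older = new ConcurrentSkipListMap<>();
    private ConcurrentSkipListMap<Long, byte[]> newer = new ConcurrentSkipListMap<>();
    private long newerWindowStart; // inclusive start of the newer window

    WindowedMemtables(long nowMicros)
    {
        newerWindowStart = (nowMicros / WINDOW_MICROS) * WINDOW_MICROS;
    }

    synchronized void write(long timestampMicros, byte[] value)
    {
        // Time rolled into a third window: flush the oldest memtable and rotate.
        while (timestampMicros >= newerWindowStart + WINDOW_MICROS)
        {
            flush(older);
            older = newer;
            newer = new ConcurrentSkipListMap<>();
            newerWindowStart += WINDOW_MICROS;
        }
        // Late data (up to one window behind) still lands in the older memtable,
        // so it gets merged before flushing instead of fragmenting across sstables.
        if (timestampMicros < newerWindowStart)
            older.put(timestampMicros, value);
        else
            newer.put(timestampMicros, value);
    }

    private void flush(ConcurrentSkipListMap<Long, byte[]> memtable)
    {
        // Placeholder: write the contents out as one sstable for its time window.
    }
}
{code}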
> Anything that triggers compactions will have to be looked at, since there 
> won't be any. The one concern I have is the ramifications of repair. 
> Initially, at least, I think it would be acceptable to just write one sstable 
> per repair and not bother trying to merge it with other sstables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
