Tupshin Harper created CASSANDRA-6602:
-----------------------------------------

             Summary: Enhancements to optimize for the storing of time series data
                 Key: CASSANDRA-6602
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6602
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Tupshin Harper
             Fix For: 3.0


There are some characteristics common to many/most time series use cases that 
pose challenges, but also open up unique opportunities for optimization.

One of the major challenges is compaction. The existing compaction strategies 
tend to re-compact data on disk at least a few times over the lifespan of each 
data point, greatly increasing the CPU and I/O cost of each write.

Compaction exists to
1) ensure that there aren't too many files on disk
2) ensure that data that should be contiguous (part of the same partition) is 
laid out contiguously
3) delete data due to TTLs or tombstones

The special characteristics of time series data allow us to optimize away all 
three.

Time series data
1) tends to be delivered in time order, with relatively constrained exceptions
2) often has a pre-determined and fixed expiration date
3) never gets deleted prior to its TTL
4) has relatively predictable ingestion rates
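
To make that pattern concrete, here is a minimal sketch of the sort of table 
this is aimed at (keyspace, table, and column names are purely illustrative), 
using the per-CF default TTL from CASSANDRA-3974 to express the fixed 
expiration:

{code}
CREATE TABLE metrics.readings (
    sensor_id  uuid,
    event_time timestamp,
    value      double,
    PRIMARY KEY (sensor_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND default_time_to_live = 86400000;  -- 1000 days, expressed in seconds
{code}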

Note that I filed CASSANDRA-5561, and this ticket potentially replaces it or 
reduces the need for it. In that ticket, jbellis reasonably asks how such a 
compaction strategy is any better than simply disabling compaction.

Taking that to heart, here is a compaction-strategy-less approach that could be 
extremely efficient for time-series use cases that follow the above pattern.

(For context, I'm thinking of an example use case involving lots of streams of 
time-series data with a 5GB per day ingestion rate and a 1000 day retention 
enforced via TTL, resulting in an eventual steady state of 5TB per node.)

1) You have an extremely large memtable (preferably off-heap, if/when doable) 
for the table, and that memtable is sized to be able to hold a lengthy window 
of time. A typical period might be one day. At the end of that period, you 
flush the contents of the memtable to an sstable and move on to the next one. 
This is basically identical to current behaviour, but with thresholds adjusted 
so that you can ensure flushing at predictable intervals. (An open question is 
whether predictable intervals are actually necessary, or whether just waiting 
until the huge memtable is nearly full is sufficient.)
2) Combine that behaviour with CASSANDRA-5228 so that sstables will be 
efficiently dropped once all of their columns have expired (see the 
configuration sketch after this list). (Another side note: it might be 
valuable to have a modified version of CASSANDRA-3974 that doesn't bother 
storing per-column TTLs, since it is required that all columns have the same 
TTL.)
3) Be able to mark column families as read/write only (no explicit deletes), 
so that no tombstones are ever created.
4) Optionally add back an additional type of delete that would delete all data 
earlier than a particular timestamp, resulting in immediate dropping of 
obsoleted sstables.
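
For points 1 and 2, the example configuration could be approximated with 
existing table options (assuming memtable_flush_period_in_ms and 
default_time_to_live behave as documented); the oversized, long-lived memtable 
itself and points 3 and 4 would still need new machinery. Against the 
illustrative table above:

{code}
ALTER TABLE metrics.readings
  WITH memtable_flush_period_in_ms = 86400000  -- force a flush roughly once a day
   AND default_time_to_live = 86400000;        -- expire every column after 1000 days
{code}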

The result is that, for in-order delivered data, every cell will be laid out 
optimally on disk on the first pass, and over the course of 1000 days and 5TB 
of data there will "only" be 1000 5GB sstables, so the number of file handles 
will remain reasonable.

For exceptions (out-of-order delivery), most cases will be caught by the 
extended (24 hour+) memtable flush times and merged correctly automatically. 
For data that was slightly askew at flush time, or was delivered so far out of 
order that it lands in the wrong sstable, the overhead of reading a time slice 
from two sstables instead of one is relatively low, and it would be incurred 
relatively rarely. If out-of-order delivery were the common case, this 
strategy should not be used in the first place.
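
As an illustration, a typical time-slice read against the table sketched above 
would look like the following; under this scheme it would normally be served 
from the single sstable covering that day, and from two sstables when some of 
the data arrived badly out of order:

{code}
SELECT event_time, value
  FROM metrics.readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426655440000
   AND event_time >= '2014-01-20 00:00:00+0000'
   AND event_time <  '2014-01-20 01:00:00+0000';
{code}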

Another possible optimization to address out-of-order delivery would be to 
maintain more than one time-centric memtable in memory at a time (e.g. two 12 
hour ones), always inserting into whichever of the two "owns" the appropriate 
range of time. By delaying the flush of the older one until we are ready to 
roll writes over to a third, we are able to avoid any fragmentation as long as 
all deliveries come in no more than 12 hours late (in this example; the window 
would presumably be tunable).

Anything that triggers compactions will have to be looked at, since there 
won't be any. The one concern I have is the ramifications for repair. 
Initially, at least, I think it would be acceptable to just write one sstable 
per repair and not bother trying to merge it with other sstables.


