Dan Hendry created CASSANDRA-7890:
-------------------------------------

             Summary: LCS and time series data
                 Key: CASSANDRA-7890
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7890
             Project: Cassandra
          Issue Type: New Feature
          Components: Core
            Reporter: Dan Hendry


Consider the following very typical schema for bucketed time series data:

{noformat}
CREATE TABLE user_timeline (
        ts_bucket bigint,
        username varchar,
        ts timeuuid,
        data blob,
        PRIMARY KEY ((ts_bucket, username), ts));
{noformat}
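To make the bucketing concrete, here is a toy sketch (not Cassandra code; the bucket width and function name are illustrative) of how a writer might derive the {{ts_bucket}} partition-key component by truncating a timestamp to a fixed-width window:

```python
BUCKET_SECONDS = 86_400  # hypothetical one-day buckets

def ts_bucket(epoch_seconds: int, width: int = BUCKET_SECONDS) -> int:
    """Truncate a timestamp to its bucket boundary; used as the first
    partition-key component so one partition holds one user's events
    for one bucket."""
    return epoch_seconds - (epoch_seconds % width)

# Two events seconds apart land in the same partition:
t = 1_700_000_000
assert ts_bucket(t) == ts_bucket(t + 10)
```

With in-order inserts, successive writes therefore walk through partition keys monotonically, which is what makes the ordered-on-disk case below so cheap to compact.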

If you have a single Cassandra node (or a cluster where RF = N) and use the 
ByteOrderedPartitioner, LCS becomes *ridiculously*, *obscenely* efficient. 
Under a typical workload where data is inserted in order, compaction IO can 
be reduced to *near zero* because sstable ranges don't overlap (given a 
trivial change to LCS so that sstables with no overlap are not rewritten when 
being promoted into the next level). Better yet, ordered insertion is not 
_required_: even if insertion order is completely random, you still get 
standard LCS performance characteristics, which are usually acceptable 
(although I believe there are a few degenerate compaction cases the current 
implementation does not handle). A quick benchmark using vanilla Cassandra 
2.0.10 (i.e. without the rewrite optimization) shows a *77% reduction in 
compaction IO* when switching from the Murmur3Partitioner to the 
ByteOrderedPartitioner.
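The "trivial change" above boils down to a range-overlap test. A minimal sketch (function and type names are hypothetical, not from the Cassandra codebase) of deciding whether an sstable can simply be moved into the next level instead of being merged and rewritten:

```python
from typing import List, Tuple

Range = Tuple[bytes, bytes]  # (first_key, last_key) of an sstable

def overlaps(a: Range, b: Range) -> bool:
    """True when two sstable key ranges intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def can_promote_without_rewrite(candidate: Range,
                                next_level: List[Range]) -> bool:
    """With key-ordered sstables and in-order inserts, a newly flushed
    sstable's range typically sits past every range already in the next
    level, so promotion is a file move rather than a compaction."""
    return not any(overlaps(candidate, r) for r in next_level)

level1 = [(b"a", b"f"), (b"g", b"m")]
print(can_promote_without_rewrite((b"n", b"z"), level1))  # True: past both
print(can_promote_without_rewrite((b"e", b"h"), level1))  # False: overlaps
```

Under ordered insertion every flush produces a range beyond the previous one, so the check passes almost every time and compaction IO collapses toward zero.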

The obvious problem is, of course, that using an order-preserving partitioner 
is a Very Bad idea when N > RF. Using an OPP for time series data ordered by 
time is utter lunacy.

It seems to me that one solution is to split apart the roles of the 
partitioner so that data distribution across the cluster and data ordering on 
disk can be controlled independently. Ideally, on-disk ordering could be set 
per CF. I'm curious about the historical choice to order data on disk by 
token rather than by key. Randomized (hashed-key-ordered) distribution across 
the cluster is obviously a good idea, but natural key order on disk seems 
like it would have a number of advantages:

* Better read performance and file system page cache efficiency for any 
workload that accesses certain ranges of row keys more frequently than others 
(this applies to _many_ use cases beyond time series data).
* I can't think of a realistic workload where CRUD operations would be 
noticeably less performant using natural rather than hash ordering.
* Better compression ratios (although probably only for skinny rows).
* Range based truncation becomes feasible.
* Ordered range scans might be feasible to implement even with random cluster 
distribution.
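One way to picture the proposed split is a toy model (MD5 stands in for Murmur3 here purely for brevity; all names are hypothetical) where the *hash* of the key picks the replica but each replica keeps its rows in *natural key* order:

```python
import bisect
import hashlib

def placement_token(key: bytes) -> int:
    """Distribution role: hash the key so partitions spread evenly
    across the cluster, independent of key order."""
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")

class Replica:
    """Storage role: keep rows sorted by natural key, not by token,
    so adjacent time buckets stay adjacent on disk."""
    def __init__(self) -> None:
        self.keys: list[bytes] = []

    def insert(self, key: bytes) -> None:
        bisect.insort(self.keys, key)

nodes = [Replica() for _ in range(3)]
for day in (b"2014-09-01", b"2014-09-02", b"2014-09-03"):
    nodes[placement_token(day) % 3].insert(day)

# Every replica's on-disk layout is key-ordered regardless of token order:
assert all(n.keys == sorted(n.keys) for n in nodes)
```

The point of the sketch is only that the two decisions are separable: which node owns a partition says nothing about where that partition's rows sit relative to their neighbours in the data files.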

The only things I can think of that could suffer when cluster order and disk 
order differ are bootstrap and repair. Although I have no evidence either 
way, the massive potential performance gains still seem to be worth it.

Thoughts? This approach seems to be fundamentally different from other tickets 
related to improving time series handling (CASSANDRA-6602, CASSANDRA-5561), 
which focus only on new or modified compaction strategies. By changing the 
on-disk sort order, existing compaction strategies can be made significantly 
more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
