[ 
https://issues.apache.org/jira/browse/CASSANDRA-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123668#comment-14123668
 ] 

Jonathan Ellis commented on CASSANDRA-7890:
-------------------------------------------

bq. I'm curious about the historical choice to order data on disk by token and 
not key.

That way, adding new nodes means you stream contiguous ranges.
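The contiguous-streaming point can be sketched outside of Cassandra. Below is a toy model, not Cassandra code: the key set, the token range, and the MD5-based hash (a stand-in for a partitioner such as Murmur3) are all illustrative assumptions.

```python
# Toy model: why sorting sstable rows by token makes bootstrap streaming
# contiguous. The hash, keys, and token range are illustrative stand-ins.
import hashlib

def token(key: str) -> int:
    # Stand-in for a partitioner hash such as Murmur3 (not Cassandra's real one).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user{i}" for i in range(1000)]

# On-disk order Cassandra uses: sorted by token.
by_token = sorted(keys, key=token)
# Hypothetical alternative discussed in the ticket: sorted by natural key.
by_key = sorted(keys)

# A new node takes over one token range [lo, hi).
toks = sorted(token(k) for k in keys)
lo, hi = toks[300], toks[600]
wanted = {k for k in keys if lo <= token(k) < hi}

def positions(order):
    # File offsets (row positions) at which the wanted rows sit.
    return [i for i, k in enumerate(order) if k in wanted]

tok_pos, key_pos = positions(by_token), positions(by_key)

# Token order: the rows for the range form one contiguous run of the file.
assert tok_pos == list(range(tok_pos[0], tok_pos[0] + len(tok_pos)))
# Natural-key order: the same rows are scattered across the whole file.
assert key_pos != list(range(key_pos[0], key_pos[0] + len(key_pos)))
```

Under token ordering the streamed range is a single sequential read; under natural-key ordering the same handoff would touch rows spread across every sstable.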

> LCS and time series data
> ------------------------
>
>                 Key: CASSANDRA-7890
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7890
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>             Fix For: 3.0
>
>
> Consider the following very typical schema for bucketed time series data:
> {noformat}
> CREATE TABLE user_timeline (
>       ts_bucket bigint,
>       username varchar,
>       ts timeuuid,
>       data blob,
>       PRIMARY KEY ((ts_bucket, username), ts))
> {noformat}
> If you have a single Cassandra node (or a cluster where RF = N) and use the 
> ByteOrderedPartitioner, LCS becomes *ridiculously*, *obscenely* efficient. 
> Under a typical workload where data is inserted in order, compaction IO could 
> be reduced to *near zero*, since sstable ranges don't overlap (with a trivial 
> change to LCS so that sstables with no overlap are not rewritten when being 
> promoted into the next level). Better yet, we don't _require_ ordered data 
> insertion. Even if insertion order is completely random, you still get 
> standard LCS performance characteristics, which are usually acceptable 
> (although I believe there are a few degenerate compaction cases which are not 
> handled in the current implementation). A quick benchmark using vanilla 
> Cassandra 2.0.10 (i.e. no rewrite optimization) shows a *77% reduction in 
> compaction IO* when switching from the Murmur3Partitioner to the 
> ByteOrderedPartitioner.
> The obvious problem is, of course, that using an order-preserving partitioner 
> is a Very Bad idea when N > RF. Using an OPP for time series data ordered by 
> time is utter lunacy.
> It seems to me that one solution is to split apart the roles of the 
> partitioner so that data distribution across the cluster and data ordering on 
> disk can be controlled independently. Ideally, on-disk ordering could be set 
> per CF. I'm curious about the historical choice to order data on disk by 
> token and not by key. Randomized (hashed-key-ordered) distribution across the 
> cluster is obviously a good idea, but natural-key ordering on disk seems like 
> it would have a number of advantages:
> * Better read performance and file-system page-cache efficiency for any 
> workload which accesses certain ranges of row keys more frequently than 
> others (this applies to _many_ use cases beyond time series data).
> * I can't think of a realistic workload where CRUD operations would be 
> noticeably less performant when using natural instead of hash ordering. 
> * Better compression ratios (although probably only for skinny rows).
> * Range based truncation becomes feasible.
> * Ordered range scans might be feasible to implement even with random cluster 
> distribution.
> The only things I can think of which could suffer when using different 
> cluster and disk ordering are bootstrap and repair. Although I have no 
> evidence, the massive potential performance gains certainly still seem to be 
> worth it.
> Thoughts? This approach seems to be fundamentally different from other 
> tickets related to improving time series data (CASSANDRA-6602, 
> CASSANDRA-5561) which focus only on new or modified compaction strategies. By 
> changing data sort order, existing compaction strategies can be made 
> significantly more efficient without imposing new, restrictive, and 
> use-case-specific limitations on the user.
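
The near-zero-compaction claim above can be sketched with a toy model. This is not Cassandra code: the flush threshold, the in-order key stream, and the MD5-based stand-in for Murmur3 tokens are all assumptions made for illustration.

```python
# Toy model of the ticket's core claim: with time-ordered keys written in
# time order, successive memtable flushes produce sstables whose (min, max)
# key ranges are disjoint, so LCS could promote them without rewriting.
# With token (hash) ordering, every flush spans nearly the whole key space.
import hashlib

def murmur_like(key: int) -> int:
    # Illustrative stand-in for a partitioner token, not Cassandra's Murmur3.
    return int.from_bytes(hashlib.md5(str(key).encode()).digest()[:8], "big")

writes = list(range(10_000))   # keys arriving in time order (assumed workload)
FLUSH = 1_000                  # memtable flush threshold (assumed)

def sstable_ranges(sort_key):
    # One (min, max) range per flushed sstable.
    ranges = []
    for start in range(0, len(writes), FLUSH):
        batch = [sort_key(k) for k in writes[start:start + FLUSH]]
        ranges.append((min(batch), max(batch)))
    return ranges

natural = sstable_ranges(lambda k: k)   # sorted on the natural key
hashed = sstable_ranges(murmur_like)    # sorted on the token

def overlaps(ranges):
    # Count adjacent sstable pairs whose key ranges overlap.
    ranges = sorted(ranges)
    return sum(1 for a, b in zip(ranges, ranges[1:]) if a[1] >= b[0])

assert overlaps(natural) == 0   # disjoint ranges: promotion needs no rewrite
assert overlaps(hashed) > 0     # every sstable overlaps its neighbours
```

Disjoint ranges are exactly the case where the "trivial change to LCS" in the description would skip the rewrite entirely; with hash ordering, every promotion forces a merge.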



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
