[ https://issues.apache.org/jira/browse/CASSANDRA-7890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14123668#comment-14123668 ]
Jonathan Ellis commented on CASSANDRA-7890:
-------------------------------------------

bq. I'm curious about the historical choice to order data on disk by token and not key.

That means that adding new nodes means you stream contiguous ranges.

> LCS and time series data
> ------------------------
>
>                 Key: CASSANDRA-7890
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7890
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Dan Hendry
>             Fix For: 3.0
>
>
> Consider the following very typical schema for bucketed time series data:
> {noformat}
> CREATE TABLE user_timeline (
>     ts_bucket bigint,
>     username varchar,
>     ts timeuuid,
>     data blob,
>     PRIMARY KEY ((ts_bucket, username), ts))
> {noformat}
> If you have a single Cassandra node (or a cluster where RF = N) and use the
> ByteOrderedPartitioner, LCS becomes *ridiculously*, *obscenely* efficient.
> Under a typical workload where data is inserted in order, compaction IO can
> be reduced to *near zero*, since sstable key ranges don't overlap (given a
> trivial change to LCS so that sstables with no overlap are not rewritten
> when being promoted into the next level). Better yet, ordered insertion is
> not _required_: even if insertion order is completely random, you still get
> standard LCS performance characteristics, which are usually acceptable
> (although I believe there are a few degenerate compaction cases not handled
> by the current implementation). A quick benchmark using vanilla Cassandra
> 2.0.10 (i.e. no rewrite optimization) shows a *77% reduction in compaction
> IO* when switching from the Murmur3Partitioner to the ByteOrderedPartitioner.
> The obvious problem is, of course, that using an order-preserving
> partitioner is a Very Bad Idea when N > RF. Using an OPP for time series
> data ordered by time is utter lunacy.
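The "trivial change to LCS" mentioned in the quoted description can be sketched roughly as follows. This is a hypothetical Python illustration of the idea, not Cassandra's actual Java implementation; the `SSTable` and `promote` names are invented for the example:

```python
# Hypothetical sketch of the proposed LCS tweak: when promoting an sstable
# into the next level, skip the rewrite entirely if its key range does not
# overlap any sstable already in that level, and just relink the file.
from dataclasses import dataclass
from typing import List

@dataclass
class SSTable:
    first_key: str  # smallest key covered by the file
    last_key: str   # largest key covered by the file

def overlaps(a: SSTable, b: SSTable) -> bool:
    """Two key ranges overlap unless one ends before the other begins."""
    return not (a.last_key < b.first_key or b.last_key < a.first_key)

def promote(candidate: SSTable, next_level: List[SSTable]) -> str:
    """Return the action LCS would take when promoting `candidate`."""
    if any(overlaps(candidate, existing) for existing in next_level):
        return "rewrite"   # standard LCS: merge-compact overlapping tables
    next_level.append(candidate)
    return "move"          # proposed optimization: cheap file move, no IO

# With time-ordered inserts under an order-preserving partitioner,
# successive sstables cover disjoint key ranges, so promotion is free:
level = [SSTable("2014-01", "2014-03"), SSTable("2014-04", "2014-06")]
print(promote(SSTable("2014-07", "2014-09"), level))  # move
print(promote(SSTable("2014-02", "2014-05"), level))  # rewrite
```

Under a hash partitioner the candidate's key range spans nearly the whole token space, so the overlap test almost always fails and every promotion pays the rewrite cost; that difference is what the 77% benchmark number above is measuring.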
> It seems to me that one solution is to split apart the roles of the
> partitioner so that data distribution across the cluster and data ordering
> on disk can be controlled independently. Ideally, on-disk ordering could be
> set per CF. I'm curious about the historical choice to order data on disk
> by token and not key. Randomized (hashed-key-ordered) distribution across
> the cluster is obviously a good idea, but natural-key ordering on disk
> seems like it would have a number of advantages:
> * Better read performance and file system page cache efficiency for any
> workload which accesses certain ranges of row keys more frequently than
> others (this applies to _many_ use cases beyond time series data).
> * I can't think of a realistic workload where CRUD operations would be
> noticeably less performant when using natural instead of hash ordering.
> * Better compression ratios (although probably only for skinny rows).
> * Range-based truncation becomes feasible.
> * Ordered range scans might be feasible to implement even with random
> cluster distribution.
> The only things I can think of which could suffer when using different
> cluster and disk orderings are bootstrap and repair. Although I have no
> evidence, the massive potential performance gains certainly still seem to
> be worth it.
> Thoughts? This approach seems to be fundamentally different from other
> tickets related to improving time series data (CASSANDRA-6602,
> CASSANDRA-5561), which focus only on new or modified compaction strategies.
> By changing the data sort order, existing compaction strategies can be made
> significantly more efficient without imposing new, restrictive, and
> use-case-specific limitations on the user.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
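The split-roles proposal quoted above can be illustrated with a small sketch: node ownership is still decided by a hash of the partition key (as with Murmur3), while each node keeps its rows sorted by the natural key so that range reads touch contiguous disk pages. This is a hypothetical Python model under stated assumptions; `owner_node` and `DiskIndex` are invented names, not Cassandra APIs:

```python
# Hypothetical model of splitting the partitioner's two roles:
# 1) cluster distribution by hashed token, 2) on-disk order by natural key.
import bisect
import hashlib

def token(partition_key: str) -> int:
    """Hash-based token: decides which node owns the row (random spread)."""
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def owner_node(partition_key: str, num_nodes: int) -> int:
    return token(partition_key) % num_nodes

class DiskIndex:
    """Per-node index kept sorted by the *natural* key, not the token,
    so a scan over one (ts_bucket, username) range is a contiguous read."""
    def __init__(self):
        self._keys = []

    def insert(self, natural_key):
        bisect.insort(self._keys, natural_key)

    def range_scan(self, lo, hi):
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return self._keys[i:j]

# Rows for one day land next to each other on disk even though their
# tokens, and hence owning nodes, are effectively random:
idx = DiskIndex()
for key in [(20140901, "alice"), (20140902, "alice"), (20140901, "bob")]:
    idx.insert(key)
print(idx.range_scan((20140901, ""), (20140901, "zzzz")))
```

The sketch also hints at why the ticket flags bootstrap and repair as the losers: a node's token range no longer maps to a contiguous byte range in its sstables, so streaming a token range means a scatter-gather over the natural-key-ordered files.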