[jira] [Commented] (CASSANDRA-8731) Optimise merges involving multiple clustering columns

Sylvain Lebresne (JIRA) Wed, 04 Mar 2015 08:45:45 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347096#comment-14347096
 ]


Sylvain Lebresne commented on CASSANDRA-8731:
---------------------------------------------

I've created CASSANDRA-6936. I did mentioned the idea as I had it however and 
I'm pretending it's full proof, nor that I have it all figured out. I just 
thing there may in general be a way to put the metadata we have to reduce the 
amount of comparisons in some cases.

> Optimise merges involving multiple clustering columns
> -----------------------------------------------------
>
>                 Key: CASSANDRA-8731
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8731
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>              Labels: performance
>             Fix For: 3.0
>
>
> Since the new storage format is dead in the water for the moment, we should 
> do our best to optimise current behaviour. When merging data from multiple 
> sstables with multiple clustering columns, currently we must incur the full 
> costs of comparison for the entire matching prefix, and must heapify every 
> cell in our PriorityQueue, incurring lg(N) of these costlier comparisons for 
> every cell we merge, where N is the number of sources we're merging.
> Essentially I'm proposing a trie-based merge approach as a replacement for 
> the ManyToOne MergeIterator, wherein we treat each clustering component as a 
> tree underwhich all Cells with a common prefix occur. We then perform a tree 
> merge, rather than a flat merge. For byte-order fields this trie can even be 
> a full binary-trie (although built on the fly). The advantage here is that we 
> rapidly prune merges involving disjoint ranges, so that instead of always 
> incurring lg(N) costs on each new record, we may often incur O(1) costs. For 
> timeseries data, for instance, we could merge dozens of files and so long as 
> they were non-overlapping our CPU burden would be little more than reading 
> from a single file.
> On top of this, we no longer incur any of the shared prefix repetition costs, 
> since we compare each prefix piece-wise, and only once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8731) Optimise merges involving multiple clustering columns

Reply via email to