[jira] [Commented] (CASSANDRA-9028) Optimize LIMIT execution to mitigate need for a full partition scan

Sylvain Lebresne (JIRA) Thu, 26 Mar 2015 03:49:16 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381688#comment-14381688 ]


Sylvain Lebresne commented on CASSANDRA-9028:
---------------------------------------------

Well, the trace does say that all sstables have been "touched", as you said, 
and they have, but "touching" a sstable is a world away from reading the entire 
partition into memory. The reason your first query "touches" 2 sstables is 
that the code does not know which sstable will have results for the query, how 
many results each will have, nor which results will sort first. This is not 
particularly abnormal: there is only so much the storage engine can deduce 
without reading any data. But this doesn't change the fact that as little as 
possible is read in each sstable, and we certainly don't retrieve entire 
partitions unless we have to.
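
To make the distinction concrete, here is a toy sketch in Python (not 
Cassandra's actual code). It assumes each sstable can be scanned lazily in 
clustering order; the CountingSSTable class and select_with_limit function are 
hypothetical names made up for the illustration. Both sstables get "touched", 
but a LIMIT 1 only pulls one row from each before the merge can answer.

```python
import heapq
from itertools import islice


class CountingSSTable:
    """Hypothetical stand-in for one sstable's slice of a partition.

    Rows are (clustering_key, value) tuples kept in clustering order.
    rows_read counts how many rows were actually pulled from this table.
    """

    def __init__(self, rows):
        self.rows = sorted(rows)
        self.rows_read = 0

    def scan(self):
        # Lazy scan: a row is only "read" when the consumer asks for it.
        for row in self.rows:
            self.rows_read += 1
            yield row


def select_with_limit(sstables, limit):
    # Merge the sorted scans lazily and stop as soon as the limit is met.
    merged = heapq.merge(*(t.scan() for t in sstables))
    return list(islice(merged, limit))


a = CountingSSTable([(1, "a"), (4, "d"), (7, "g")])
b = CountingSSTable([(2, "b"), (3, "c"), (9, "z")])
result = select_with_limit([a, b], 1)
```

Here the merge has to look at the first row of each table to know which one 
sorts first, so neither table can be skipped, yet no table is read past what 
the limit requires.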

The reason the 2nd request actually only hits a single sstable is that this 
request is more restricted and the engine is able to use that additional 
restriction to eliminate one of the sstables.
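
As a rough illustration of how such elimination can work, the Python sketch 
below assumes each sstable carries metadata with the minimum and maximum 
clustering values it contains. The SSTableMeta class, its fields and the 
sstables_to_read function are hypothetical and only meant to show the idea, not 
the engine's real data structures.

```python
from dataclasses import dataclass


@dataclass
class SSTableMeta:
    """Hypothetical per-sstable metadata: the clustering range it covers."""
    name: str
    min_clustering: int
    max_clustering: int


def may_contain(meta, value):
    # A table can be skipped when the value lies outside its recorded range.
    return meta.min_clustering <= value <= meta.max_clustering


def sstables_to_read(metas, clustering_value=None):
    # Without a clustering restriction nothing can be ruled out up front.
    if clustering_value is None:
        return list(metas)
    return [m for m in metas if may_contain(m, clustering_value)]


metas = [SSTableMeta("s1", 0, 50), SSTableMeta("s2", 51, 100)]
unrestricted = sstables_to_read(metas)
restricted = sstables_to_read(metas, clustering_value=75)
```

With no clustering restriction both tables remain candidates; with the 
restriction on the value 75, only s2 is left to read.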

For completeness' sake, I'll note that there is an optimization we're 
contemplating in CASSANDRA-8180 to avoid "touching" sstables in some cases. 
This might or might not help your first query; I honestly haven't looked 
closely enough at the example to say. It won't make a terribly huge difference 
in any case.

> Optimize LIMIT execution to mitigate need for a full partition scan
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-9028
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9028
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: jonathan lacefield
>         Attachments: Data.1.json, Data.2.json, Data.3.json, test.ddl, 
> tracing.out
>
>
> Currently, a SELECT statement for a single Partition Key that contains a 
> LIMIT X clause will fetch an entire partition from a node and place the 
> partition into memory prior to applying the limit clause and returning 
> results to be served to the client via the coordinator.
> This JIRA is to request an optimization for the CQL LIMIT clause to avoid the 
> entire partition retrieval step, and instead only retrieve the components to 
> satisfy the LIMIT condition.
> Ideally, any LIMIT X would avoid the need to retrieve a full partition.  This 
> may not be possible though.  As a compromise, it would still be incredibly 
> beneficial if a LIMIT 1 clause could be optimized to only retrieve the 
> "latest" item.  Ideally a LIMIT 1 would "operationally behave" the same way 
> as a Clustering Key WHERE clause where the "latest", i.e. LIMIT 1 field, col 
> value was specified.
> We can supply some trace results to help show the difference between 2 
> different queries that perform the same logical function if desired.
>   For example, a query that returns the latest value for a clustering col 
> where QUERY 1 uses a LIMIT 1 clause and QUERY 2 uses a WHERE <clustering col> 
> = <latest value>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
