[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381674#comment-14381674 ]
Sylvain Lebresne commented on CASSANDRA-8180:
---------------------------------------------

bq. I found it much easier to understand

Glad that it's the case.

bq. I think it might make sense if I implement this change directly on a branch based on {{8099_engine_refactor}}

I wouldn't be the one to blame you for that.

bq. I cannot find a way to implement this unless we iterate twice, the first time to count until the limit has been reached in {{SinglePartitionSliceCommand}} and the second time to return the data

You actually don't have to care about the limit (in SinglePartitionSliceCommand at least). The way to do this would be to return an iterator that first queries and returns the results of the first sstable and, once it has returned all of those results, transparently queries the 2nd sstable and starts returning its results, etc...

That being said, I do suspect doing this at the merging level (in MergeIterator) would be better. The idea would be to specialize the merge iterator to take specific iterators that expose some {{lowerBound()}} method. That method would be allowed to return a value that is not returned by the iterator but is lower than anything it will return. The merge iterator would use those lower bounds as the initial {{Candidate}} for each iterator, but would know that when it consumes those candidates it should just discard them (and get the actual next value of the iterator). Basically, we'd add a way for the iterator to say "don't bother using me until you've at least reached value X". The sstable iterators would typically implement that {{lowerBound}} method by returning the sstable "min column name". Provided we make sure the sstable iterators don't do any work unless their {{hasNext/next}} methods are called, we wouldn't actually use an sstable until we've reached its "min column name".
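To make the lower-bound idea a bit more concrete, here is a rough, self-contained sketch. It is not Cassandra code: {{LowerBoundedIterator}}, {{LazySource}} and {{LowerBoundMergeIterator}} are hypothetical names made up for illustration, plain ascending integers stand in for column names, and there is no reconciliation of equal values as a real merge would need.

```java
import java.util.*;

// Hypothetical interface, not a Cassandra API: an iterator that can cheaply
// report a value <= everything it will ever return.
interface LowerBoundedIterator<T> extends Iterator<T> {
    T lowerBound();
}

// Stand-in for an sstable iterator: it does no work until hasNext/next is called.
final class LazySource implements LowerBoundedIterator<Integer> {
    private final int min;               // plays the role of the sstable "min column name"
    private final List<Integer> sorted;  // ascending values this source would read
    private Iterator<Integer> delegate;  // stays null until the source is really used
    boolean opened = false;              // lets callers observe whether work was done

    LazySource(int min, List<Integer> sorted) {
        this.min = min;
        this.sorted = sorted;
    }

    public Integer lowerBound() { return min; }

    private Iterator<Integer> open() {
        if (delegate == null) {
            opened = true;
            delegate = sorted.iterator();
        }
        return delegate;
    }

    public boolean hasNext() { return open().hasNext(); }
    public Integer next() { return open().next(); }
}

// Simplified k-way merge that seeds its queue with lower bounds instead of real values.
final class LowerBoundMergeIterator implements Iterator<Integer> {
    private static final class Candidate {
        final LowerBoundedIterator<Integer> source;
        final int value;
        final boolean isBound; // true: placeholder from lowerBound(), never emitted

        Candidate(LowerBoundedIterator<Integer> source, int value, boolean isBound) {
            this.source = source;
            this.value = value;
            this.isBound = isBound;
        }
    }

    // Order by value; on ties a bound sorts first, so its source is opened before emitting.
    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
        Comparator.comparingInt((Candidate c) -> c.value)
                  .thenComparingInt(c -> c.isBound ? 0 : 1));

    LowerBoundMergeIterator(List<? extends LowerBoundedIterator<Integer>> sources) {
        for (LowerBoundedIterator<Integer> s : sources)
            queue.add(new Candidate(s, s.lowerBound(), true)); // no hasNext/next call here
    }

    // Pull the next real value (if any) from a source into the queue.
    private void advance(LowerBoundedIterator<Integer> source) {
        if (source.hasNext())
            queue.add(new Candidate(source, source.next(), false));
    }

    public boolean hasNext() {
        // Discard bound candidates at the head, replacing each with its source's real value.
        while (!queue.isEmpty() && queue.peek().isBound)
            advance(queue.poll().source);
        return !queue.isEmpty();
    }

    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        Candidate head = queue.poll();
        advance(head.source);
        return head.value;
    }
}
```

With three sources whose lower bounds are 10, 20 and 30, asking the merge iterator for a single value (as a LIMIT 1 would) only opens the first source; the other two stay untouched until the merge actually reaches their bound.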
Doing it that way would have 2 advantages over doing it at the "collation" level:
# this is more general as it would work even if the sstables' min/max column names intersect (it's harder/uglier to do the same at the collation level imo).
# this would work for range queries too.

We may want to build that on top of CASSANDRA-8915 however.

> Optimize disk seek using min/max column name meta data when the LIMIT clause is used
> ------------------------------------------------------------------------------------
>
> Key: CASSANDRA-8180
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Environment: Cassandra 2.0.10
> Reporter: DOAN DuyHai
> Assignee: Stefania
> Priority: Minor
> Fix For: 3.0
>
>
> I was working on an example of sensor data table (timeseries) and face a use
> case where C* does not optimize read on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activate request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
> activity | timestamp | source | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
> execute_cql3_query | 23:48:46,498 | 127.0.0.1 | 0
> Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 | 74
> Preparing statement | 23:48:46,499 | 127.0.0.1 | 253
> Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 | 930
> Acquiring sstable references | 23:48:46,499 | 127.0.0.1 | 943
> Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 | 1032
> Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 | 1160
> Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1173
> Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 | 1889
> Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1901
> Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 | 2373
> Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 | 2384
> Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 | 2768
> Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 | 2784
> Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 | 2976
> Request complete | 23:48:46,501 | 127.0.0.1 | 3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one,
> although it has the min/max column meta data to decide which SSTable contains
> the most recent data.
> Funny enough, if we add a clause on the clustering column to the select, this
> time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
> activity | timestamp | source | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
> execute_cql3_query | 23:52:31,888 | 127.0.0.1 | 0
> Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 | 60
> Preparing statement | 23:52:31,888 | 127.0.0.1 | 277
> Executing single-partition query on test | 23:52:31,889 | 127.0.0.1 | 961
> Acquiring sstable references | 23:52:31,889 | 127.0.0.1 | 971
> Merging memtable tombstones | 23:52:31,889 | 127.0.0.1 | 1020
> Key cache hit for sstable 3 | 23:52:31,889 | 127.0.0.1 | 1108
> Seeking to partition beginning in data file | 23:52:31,889 | 127.0.0.1 | 1117
> Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:52:31,889 | 127.0.0.1 | 1611
> Merging data from memtables and 1 sstables | 23:52:31,890 | 127.0.0.1 | 1624
> Read 1 live and 0 tombstoned cells | 23:52:31,890 | 127.0.0.1 | 1700
> Request complete | 23:52:31,890 | 127.0.0.1 | 2140
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)