[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121128#comment-15121128 ]
Stefania commented on CASSANDRA-8180: ------------------------------------- Latest commit available [here|https://github.com/stef1927/cassandra/commit/a0e779d5970b764f2732c387067218407617b4ab]. bq. I am worried about the partitionLevelDeletion > minTimestamp check for tables with tombstones, though. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right? You are right to be worried, this test fails on trunk: {code} createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))"); for (int i = 0; i < 10; i++) execute("INSERT INTO %s (a, b, c) VALUES(1, ?, 'abc')", i); flush(); execute("DELETE FROM %s WHERE a=1 and b <= 3"); flush(); assertEmpty(execute("SELECT * FROM %s WHERE a=1 and b <= 2")); {code} I git-annotated the original code in 2.2, which is in {{CollationController}} at line 293 and traced it to CASSANDRA-5514. I think the comment explaining it is [this|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13645632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13645632]. However, perhaps back than range tombstones did not exist? BTW, [this comment|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13656934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13656934] explains quite well the _raison d'ĂȘtre_ of the min and max clustering values in the stats. bq. Shouldn't minTimestamp be updated only if shouldInclude()? I agree and we should also update {{minTimestamp}} by looking at memtables as well now that CASSANDRA-9949 is available. bq. The trace on merging doesn't seem to be using the right value. Done, thanks. bq. You should use Transformation for withSSTablesIterated, and I'd actually do isEmpty() explicitly-- the new code is harder to read but I don't think it improves anything. Done. Because {{BaseRows}} eagerly caches the static row in its constructor, I had to change one thing: rather than counting an sstable as iterated when the iterator is initialized, we do so when {{computeNext()}} is called. I think this is fair since the constructor only accesses the data partition in very limited cases as discussed above. However we do access the partition index, which prior to this we would only do for sstables with tombstones. If this is not acceptable then either we don't use {{Transformation}} or we change {{BaseRows}}. I've also attached a boolean to the iterator itself rather than counting the tables via {{sstablesIterated}} and I was therefore able to get rid of {{QueryMemtableAndDiskStatus}}. It seems a bit cleaner like this. I called the method returning that boolean {{iterated()}} but perhaps we can find a better name. bq. CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()). Done, thanks. > Optimize disk seek using min/max column name meta data when the LIMIT clause > is used > ------------------------------------------------------------------------------------ > > Key: CASSANDRA-8180 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8180 > Project: Cassandra > Issue Type: Improvement > Components: Local Write-Read Paths > Environment: Cassandra 2.0.10 > Reporter: DOAN DuyHai > Assignee: Stefania > Priority: Minor > Fix For: 3.x > > Attachments: 8180_001.yaml, 8180_002.yaml > > > I was working on an example of sensor data table (timeseries) and face a use > case where C* does not optimize read on disk. > {code} > cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) > WITH CLUSTERING ORDER BY (col DESC); > cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10'); > ... > >nodetool flush test test > ... > cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20'); > ... > >nodetool flush test test > ... > cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30'); > ... > >nodetool flush test test > {code} > After that, I activate request tracing: > {code} > cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1; > activity | > timestamp | source | source_elapsed > ---------------------------------------------------------------------------+--------------+-----------+---------------- > execute_cql3_query | > 23:48:46,498 | 127.0.0.1 | 0 > Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | > 23:48:46,498 | 127.0.0.1 | 74 > Preparing statement | > 23:48:46,499 | 127.0.0.1 | 253 > Executing single-partition query on test | > 23:48:46,499 | 127.0.0.1 | 930 > Acquiring sstable references | > 23:48:46,499 | 127.0.0.1 | 943 > Merging memtable tombstones | > 23:48:46,499 | 127.0.0.1 | 1032 > Key cache hit for sstable 3 | > 23:48:46,500 | 127.0.0.1 | 1160 > Seeking to partition beginning in data file | > 23:48:46,500 | 127.0.0.1 | 1173 > Key cache hit for sstable 2 | > 23:48:46,500 | 127.0.0.1 | 1889 > Seeking to partition beginning in data file | > 23:48:46,500 | 127.0.0.1 | 1901 > Key cache hit for sstable 1 | > 23:48:46,501 | 127.0.0.1 | 2373 > Seeking to partition beginning in data file | > 23:48:46,501 | 127.0.0.1 | 2384 > Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | > 23:48:46,501 | 127.0.0.1 | 2768 > Merging data from memtables and 3 sstables | > 23:48:46,501 | 127.0.0.1 | 2784 > Read 2 live and 0 tombstoned cells | > 23:48:46,501 | 127.0.0.1 | 2976 > Request complete | > 23:48:46,501 | 127.0.0.1 | 3551 > {code} > We can clearly see that C* hits 3 SSTables on disk instead of just one, > although it has the min/max column meta data to decide which SSTable contains > the most recent data. > Funny enough, if we add a clause on the clustering column to the select, this > time C* optimizes the read path: > {code} > cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; > activity | > timestamp | source | source_elapsed > ---------------------------------------------------------------------------+--------------+-----------+---------------- > execute_cql3_query | > 23:52:31,888 | 127.0.0.1 | 0 > Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | > 23:52:31,888 | 127.0.0.1 | 60 > Preparing statement | > 23:52:31,888 | 127.0.0.1 | 277 > Executing single-partition query on test | > 23:52:31,889 | 127.0.0.1 | 961 > Acquiring sstable references | > 23:52:31,889 | 127.0.0.1 | 971 > Merging memtable tombstones | > 23:52:31,889 | 127.0.0.1 | 1020 > Key cache hit for sstable 3 | > 23:52:31,889 | 127.0.0.1 | 1108 > Seeking to partition beginning in data file | > 23:52:31,889 | 127.0.0.1 | 1117 > Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | > 23:52:31,889 | 127.0.0.1 | 1611 > Merging data from memtables and 1 sstables | > 23:52:31,890 | 127.0.0.1 | 1624 > Read 1 live and 0 tombstoned cells | > 23:52:31,890 | 127.0.0.1 | 1700 > Request complete | > 23:52:31,890 | 127.0.0.1 | 2140 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)