[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140749#comment-15140749 ] Aleksey Yeschenko commented on CASSANDRA-8180:

Committed as [b11fba750c610de5e97acba070cc571cf0a96416|https://github.com/apache/cassandra/commit/b11fba750c610de5e97acba070cc571cf0a96416] to trunk (3.4), thanks.

> Optimize disk seek using min/max column name meta data when the LIMIT clause is used
> ------------------------------------------------------------------------------------
>
>          Key: CASSANDRA-8180
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
>      Project: Cassandra
>   Issue Type: Improvement
>   Components: Local Write-Read Paths
>  Environment: Cassandra 2.0.10
>     Reporter: DOAN DuyHai
>     Assignee: Stefania
>     Priority: Minor
>      Fix For: 3.4
>  Attachments: 8180_001.yaml, 8180_002.yaml
>
> I was working on an example of a sensor data table (time series) and faced a use case where C* does not optimize the read on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id, col)) WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 10, '10');
> ...
> > nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 20, '20');
> ...
> > nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 30, '30');
> ...
> > nodetool flush test test
> {code}
> After that, I activate request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
>
>  activity                                                                   | timestamp    | source    | source_elapsed
> ----------------------------------------------------------------------------+--------------+-----------+----------------
>  execute_cql3_query                                                         | 23:48:46,498 | 127.0.0.1 |              0
>  Parsing SELECT * FROM test WHERE id=1 LIMIT 1;                             | 23:48:46,498 | 127.0.0.1 |             74
>  Preparing statement                                                        | 23:48:46,499 | 127.0.0.1 |            253
>  Executing single-partition query on test                                   | 23:48:46,499 | 127.0.0.1 |            930
>  Acquiring sstable references                                               | 23:48:46,499 | 127.0.0.1 |            943
>  Merging memtable tombstones                                                | 23:48:46,499 | 127.0.0.1 |           1032
>  Key cache hit for sstable 3                                                | 23:48:46,500 | 127.0.0.1 |           1160
>  Seeking to partition beginning in data file                                | 23:48:46,500 | 127.0.0.1 |           1173
>  Key cache hit for sstable 2                                                | 23:48:46,500 | 127.0.0.1 |           1889
>  Seeking to partition beginning in data file                                | 23:48:46,500 | 127.0.0.1 |           1901
>  Key cache hit for sstable 1                                                | 23:48:46,501 | 127.0.0.1 |           2373
>  Seeking to partition beginning in data file                                | 23:48:46,501 | 127.0.0.1 |           2384
>  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones  | 23:48:46,501 | 127.0.0.1 |           2768
>  Merging data from memtables and 3 sstables                                 | 23:48:46,501 | 127.0.0.1 |           2784
>  Read 2 live and 0 tombstoned cells                                         | 23:48:46,501 | 127.0.0.1 |           2976
>  Request complete                                                           | 23:48:46,501 | 127.0.0.1 |           3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, although it has the min/max column name metadata to decide which SSTable contains the most recent data.
> Funny enough, if we add a clause on the clustering column to the select, this time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
>
>  activity                                                    | timestamp    | source    | source_elapsed
> -------------------------------------------------------------+--------------+-----------+----------------
>  execute_cql3_query                                          | 23:52:31,888 | 127.0.0.1 |              0
>  Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 |             60
>  Preparing statement                                         | 23:52:31,888 | 127.0.0.1 |            277
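The optimization requested here can be sketched outside Cassandra. In the minimal model below (all names are illustrative, not Cassandra's API), each flushed sstable records the min/max clustering values it contains, so with {{CLUSTERING ORDER BY (col DESC)}} and {{LIMIT 1}} only the sstable with the highest max value needs to be read. The sketch deliberately ignores memtables, timestamps, and tombstones, which the review discussion on this ticket shows are exactly the hard parts.

```java
import java.util.Comparator;
import java.util.List;

public class MinMaxSkipSketch {
    // Per-sstable clustering bounds, as kept in sstable stats metadata.
    record SSTableStats(String name, int minCol, int maxCol) {}

    // For a DESC clustering order with LIMIT 1, the first result row lives in
    // the sstable whose max clustering value is largest; the rest can be skipped.
    static String pickNewest(List<SSTableStats> tables) {
        return tables.stream()
                     .max(Comparator.comparingInt(SSTableStats::maxCol))
                     .map(SSTableStats::name)
                     .orElseThrow();
    }

    public static void main(String[] args) {
        // Mirrors the three flushes in the example: col = 10, 20, 30.
        List<SSTableStats> tables = List.of(
            new SSTableStats("sstable-1", 10, 10),
            new SSTableStats("sstable-2", 20, 20),
            new SSTableStats("sstable-3", 30, 30));
        System.out.println(pickNewest(tables)); // sstable-3, which holds col=30
    }
}
```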
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140087#comment-15140087 ] Stefania commented on CASSANDRA-8180:

CI looks good, this is ready to commit.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138226#comment-15138226 ] Stefania commented on CASSANDRA-8180:

Thanks for the review Branimir! The patch for trunk is [here|https://github.com/stef1927/cassandra/commits/8180], rebased and squashed. Launched one more CI round; if all good, will move the ticket to ready to commit:

||trunk||
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-dtest/]|
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136987#comment-15136987 ] Branimir Lambov commented on CASSANDRA-8180:

Changes look good and patch looks good.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125613#comment-15125613 ] Stefania commented on CASSANDRA-8180:

Thank you for your comments; refer to [this commit|https://github.com/stef1927/cassandra/commit/a180f02ebf7679fac76a27cf7be752f310d343ff]:

* Re-introduced a unique list of iterators; we filter by iterator type before counting the sstables iterated.
* Overrode {{UnfilteredRowIteratorWithLowerBound.staticRow}}, and yes, the problem was the single-source case.
* Replaced the inline {{Transformation}} instance with a named class.
* Modified the existing tests for static columns to also test when flushing to sstables.
* Fixed some trivial warnings and removed some duplication in test code ({{flush(boolean forceFlush)}}).

CI is still running. I will also need to squash before we can flag this as ready to commit.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123535#comment-15123535 ] Benedict commented on CASSANDRA-8180:

One other nit: for debugging it's much clearer if we create named instances of {{Transformation}} - these can be declared inline in the method, so it's only one extra line of code.
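Benedict's nit can be illustrated with a stand-alone sketch (a hypothetical {{Transformation}}-like interface below, not Cassandra's actual class): a class named inline in the method behaves identically to a lambda, but its name appears in stack traces and debugger views, which is what makes debugging clearer for one extra line of code.

```java
public class NamedTransformationDemo {
    // Hypothetical stand-in for a Transformation-style hook.
    interface Transformation { int apply(int row); }

    static int runAnonymous(int row) {
        // Shows up under a synthetic lambda name in stack traces.
        Transformation t = r -> r + 1;
        return t.apply(row);
    }

    static int runNamed(int row) {
        // A class named inline in the method: same behavior, but "AddOne"
        // appears in stack traces and debugger views.
        class AddOne implements Transformation {
            public int apply(int r) { return r + 1; }
        }
        return new AddOne().apply(row);
    }

    public static void main(String[] args) {
        System.out.println(runAnonymous(1) + " " + runNamed(1));
    }
}
```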
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123275#comment-15123275 ] Branimir Lambov commented on CASSANDRA-8180:

LGTM. Nits (feel free to ignore):

* You can use {{count}} instead of {{mapToInt(e -> 1).sum()}}.
* The {{ListUtils}} terminology is strange: {{union}} is the concatenation and {{sum}} is the union? It's also not very efficient.
* I don't think the two separate lists make a lot of sense versus using one with an {{instanceof}} check in {{onPartitionClose}}.

bq. BaseRows eagerly caches the static row in its constructor

Isn't {{mergeIterator.mergeStaticRows}} the one that does this? Or is the problem the single-source case? In any case, you could override {{UnfilteredRowIteratorWithLowerBound.staticRow}} to return empty if no static columns are required.
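The {{count}} nit is easy to demonstrate in isolation (the class and method names below are invented for illustration): both pipelines return the same value, but the second states the intent directly and avoids the boxing-free-but-roundabout map-and-sum.

```java
import java.util.List;

public class CountNit {
    // Count non-empty strings by mapping each match to 1 and summing.
    static long viaMapToInt(List<String> items) {
        return items.stream().filter(s -> !s.isEmpty()).mapToInt(e -> 1).sum();
    }

    // Equivalent, but the terminal operation says what it does.
    static long viaCount(List<String> items) {
        return items.stream().filter(s -> !s.isEmpty()).count();
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "", "b", "c");
        System.out.println(viaMapToInt(items) + " " + viaCount(items)); // both count 3
    }
}
```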
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121128#comment-15121128 ] Stefania commented on CASSANDRA-8180:

Latest commit available [here|https://github.com/stef1927/cassandra/commit/a0e779d5970b764f2732c387067218407617b4ab].

bq. I am worried about the partitionLevelDeletion > minTimestamp check for tables with tombstones, though. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right?

You are right to be worried; this test fails on trunk:

{code}
createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");

for (int i = 0; i < 10; i++)
    execute("INSERT INTO %s (a, b, c) VALUES(1, ?, 'abc')", i);

flush();

execute("DELETE FROM %s WHERE a=1 and b <= 3");

flush();

assertEmpty(execute("SELECT * FROM %s WHERE a=1 and b <= 2"));
{code}

I git-annotated the original code in 2.2, which is in {{CollationController}} at line 293, and traced it to CASSANDRA-5514. I think the comment explaining it is [this|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13645632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13645632]. However, perhaps back then range tombstones did not exist? BTW, [this comment|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13656934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13656934] explains quite well the _raison d'ĂȘtre_ of the min and max clustering values in the stats.

bq. Shouldn't minTimestamp be updated only if shouldInclude()?

I agree, and we should also update {{minTimestamp}} by looking at memtables as well, now that CASSANDRA-9949 is available.

bq. The trace on merging doesn't seem to be using the right value.

Done, thanks.

bq. You should use Transformation for withSSTablesIterated, and I'd actually do isEmpty() explicitly-- the new code is harder to read but I don't think it improves anything.

Done. Because {{BaseRows}} eagerly caches the static row in its constructor, I had to change one thing: rather than counting an sstable as iterated when the iterator is initialized, we do so when {{computeNext()}} is called. I think this is fair, since the constructor only accesses the data partition in very limited cases, as discussed above. However, we do access the partition index, which prior to this we would only do for sstables with tombstones. If this is not acceptable, then either we don't use {{Transformation}} or we change {{BaseRows}}.

I've also attached a boolean to the iterator itself rather than counting the tables via {{sstablesIterated}}, and I was therefore able to get rid of {{QueryMemtableAndDiskStatus}}. It seems a bit cleaner like this. I called the method returning that boolean {{iterated()}}, but perhaps we can find a better name.

bq. CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()).

Done, thanks.
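The failing scenario can be modeled in a few lines of stand-alone Java (an illustrative model; the record names are invented): the second sstable contains only a range tombstone and no partition-level deletion, so any skip rule keyed on partition-level deletion alone would drop it from the merge and resurrect the rows it deletes. A correct merge must apply the range tombstone to older rows it covers:

```java
import java.util.List;

public class RangeTombstoneSkipSketch {
    record Row(int clustering, long timestamp) {}
    record RangeTombstone(int fromInclusive, int toInclusive, long timestamp) {}

    // Merge rows from one sstable against a range tombstone from another:
    // a row is shadowed if it falls in the range and is older than the tombstone.
    static List<Row> merge(List<Row> rows, RangeTombstone rt) {
        return rows.stream()
                   .filter(r -> !(r.clustering() >= rt.fromInclusive()
                                  && r.clustering() <= rt.toInclusive()
                                  && r.timestamp() < rt.timestamp()))
                   .toList();
    }

    public static void main(String[] args) {
        // Mirrors the failing test: rows written first, then DELETE ... WHERE b <= 3.
        List<Row> sstableA = List.of(new Row(0, 1), new Row(1, 1), new Row(5, 1));
        RangeTombstone sstableB = new RangeTombstone(0, 3, 2);
        System.out.println(merge(sstableA, sstableB)); // only the clustering=5 row survives
    }
}
```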
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119159#comment-15119159 ] Branimir Lambov commented on CASSANDRA-8180:

Thank you, this now looks almost ready to go.

This is not a problem of the new code, but I am worried about the [{{partitionLevelDeletion > minTimestamp}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R587] check for tables with tombstones, though. It seems that the code is still willing to skip a table with tombstones if it doesn't delete the whole partition. Could that be right?

Nits:

* Shouldn't [{{minTimestamp}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R549] be updated only if {{shouldInclude()}}?
* [The trace on merging|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R607] doesn't seem to be using the right value.
* You should use {{Transformation}} for [{{withSSTablesIterated}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-2e17efa5977a71330df6651d3bec0d12R664], and I'd actually do {{isEmpty()}} explicitly-- the new code is harder to read but I don't think it improves anything.
* [{{CQLTester.cfs()}}|https://github.com/apache/cassandra/compare/trunk...stef1927:8180#diff-8693101ca0a55a372ef784230968648aR397] is redundant (use {{getCurrentColumnFamilyStore()}}).
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118892#comment-15118892 ] Stefania commented on CASSANDRA-8180: - The latest commit is ready for review [here|https://github.com/stef1927/cassandra/commit/fb1a4d93dc101143eb3d4062d577e643e8824c73]. CI is still pending, however.
bq. Isn't the static row outside the iteration?
That's a very good point, thanks for pointing this out. So we don't need to worry about this.
bq. So we can't really trust hasTombstones()? This may be a significant bug.
This method was introduced in this patch, but the same logic is currently used in {{SinglePartitionReadCommand.queryMemtableAndDiskInternal()}} and {{queryMemtableAndSSTablesInTimestampOrder}} to determine which non-overlapping sstables contain no tombstones at all; see especially the comment in the second method:
{code}
// This mean that nothing queried by the filter can be in the sstable. One exception is the top-level partition deletion
// however: if it is set, it impacts everything and must be included. Getting that top-level partition deletion costs us
// some seek in general however (unless the partition is indexed and is in the key cache), so we first check if the sstable
// has any tombstone at all as a shortcut.
if (sstable.getSSTableMetadata().maxLocalDeletionTime == Integer.MAX_VALUE)
    continue; // Means no tombstone at all, we can skip that sstable
{code}
The trouble is that in other places we assume the opposite: for example, {{CompactionController.getFullyExpiredSSTables()}} requires {{maxDeletionTime}} to be {{Integer.MAX_VALUE}} in the presence of live cells, or else it might expire the sstable. I think the code in {{SinglePartitionReadCommand}} is wrong and we should have looked at {{minDeletionTime}} instead, so I've updated the new method accordingly.
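The min-versus-max argument above can be sketched in isolation. This is a hedged illustration, not Cassandra's actual classes: {{StatsSketch}} and its method names are invented here, and {{Integer.MAX_VALUE}} plays the "no deletion time" sentinel role described in the quoted comment.

```java
// Minimal sketch of why minLocalDeletionTime is the safer "no tombstones"
// signal. Convention assumed here: a live, non-expiring cell records
// Integer.MAX_VALUE as its local deletion time; tombstones and expiring
// cells record a finite value.
final class StatsSketch {
    static final int NO_DELETION_TIME = Integer.MAX_VALUE;

    // Safer check: the sstable is tombstone-free only if no cell ever
    // recorded a real (finite) local deletion time.
    static boolean mayHaveTombstones(int minLocalDeletionTime) {
        return minLocalDeletionTime != NO_DELETION_TIME;
    }

    // The check being questioned above: a sentinel max does NOT imply the
    // sstable has no tombstones, because a single live non-expiring cell is
    // enough to keep the max at the sentinel.
    static boolean maxSuggestsTombstones(int maxLocalDeletionTime) {
        return maxLocalDeletionTime != NO_DELETION_TIME;
    }
}
```

With one tombstone (deletion time 1000) plus one live cell, the min is 1000 while the max stays at the sentinel, so only the min-based check reports tombstones.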
I've had some failures with upgrade dtests, due to the clustering values of old sstables ("jb" format) containing more values than the comparator types. I've added a check on {{Version.hasNewStatsFile()}} to prevent this, and some assertions to spot it sooner, and this seems to have fixed the failing tests.
To wrap up the discussion on upgrading sstable stats: doing so would allow applying this optimization also to sstables that have tombstones and are smaller than {{DatabaseDescriptor.column_index_size_in_kb}} (64kb by default). For larger sstables with tombstones, the lower bound comes from the partition index, which will be read anyway in the presence of tombstones when {{partitionDeletion()}} is called. This also requires initializing the iterator, which will cause a disk seek to the partition start if the partition index is not available (sstables smaller than 64kb) or there are static columns in the query; see the {{AbstractSSTableIterator}} constructor. Therefore I don't see the point in upgrading the stats, but should we wish to do it, here is the work required:
- upgrade the sstable version from "ma" to "mb" (unsure what this entails exactly, other than changing {{BigVersion}} and {{BigFormat}})
- introduce a serializer for clustering prefixes that is able to distinguish between clusterings and bounds, using functionality similar to what's in {{UnfilteredSerializer}}
- pass the clustering types, or ideally the {{SerializationHeader}}, to the {{StatsMetadata}} serialize and deserialize methods
- or alternatively add the min and max clustering prefixes to {{SerializationHeader.Component}}, which is also a metadata component. However, in this case we'd have to do some more work to pass it the min and max clustering prefixes, which can instead be easily collected for {{StatsMetadata}} by using the existing method {{MetadataCollector.updateClusteringValues()}}.
I suppose they could also go into {{EncodingStats}}, which is serialized in the header, but I am not sure if this would increase the size of partitions excessively, because it seems we also use {{EncodingStats}} for partitions.
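The version-gated serialization step in the work list above can be sketched roughly as follows. This is an assumption-laden illustration: {{StatsIO}}, the field layout, and treating version strings as ordered identifiers are all invented here and do not reflect the real {{StatsMetadata}} format.

```java
import java.io.*;

// Sketch of version-gated stats (de)serialization: a newer sstable version
// ("mb" in this hypothetical) appends the min clustering bound after the
// existing fields, and the reader consults a per-version capability flag
// rather than the remaining byte count.
final class StatsIO {
    // Hypothetical flag: versions from "mb" onwards carry the new field.
    static boolean hasClusteringBounds(String version) {
        return version.compareTo("mb") >= 0;
    }

    static void serialize(DataOutput out, String version, long minTimestamp,
                          byte[] minClusteringBound) throws IOException {
        out.writeLong(minTimestamp);            // pre-existing field
        if (hasClusteringBounds(version)) {     // new field, newer versions only
            out.writeInt(minClusteringBound.length);
            out.write(minClusteringBound);
        }
    }

    static byte[] deserializeClusteringBound(DataInput in, String version) throws IOException {
        in.readLong();                          // skip minTimestamp
        if (!hasClusteringBounds(version))
            return null;                        // older sstable: bound unavailable
        byte[] bound = new byte[in.readInt()];
        in.readFully(bound);
        return bound;
    }
}
```

Readers of old "ma"/"jb" files simply get a null bound back, which mirrors the "fall back to no lower bound" behaviour discussed in this thread.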
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15117151#comment-15117151 ] Branimir Lambov commented on CASSANDRA-8180:
bq. Further, it seems that static columns could also have problems with clustering values so it's not just tombstones. See the comment in {{SinglePartitionReadCommand.shouldInclude()}}.
Isn't the static row outside the iteration? Unless I'm very mistaken, it shouldn't enter {{MergeIterator}} at all, and thus the bound needn't be corrected for it. Additionally, if static rows are needed, the merge will query each iterator's {{staticRow()}} method, which will initialize the iterator.
bq. It is enabled when there are no tombstones since the iterator is null.
Sorry, my bad, I had overlooked {{partitionLevelDeletion()}} not calling the super method in that case.
bq. It's once again the metadata that is wrong: maxLocalDeletionTime is Integer.MAX_VALUE even though there are tombstones in the sstable, as a consequence sstable.hasTombstones() returns an incorrect value.
So we can't really trust {{hasTombstones()}}? This may be a significant bug.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116897#comment-15116897 ] Stefania commented on CASSANDRA-8180: - I think the issue is in {{MetadataCollector}}:
{code}
public void update(Cell cell)
{
    updateTimestamp(cell.timestamp());
    updateTTL(cell.ttl());
    updateLocalDeletionTime(cell.localDeletionTime());
}
{code}
Shouldn't we check that the cell is a tombstone before updating the local deletion time?
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116781#comment-15116781 ] Stefania commented on CASSANDRA-8180:
bq. It seems we can have sstables with no tombstones, valid clustering values (e.g. 0) and yet the first item is a Bound.BOTTOM.
It's once again the metadata that is wrong: {{maxLocalDeletionTime}} is {{Integer.MAX_VALUE}} even though there are tombstones in the sstable; as a consequence, {{sstable.hasTombstones()}} returns an incorrect value.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116655#comment-15116655 ] Stefania commented on CASSANDRA-8180:
bq. That is, your code is currently completely disabled by the {{if (... iterator != null) return null;}} check in {{lowerBound()}}? Is there a reason to want the code committed then?
It is enabled when there are no tombstones, since the iterator is null. However, we can do better: when there are tombstones, we can still use the row index bound if there is one.
bq. On the other hand, an assertion in {{UnfilteredIteratorWithLowerBound}} that saves the returned {{lowerBound()}} and checks the first {{next()}} to be greater-than or equal may be something we want to have anyway.
I've added this. Whilst the test you proposed is fine, existing tests (e.g. {{DeleteTest.testDeleteWithRangeAndOneClusteringColumn()}}) started to fail, but _only when compaction is enabled_. It seems we can have sstables with no tombstones, valid clustering values (e.g. 0) and yet the first item is a {{Bound.BOTTOM}}. Further, it seems that static columns could also have problems with clustering values, so it's not just tombstones. See the comment in {{SinglePartitionReadCommand.shouldInclude()}}. Moving back to in progress; we might end up modifying the sstable metadata format after all.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115132#comment-15115132 ] Branimir Lambov commented on CASSANDRA-8180:
bq. iter.partitionLevelDeletion() is currently called for all sstables by queryMemtableAndDiskInternal()
That is, your code is currently completely disabled by the [{{if (... iterator != null) return null;}}|https://github.com/stef1927/cassandra/commit/d5cfc6fd56d50eda5d9c510591bae1d66e17ec59#diff-78de604e500e1cc63c6d53a2ac6d6d65R52] check in {{lowerBound()}}? Is there a reason to want the code committed then?
bq. We can replace BTreeRow.emptyRow(ret) with new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE) if there is still a valid reason and ideally a failing test would be useful.
It is not as easy to fully break it as I was expecting, but in the presence of tombstones you can still break a basic feature of the iterators -- the strict ordering of the returned elements. A test that does
{code}
createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");
execute("INSERT INTO %s (a, b, c) VALUES(1, 1, '1')");
execute("INSERT INTO %s (a, b, c) VALUES(1, 3, '3')");
execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
execute("INSERT INTO %s (a, b, c) VALUES(1, 2, '2')");
flush();
execute("DELETE FROM %s where a=1 and b >= 2 and b <= 3");
flush();
execute("SELECT * FROM %s WHERE a = ?", 1);
{code}
will end up with an iterator that lists two tombstone markers with equal clustering. Unfortunately that's filtered out before being returned, so it's not trivial to write a test that checks this. On the other hand, an assertion in {{UnfilteredIteratorWithLowerBound}} that saves the returned {{lowerBound()}} and checks the first {{next()}} to be greater-than or equal may be something we want to have anyway.
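The suggested assertion amounts to a simple iterator wrapper. The sketch below is a generic simplification with invented names; the real check would operate on {{Unfiltered}} items using the table's clustering comparator.

```java
import java.util.Comparator;
import java.util.Iterator;

// Illustrative wrapper: verify that the first element an iterator returns is
// >= the lower bound it advertised. A bound that overshoots the first real
// item would let the merge emit elements out of order.
final class CheckedLowerBoundIterator<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private final T lowerBound;
    private final Comparator<? super T> comparator;
    private boolean checkedFirst = false;

    CheckedLowerBoundIterator(Iterator<T> delegate, T lowerBound, Comparator<? super T> comparator) {
        this.delegate = delegate;
        this.lowerBound = lowerBound;
        this.comparator = comparator;
    }

    @Override public boolean hasNext() { return delegate.hasNext(); }

    @Override public T next() {
        T item = delegate.next();
        if (!checkedFirst) {
            checkedFirst = true;
            // The advertised bound must not exceed the first real item.
            if (comparator.compare(lowerBound, item) > 0)
                throw new IllegalStateException("lower bound " + lowerBound + " exceeds first item " + item);
        }
        return item;
    }
}
```

Only the first call to {{next()}} pays the comparison; subsequent elements pass through untouched.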
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114909#comment-15114909 ] Stefania commented on CASSANDRA-8180: - Pushed a [commit|https://github.com/stef1927/cassandra/commit/d5cfc6fd56d50eda5d9c510591bae1d66e17ec59] where we don't use the lower bound in the presence of tombstones, and {{DeleteTest}} is now passing. CI is still pending, however. I would like to point out that {{iter.partitionLevelDeletion()}} is currently called for all sstables by {{queryMemtableAndDiskInternal()}}, and therefore, in the presence of tombstones, we have to access the sstable anyway.
bq. Tombstones. A DELETE WHERE pk = ? AND ck1 = ? in a table with key (pk, ck1, ck2) will generate one.
This case already existed in {{DeleteTest.testDeleteWithRangeAndTwoClusteringColumns()}}. It does not fail, though, because the clustering comparator compares prefix values from first to last and so it works fine with incomplete prefixes.
bq. An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes same row, flushes again, then checks if it resurfaces-- I believe this would break with the current code). Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.
I don't think this matters any longer if we never replace a tombstone with a lower bound, but I could not reproduce any failures even when using a lower bound for tombstones (other than for one-sided range tombstones).
Here is what I tried:
- add a row, flush, delete same row, flush again => FINE (we don't even use merge iterator)
- add two rows in same partition, flush, delete one and flush, delete other one and flush, FINE again
- add three rows in same partition, flush, delete one and flush, delete other one and flush, FINE again
We can replace {{BTreeRow.emptyRow(ret)}} with {{new RangeTombstoneBoundMarker(RangeTombstone.Bound.inclusiveOpen(filter.isReversed(), ret.getRawValues()), DeletionTime.LIVE)}} if there is still a valid reason, and ideally a failing test would be useful. I will clean up {{lowerBound()}} and add some comments to it in the next round, once we have reached a decision.
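Why an inclusive-open bound is the safer lower bound can be shown with a toy clustering order. Everything here is illustrative: the kind weights are invented for the sketch and are not Cassandra's actual {{ClusteringPrefix}} ordering constants.

```java
import java.util.Comparator;

// Toy model of the ordering point discussed above: a position in a partition
// is a clustering value plus a kind, and an opening bound at value v must
// sort before a row at the same v. Advertising such a bound as the lower
// bound therefore can never overtake a tombstone marker from another sstable,
// whereas an empty row at v would tie with (or follow) markers at v.
final class ClusteringPositionSketch {
    static final int OPEN_BOUND = -1; // illustrative weight for an inclusive-open bound
    static final int ROW = 0;         // illustrative weight for a row

    final int value;
    final int kind;

    ClusteringPositionSketch(int value, int kind) {
        this.value = value;
        this.kind = kind;
    }

    static final Comparator<ClusteringPositionSketch> ORDER =
        Comparator.<ClusteringPositionSketch>comparingInt(p -> p.value)
                  .thenComparingInt(p -> p.kind);
}
```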
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110332#comment-15110332 ] Stefania commented on CASSANDRA-8180:
bq. Tombstones. A {{DELETE WHERE pk = ? AND ck1 = ?}} in a table with key {{(pk, ck1, ck2)}} will generate one.
I will add this test, thanks.
bq. So one solution is not to use the min/maxClusteringValues in that case.
So maybe we could simply return a null lower bound in the presence of tombstones, or is this too much of a compromise?
bq. No need for that, you can set a flag in Version to tell you whether or not the information is present.
Doesn't this only indicate a different sstable version ("la", "ma", etc.)?
bq. This is the fix. Instead of an empty row, the lower bound should be a RangeTombstoneBound as described.
Thanks.
bq. I would just move finding the bound to lowerBound(), and I don't think it's even necessary to save the bound
OK
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110307#comment-15110307 ] Branimir Lambov commented on CASSANDRA-8180:
bq. What is an example of any other incomplete prefix and do we have a gap in the tests then?
Tombstones. A {{DELETE WHERE pk = ? AND ck1 = ?}} in a table with key {{(pk, ck1, ck2)}} will generate one.
bq. What I don't understand is how things like shouldInclude() in ClusteringIndexNamesFilter or ClusteringIndexSliceFilter work.
If you look at the callsites for the method, you will see that they do more work in the presence of tombstones. So one solution is not to use the {{min/maxClusteringValues}} in that case.
bq. \[MetadataSerializer.deserialize()\] should receive the total size to work out if there is more stuff to read at the end.
No need for that, you can set a flag in {{Version}} to tell you whether or not the information is present.
bq. I'm not sure what you mean, \[use a RangeTombstoneBound\] for the test or the fix?
This is the fix. Instead of an empty row, the lower bound should be a {{RangeTombstoneBound}} as described.
bq. The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate but it requires seeking the index file.
In the way you use this class, by the time {{lowerBound()}} is called, all of this is already done (by {{UnfilteredRowMergeIterator.create}}), possibly unnecessarily (if {{MergeIterator.OneToOne}} is to be used). I would just move finding the bound to {{lowerBound()}}, and I don't think it's even necessary to save the bound-- just retrieve it there, the method won't be called more than once.
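The "move finding the bound to {{lowerBound()}}" suggestion amounts to plain lazy evaluation. The sketch below uses invented names and a {{Supplier}} indirection purely for illustration; the real code would consult stats metadata or the row index directly.

```java
import java.util.function.Supplier;

// Sketch of the review suggestion: defer the possibly expensive bound lookup
// (sstable metadata or row index seek) until lowerBound() is actually called.
// If the merge takes the single-iterator fast path (the OneToOne case named
// above), lowerBound() is never invoked and the work never happens.
final class LazyLowerBound<T> {
    private final Supplier<T> readBound; // e.g. reads stats metadata or the row index

    LazyLowerBound(Supplier<T> readBound) {
        this.readBound = readBound;
    }

    // Called at most once by the merge layer, so the result need not be cached.
    T lowerBound() {
        return readBound.get();
    }
}
```

The key property is that constructing the object costs nothing; the supplier runs only on demand, and at most once.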
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109971#comment-15109971 ] Stefania commented on CASSANDRA-8180:

Thank you for the review.

bq. updateClusteringValues() doesn't produce what you need, as it is not just TOP and BOTTOM that are mishandled, any incomplete prefix will also be. What you need is an actual min/maxClusteringPrefix which is different from the min/max of each component separately.

I agree that we need a full clustering prefix. What I don't understand is how things like {{shouldInclude()}} in {{ClusteringIndexNamesFilter}} or {{ClusteringIndexSliceFilter}} work. What is an example of any other incomplete prefix, and do we have a gap in the tests then?

bq. I don't know if it is possible to append the data you need at the end of the stats component and have earlier versions happily ignore that data.

There are 4 components that we write in the same file: VALIDATION, STATS, COMPACTION and HEADER, so we can't simply keep reading until the end of the file. However, we write a TOC with the position of each component. So it should be possible, but it would require changes to {{MetadataSerializer.deserialize()}} and the signature of {{IMetadataComponentSerializer.deserialize()}}, which should receive the total size to work out if there is more stuff to read at the end. I guess we can go for it.

bq. I personally would prefer to not modify the behaviour of MergeIterator and keep it doing one simple thing, but this approach does have its charm.

The changes to {{MergeIterator}} are mostly in the candidate and really minimal; the actual algorithm is unchanged. However, if you do have a less invasive approach in mind, I'm eager to hear it.

bq. An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes the same row, flushes again, then checks if it resurfaces -- I believe this would break with the current code).

Thanks, I'll add this test.

bq. Use a RangeTombstoneBound with DeletionTime.LIVE as the deletion time and a bound obtained by RangeTombstone.Bound.inclusiveOpen, which should do the right thing in both directions.

I'm not sure what you mean: do this for the test or the fix? I'm sure I'll work it out when I write the test though.

bq. IMergeIterator.LowerBound is cryptic, rename it to IteratorWithLowerBound to be explicit about its purpose.

OK.

bq. The choice to set rowIndexLowerBound in partitionLevelDeletion() appears very arbitrary and fragile. What is the reason to do it separately from globalLowerBound? In fact, why have two separate bounds instead of one, set from the most precise information that is available at construction time?

The global lower bound is free, since it is available in the metadata. The index lower bound is more accurate, but it requires seeking in the index file. Calling {{super.partitionLevelDeletion()}} also involves initializing the iterator and accessing the data file ({{AbstractSSTableIterator}} constructor). So we decided to use this more accurate bound only when we really have to access the index anyway, that is, when {{partitionLevelDeletion()}} is called and there are tombstones. See [this comment|https://issues.apache.org/jira/browse/CASSANDRA-8180?focusedCommentId=14388301&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14388301] above.

I hope to resume this in the next few days.
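The layout described above (several components in one stats file, located via a TOC) is what makes appending a new component backwards-safe. The following is a hypothetical, simplified sketch, not the real {{MetadataSerializer}}: here each component is length-prefixed inline instead of located by TOC offsets, but the compatibility property it demonstrates is the same: a reader simply ignores component types it does not know.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: components written with a (type, length) header so a
// reader can skip component types it does not recognise.
class TocDemo {
    static byte[] write(LinkedHashMap<Integer, byte[]> components) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(components.size());
            for (Map.Entry<Integer, byte[]> e : components.entrySet()) {
                out.writeInt(e.getKey());            // component type id
                out.writeInt(e.getValue().length);   // payload length
                out.write(e.getValue());
            }
            return buf.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static Map<Integer, byte[]> read(byte[] file, Set<Integer> knownTypes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
            Map<Integer, byte[]> result = new LinkedHashMap<>();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int type = in.readInt();
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                if (knownTypes.contains(type))       // older readers drop unknown types
                    result.put(type, payload);
            }
            return result;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static void main(String[] args) {
        LinkedHashMap<Integer, byte[]> comps = new LinkedHashMap<>();
        comps.put(0, "validation".getBytes()); // stand-in for VALIDATION
        comps.put(1, "stats".getBytes());      // stand-in for STATS
        comps.put(4, "bounds".getBytes());     // a new, appended component type
        byte[] file = write(comps);
        // an "old" reader knows only types 0 and 1 and is unharmed by type 4
        System.out.println(read(file, Set.of(0, 1)).size());    // 2
        System.out.println(read(file, Set.of(0, 1, 4)).size()); // 3
    }
}
```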
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109052#comment-15109052 ] Branimir Lambov commented on CASSANDRA-8180:

I personally would prefer to not modify the behaviour of {{MergeIterator}} and keep it doing one simple thing, but this approach does have its charm.

An empty row will not work correctly as a lower bound. It does not sort as needed with respect to tombstone bounds, which should also be included in the test (more specifically, one that adds a row, flushes, deletes the same row, flushes again, then checks if it resurfaces -- I believe this would break with the current code). Use a {{RangeTombstoneBound}} with {{DeletionTime.LIVE}} as the deletion time and a bound obtained by {{RangeTombstone.Bound.inclusiveOpen}}, which should do the right thing in both directions.

{{IMergeIterator.LowerBound}} is cryptic; rename it to {{IteratorWithLowerBound}} to be explicit about its purpose.

The choice to set {{rowIndexLowerBound}} in {{partitionLevelDeletion()}} appears very arbitrary and fragile. What is the reason to do it separately from {{globalLowerBound}}? In fact, why have two separate bounds instead of one, set from the most precise information that is available at construction time?
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15108363#comment-15108363 ] Branimir Lambov commented on CASSANDRA-8180:

{{updateClusteringValues()}} doesn't produce what you need, as it is not just {{TOP}} and {{BOTTOM}} that are mishandled, any incomplete prefix will also be. What you need is an actual {{min/maxClusteringPrefix}} which is different from the min/max of each component separately. I don't know if it is possible to append the data you need at the end of the stats component and have earlier versions happily ignore that data.
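The distinction between per-component min/max and a true min/max clustering prefix can be made concrete with a toy illustration. The following is a hypothetical sketch, not Cassandra code: {{int[]}} stands in for a clustering prefix, a shorter array models an incomplete prefix such as a tombstone bound, and the comparator is an assumed lexicographic order in which a prefix sorts before its own extensions.

```java
import java.util.*;

// Toy illustration: the per-component min/max "box" admits clusterings the
// sstable does not contain, and cannot represent an incomplete prefix at all.
class ClusteringBoundsDemo {
    // Lexicographic order; a prefix sorts before its extensions, the way an
    // inclusive-open bound would.
    static final Comparator<int[]> LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++)
            if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        return Integer.compare(a.length, b.length);
    };

    public static void main(String[] args) {
        // An "sstable" containing clusterings (1, 9) and (2, 0).
        int[][] data = { {1, 9}, {2, 0} };

        // Per-component bounds: min = (1, 0), max = (2, 9).
        int[] cwMin = { 1, 0 }, cwMax = { 2, 9 };

        // (1, 5) is inside the per-component box...
        int[] probe = { 1, 5 };
        boolean inBox = cwMin[0] <= probe[0] && probe[0] <= cwMax[0]
                     && cwMin[1] <= probe[1] && probe[1] <= cwMax[1];
        System.out.println(inBox);                           // true

        // ...but lexicographically below everything the sstable holds, so a
        // true min prefix of (1, 9) could have excluded it.
        System.out.println(LEX.compare(probe, data[0]) < 0); // true

        // An incomplete prefix {1} (e.g. a tombstone bound from
        // DELETE ... WHERE pk = ? AND ck1 = ?) sorts before (1, 9), which
        // per-component values cannot express.
        System.out.println(LEX.compare(new int[]{1}, data[0]) < 0); // true
    }
}
```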
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106233#comment-15106233 ] Stefania commented on CASSANDRA-8180:

Rebased and squashed [~blambov]:

||trunk||
|[patch|https://github.com/stef1927/cassandra/commits/8180]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-8180-dtest/]|

There is one issue: {{RangeTombstone.Bound.TOP}} and {{BOTTOM}} are not currently updating the min and max clustering values of {{StatsMetadata}}, since their clustering values are empty arrays; see {{MetadataCollector.updateClusteringValues()}} and {{BTW.applyToMarker()}}. As a result, [some range tombstone tests|http://cassci.datastax.com/job/stef1927-8180-testall/1/testReport/junit/junit/] are failing. It's not clear to me how to capture this information in the metadata without changing its format, given that the smallest/biggest values would depend on the individual clustering types, cc [~slebresne] as well.
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096476#comment-15096476 ] Stefania commented on CASSANDRA-8180:

Will rebase and also try to see if we can leverage CASSANDRA-9975 to avoid wrapping iterators.
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096458#comment-15096458 ] Sylvain Lebresne commented on CASSANDRA-8180:

[~Stefania] Could you have a shot at rebasing this at some point, and [~blambov] have another look at it once that's done? I'd hate for that code to get so outdated that it's too much effort to rebase.
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642236#comment-14642236 ] Stefania commented on CASSANDRA-8180:

[~blambov] I've pushed a new commit where I ensure that lower bounds always compare less than real values when the clustering is the same. This means the existing merge iterator algorithm is now almost unchanged; the only difference is that I moved consume() into the candidate, where we make sure the lower bounds are never consumed by the reducer. We still use empty rows as lower bounds, but they are never used outside of the merge iterator candidate. We could use a specialized {{Unfiltered}} if it really bothers you, however. Can you take another look?

As for performance, with this test the 8180 branch is ahead:

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml ops\(insert=1,\) n=5M -rate threads=300 -insert revisit=uniform\(1..100\) visits=fixed\(25\) -pop seq=1..1K read-lookback=uniform\(1..1K\) contents=SORTED
{code}

http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1609.08&ymin=0&ymax=4637.6

The read command is unchanged, and performance is similar or maybe still better on trunk:

{code}
user profile=https://dl.dropboxusercontent.com/u/15683245/8180.yaml ops\(singleval=1,\) n=5M -rate threads=300
{code}

http://cstar.datastax.com/graph?stats=094f57cc-3409-11e5-bd2b-42010af0688f&metric=op_rate&operation=2_user&smoothing=1&show_aggregates=true&xmin=0&xmax=35.53&ymin=0&ymax=188763.3

I think we are still visiting all sstables on the second command because the global bounds are probably the same. The profile is attached as 8180_002.yaml.

I've also noticed with flight recorder that {{BigTableReader.getPosition()}} and {{Tracing.trace()}} are hotspots, both on trunk and on 8180; we should probably optimize them.
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640475#comment-14640475 ] Stefania commented on CASSANDRA-8180:

bq. I have some worries that relate to the use of a row as the lower bound: for one there is a small risk that having two equal values produced by an iterator could be mishandled by some existing or future code, and as a second I do not know what would happen if an sstable's content starts with a range tombstone, which sorts before rows with equal clustering.

The idea is that these fake {{Clusterable}} instances never leave the merge iterators. I can create a sub-class of {{Clusterable}} for clarity, but it's safe to use an empty row as it is only returned as a lower bound and not as an iterated value. Have you looked at the version of the code I pushed today, where the candidate takes the lower bound from the iterator? At the moment the reducer decides if it should ignore it by checking if the row is empty. This is wrong: it should be the candidate that decides that it should be ignored and the next one returned instead, except this requires a new comparison.

bq. Additionally, I see no changes in MergeIterator.OneToOne or TrivialOneToOne, do they not need updates?

No, they don't need updating; they never use the lower bound.

bq. I believe the solution could be simpler and cleaner if the bound is given as a distinct special value that sorts before any content...

Yes, this would solve the problem just mentioned above, where we need to check if there is an equal real value immediately afterwards. I'll work on this idea, thank you very much!
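The lower-bound idea discussed above can be sketched outside Cassandra's classes. In this hypothetical Java example (all names and structure are illustrative assumptions, not the actual patch, and the merge is ascending for simplicity whereas the ticket's table clusters DESC), each sorted source initially contributes only a cheap lower bound to the merge queue and is opened for real only when its bound reaches the head, so with a small LIMIT most sources are never read:

```java
import java.util.*;

public class LowerBoundMerge {

    static List<Integer> mergeTopK(List<List<Integer>> sources, int k) {
        // entry: [value, sourceIndex, positionWithinSource]; position -1 marks
        // a lower bound standing in for a still-unopened source
        PriorityQueue<int[]> pq =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < sources.size(); i++)
            pq.add(new int[] { sources.get(i).get(0), i, -1 }); // cheap bound only

        List<Integer> out = new ArrayList<>();
        while (out.size() < k && !pq.isEmpty()) {
            int[] top = pq.poll();
            List<Integer> src = sources.get(top[1]);
            if (top[2] < 0) {
                // bound reached the head: open the source for real;
                // the bound itself is discarded and never emitted
                pq.add(new int[] { src.get(0), top[1], 0 });
            } else {
                out.add(top[0]);
                if (top[2] + 1 < src.size())
                    pq.add(new int[] { src.get(top[2] + 1), top[1], top[2] + 1 });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // three "sstables", one value each, as in the ticket's trace
        List<List<Integer>> sstables =
                List.of(List.of(10), List.of(20), List.of(30));
        // with LIMIT 1, only the sstable whose bound wins is ever opened
        System.out.println(mergeTopK(sstables, 1)); // prints [10]
    }
}
```

Only one of the three sources is materialized for LIMIT 1; the other two are represented by their metadata bounds throughout.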
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640337#comment-14640337 ] Branimir Lambov commented on CASSANDRA-8180: I have some worries that relate to the use of a row as the lower bound: for one, there is a small risk that having two equal values produced by the iterators could be mishandled by some existing or future code; and second, I do not know what would happen if an sstable's content starts with a range tombstone, which sorts before rows with equal clustering. Additionally, I see no changes in {{MergeIterator.OneToOne}} or {{TrivialOneToOne}}, do they not need updates? I believe the solution could be simpler and cleaner if the bound is given as a distinct special value that sorts before any content (similar to how range tombstone bounds sort). The existence of these special values needs only concern the arguments given to {{MergeIterator}}: the comparator, which should know to sort them before anything else with the same key (including tombstone bounds), and the reducer, which can then be certain that no attempt will be made to combine a bound with anything else, and can safely ignore it by, for example, returning null from {{getReduced}}; if necessary, bounds can easily be made unequal to each other as well. {{MergeIterator}} will then pop bounds before content rather than with it, and will not need to be modified at all.
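A minimal sketch of the sentinel-bound suggestion, under the assumption (not taken from the patch) that bounds and content can be modelled as a small wrapper type: the comparator orders a bound strictly before real content with the same key, and the reducer returns null for bounds so the merge silently skips them:

```java
import java.util.*;

public class SentinelBoundMerge {
    static final class Item {
        final int key;
        final boolean isBound; // sentinel lower bound, never real content
        Item(int key, boolean isBound) { this.key = key; this.isBound = isBound; }
    }

    // bounds sort before content with an equal key (Boolean false < true),
    // so a bound is always popped on its own, never merged with a row
    static final Comparator<Item> CMP = Comparator
            .comparingInt((Item i) -> i.key)
            .thenComparing(i -> !i.isBound);

    // stand-in for the reducer's getReduced: null means "nothing to emit"
    static Integer reduce(Item i) { return i.isBound ? null : i.key; }

    public static void main(String[] args) {
        List<Item> merged = new ArrayList<>(List.of(
                new Item(10, true),    // bound of an sstable we may never open
                new Item(10, false),   // a real row with the same clustering
                new Item(20, true)));
        merged.sort(CMP);

        List<Integer> out = new ArrayList<>();
        for (Item i : merged) {
            Integer r = reduce(i);
            if (r != null) out.add(r); // bounds are dropped, content survives
        }
        System.out.println(out); // prints [10]
    }
}
```

Because the comparator guarantees a bound never ties with content, the reducer needs no emptiness check at all, which is the simplification the comment proposes.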
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640223#comment-14640223 ] Benedict commented on CASSANDRA-8180: - Support for that isn't fantastic, but instead of providing a select chance of 0.001, you can use the {{-insert revisit visits contents}} and {{-pop read-lookback}} options. * {{visits}} tells stress to split inserts for a single partition up into multiple visits, performing a roughly equal number of inserts per visit, but ultimately inserting the entirety of the partition's contents * {{revisit}} tells stress _how many_ of these split-up inserts to maintain at once, and in what distribution they should be revisited (i.e. exp(1..1M) would mean the most recently inserted is exponentially more likely to be visited than the oldest; uniform(1..1M) would make them all uniformly likely; both would limit the number of "in flight" inserts to 1M) * {{read-lookback}} tells reads to operate only over those partitions that have been inserted, and only over the portion that has been inserted * {{contents=SORTED}} sorts the partition data before insert. This is expensive if there are many items, so you should ensure no single clustering column generates more than a couple of dozen values; if necessary, introduce more clustering columns. The main problem is that these features are not well (if at all) tested, so you may encounter problems.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640206#comment-14640206 ] Stefania commented on CASSANDRA-8180: - Thanks, make sure you have the latest code as I have rebased and fixed a couple of things today.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640202#comment-14640202 ] Stefania commented on CASSANDRA-8180: - We definitely need a place to archive these stress profiles; attaching them to the tickets like I've done so far is not very efficient. CASSANDRA-8503 aims to collect important stress profiles for regression testing, perhaps we can continue the discussion there? In terms of this specific performance optimization, so far things aren't going too well. The problem is that I don't know how to create a 'timeseries' profile with cassandra-stress; is this possible [~benedict]? We basically need many clustering rows per partition, and they must be ordered. With the profile attached, at best I have been able to show that we are no worse: http://cstar.datastax.com/tests/id/ac8c686c-31de-11e5-95c3-42010af0688f However, when I increased the data from 1M to 5M and changed the primary key population distribution from 1..100K to 1B, I ended up with something worse: http://cstar.datastax.com/tests/id/11dc0080-31dd-11e5-80f3-42010af0688f So I think there is still some work to do. I have also seen some weirdness in the flight recorder profiles, which I have not attached due to their size but can do so if anyone is interested. I fixed two hot-spots today, but there must be at least one more problem. The whole approach of a lazy wrapping iterator is a bit fragile IMO: all it takes is one method call made too soon and the entire optimization is lost, while the overhead of the wrapper iterator and the increased complexity in the merge iterator remain.
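The fragility of the lazy wrapper can be illustrated with a toy Java example (all names hypothetical, not Cassandra's code): the wrapper exposes a cheap lower bound from metadata, but any premature call that needs real data forces the expensive open, losing the optimization while the wrapper's overhead remains:

```java
import java.util.*;
import java.util.function.Supplier;

public class LazyIteratorDemo {
    static int opens = 0; // counts how many expensive opens actually happened

    static final class LazySource {
        final int lowerBound;
        final Supplier<Iterator<Integer>> open;
        Iterator<Integer> real; // null until someone forces materialization

        LazySource(int lowerBound, Supplier<Iterator<Integer>> open) {
            this.lowerBound = lowerBound;
            this.open = open;
        }

        int lowerBound() { return lowerBound; } // cheap: metadata only

        Integer next() { // any real access forces the expensive open
            if (real == null) real = open.get();
            return real.hasNext() ? real.next() : null;
        }
    }

    static LazySource sstable(int bound, List<Integer> rows) {
        return new LazySource(bound, () -> { opens++; return rows.iterator(); });
    }

    public static void main(String[] args) {
        // disciplined caller: compares bounds first, opens only the winner
        LazySource a = sstable(10, List.of(10));
        LazySource b = sstable(20, List.of(20));
        LazySource best = a.lowerBound() <= b.lowerBound() ? a : b;
        best.next();
        System.out.println("opens after careful use: " + opens);   // prints 1

        // careless caller: calls next() on both "just to peek" -- one call
        // too soon and every source is materialized anyway
        opens = 0;
        LazySource c = sstable(10, List.of(10));
        LazySource d = sstable(20, List.of(20));
        c.next();
        d.next();
        System.out.println("opens after careless use: " + opens);  // prints 2
    }
}
```

A single misplaced call doubles the number of opens here; at scale, it reverts the read path to touching every sstable while still paying for the extra indirection.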
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640182#comment-14640182 ] Branimir Lambov commented on CASSANDRA-8180: Yes, I have started looking at it now.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637037#comment-14637037 ] Joshua McKenzie commented on CASSANDRA-8180: Do we have a place where we're archiving these stress profiles? It seems like it would be helpful to have each profile paired with a small comment on what specific setup/portions of the system it stresses, for future use. We can also socket those into the perf harness when it's ready.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636584#comment-14636584 ] Stefania commented on CASSANDRA-8180: - Noted. I'll work on a suitable stress profile and then attach some flight recorder or cstar_perf comparisons.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636566#comment-14636566 ] Sylvain Lebresne commented on CASSANDRA-8180: - Side note: it would be nice to have some benchmarks validating the benefits of this before we commit, since this is a performance improvement.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636538#comment-14636538 ] Stefania commented on CASSANDRA-8180: - Thanks. [~blambov] would you have time to review this or shall we wait for Sylvain's return?
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636514#comment-14636514 ] Sylvain Lebresne commented on CASSANDRA-8180:

bq. Sylvain Lebresne are you happy to still be the reviewer or do you want to suggest someone else?

I'm leaving on vacation at the end of the week and have enough on my plate until then that it's safe to assume I won't have time to look at this one. I'm happy to have a look when I'm back and have a tad more time, but it's probably a good idea to find another reviewer in the meantime, in case that new reviewer has time to get to it sooner. I would suggest Branimir if he has some cycles, since this ticket has a lot to do with {{MergeIterator}} and he has been dealing with that quite a bit lately.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636477#comment-14636477 ] Stefania commented on CASSANDRA-8180:

I have attached the latest patch, based on trunk, which is ready for review. The main things to note are:

* In MergeIterator we need to discard fake lower-bound values, and we must advance the same iterator to check that there isn't a real value with the exact same value, in which case we must pass it to the reducer together with the equal values from the other iterators. However, we cannot discard fake values too soon: in ManyToOne.advance() we cannot simply advance as long as we have fake values; we need to wait for the heap to be sorted first. I implemented a [peek method|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-a9e2c345aa605d1b8d360b4c44ade32f] in the candidate, which is called for fake values when reducing values. This should not affect the correctness of the algorithm, but there may be a more efficient way to do this, cc [~benedict] and [~blambov].
* We need to decide if we are happy with a wrapping iterator, [LowerBoundUnfilteredRowIterator|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-1cdf42ebc69336015e04f287e8450e51], in which case we may need a better name. I understand that wrapping too many iterators may hurt performance, so we may want to look into modifying AbstractSSTableIterator directly, even though this might be a bit more work. I also wrap the merged iterator to make sure we update the metrics of iterated tables in the close method, when we know how many tables were iterated. It would be nice to at least save this wrapped iterator.
* We need a reliable way to signal a fake Unfiltered object (the lower bound).
I used an [empty row|https://github.com/stef1927/cassandra/commit/9528e672ea65c7a71c0004adfc27e5f4d9ee0acb#diff-3e7088b7213c9faaf80e18dadaaa6929], but perhaps we should use a new specialization of Unfiltered or something else.

[~slebresne] are you happy to still be the reviewer or do you want to suggest someone else?
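The design described in the bullets above — publishing a cheap, fake lower bound (taken from sstable metadata) into the merge heap, and only opening the real disk-backed iterator once that bound reaches the head of the heap — can be sketched in miniature. Everything below (class names, int-valued sources, the ascending order) is invented for illustration and is not the patch's actual code; the ticket's DESC-clustering case is the symmetric situation with upper bounds:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Supplier;

public class LazyMerge {
    // A merge source that knows its lower bound (e.g. from sstable metadata)
    // before paying for the expensive open (e.g. a disk seek).
    static final class Source {
        final int lowerBound;
        final Supplier<Iterator<Integer>> open;
        Iterator<Integer> it; // null until the source is actually opened

        Source(int lowerBound, Supplier<Iterator<Integer>> open) {
            this.lowerBound = lowerBound;
            this.open = open;
        }
    }

    // Heap entry: either a fake lower bound or a real value from an open source.
    static final class Entry {
        final int value; final boolean fake; final Source src;
        Entry(int value, boolean fake, Source src) {
            this.value = value; this.fake = fake; this.src = src;
        }
    }

    static int opens; // instrumentation: how many sources were really opened

    // Returns the first `limit` merged values, opening sources lazily.
    static List<Integer> mergeFirst(List<Source> sources, int limit) {
        opens = 0;
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.value));
        for (Source s : sources)
            heap.add(new Entry(s.lowerBound, true, s)); // cheap: no open yet
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < limit) {
            Entry e = heap.poll();
            if (e.fake) {
                // The fake bound reached the head of the heap: only now is the
                // real iterator opened and its first value pushed back in.
                e.src.it = e.src.open.get();
                opens++;
            } else {
                out.add(e.value);
            }
            if (e.src.it.hasNext())
                heap.add(new Entry(e.src.it.next(), false, e.src));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Source> srcs = new ArrayList<>();
        srcs.add(new Source(10, () -> List.of(10, 11).iterator()));
        srcs.add(new Source(20, () -> List.of(20, 21).iterator()));
        srcs.add(new Source(30, () -> List.of(30, 31).iterator()));
        // LIMIT 1: only the source with the smallest bound is ever opened.
        System.out.println(mergeFirst(srcs, 1) + " opens=" + opens);
    }
}
```

The subtlety in the first bullet — a fake bound tying with a real value, which must still reach the reducer grouped with its equals — is elided here: this sketch simply reinserts into the heap and lets it re-sort, whereas the patch peeks at the freshly opened iterator when reducing.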
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516104#comment-14516104 ] Stefania commented on CASSANDRA-8180:

Rebased to the latest 8099 branch. Test results will be available here:

http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-testall/
http://cassci.datastax.com/view/Dev/view/Stefania/job/stef1927-8180-8099-dtest/