[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

Stefania (JIRA) Thu, 28 Jan 2016 01:52:41 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121128#comment-15121128
 ]


Stefania commented on CASSANDRA-8180:
-------------------------------------

Latest commit available 
[here|https://github.com/stef1927/cassandra/commit/a0e779d5970b764f2732c387067218407617b4ab].

bq. I am worried about the partitionLevelDeletion > minTimestamp check for 
tables with tombstones, though. It seems that the code is still willing to skip 
a table with tombstones if it doesn't delete the whole partition. Could that be 
right?

You are right to be worried, this test fails on trunk:

{code}
    createTable("CREATE TABLE %s (a int, b int, c text, primary key (a, b))");

    for (int i = 0; i < 10; i++)
        execute("INSERT INTO %s (a, b, c) VALUES(1, ?, 'abc')", i);
    flush();

    execute("DELETE FROM %s WHERE a=1 and b <= 3");
    flush();

    assertEmpty(execute("SELECT * FROM %s WHERE a=1 and b <= 2"));
{code}

I git-annotated the original code in 2.2, which is in {{CollationController}} 
at line 293 and traced it to CASSANDRA-5514. I think the comment explaining it 
is 
[this|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13645632&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13645632].
 However, perhaps back than range tombstones did not exist? BTW, [this 
comment|https://issues.apache.org/jira/browse/CASSANDRA-5514?focusedCommentId=13656934&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13656934]
 explains quite well the _raison d'être_ of the min and max clustering values 
in the stats.

bq. Shouldn't minTimestamp be updated only if shouldInclude()?

I agree and we should also update {{minTimestamp}} by looking at memtables as 
well now that CASSANDRA-9949 is available. 

bq. The trace on merging doesn't seem to be using the right value.

Done, thanks.

bq. You should use Transformation for withSSTablesIterated, and I'd actually do 
isEmpty() explicitly-- the new code is harder to read but I don't think it 
improves anything.

Done. Because {{BaseRows}} eagerly caches the static row in its constructor, I 
had to change one thing: rather than counting an sstable as iterated when the 
iterator is initialized, we do so when {{computeNext()}} is called. I think 
this is fair since the constructor only accesses the data partition in very 
limited cases as discussed above. However we do access the partition index, 
which prior to this we would only do for sstables with tombstones. If this is 
not acceptable then either we don't use {{Transformation}} or we change 
{{BaseRows}}.

I've also attached a boolean to the iterator itself rather than counting the 
tables via {{sstablesIterated}} and I was therefore able to get rid of 
{{QueryMemtableAndDiskStatus}}. It seems a bit cleaner like this. I called the 
method returning that boolean {{iterated()}} but perhaps we can find a better 
name.

bq. CQLTester.cfs() is redundant (use getCurrentColumnFamilyStore()).

Done, thanks.

> Optimize disk seek using min/max column name meta data when the LIMIT clause 
> is used
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8180
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local Write-Read Paths
>         Environment: Cassandra 2.0.10
>            Reporter: DOAN DuyHai
>            Assignee: Stefania
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: 8180_001.yaml, 8180_002.yaml
>
>
> I was working on an example of sensor data table (timeseries) and face a use 
> case where C* does not optimize read on disk.
> {code}
> cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
> WITH CLUSTERING ORDER BY (col DESC);
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
> ...
> >nodetool flush test test
> ...
> cqlsh:test> INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
> ...
> >nodetool flush test test
> {code}
> After that, I activate request tracing:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
>  activity                                                                  | 
> timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 
> 23:48:46,498 | 127.0.0.1 |              0
>                             Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
> 23:48:46,498 | 127.0.0.1 |             74
>                                                        Preparing statement | 
> 23:48:46,499 | 127.0.0.1 |            253
>                                   Executing single-partition query on test | 
> 23:48:46,499 | 127.0.0.1 |            930
>                                               Acquiring sstable references | 
> 23:48:46,499 | 127.0.0.1 |            943
>                                                Merging memtable tombstones | 
> 23:48:46,499 | 127.0.0.1 |           1032
>                                                Key cache hit for sstable 3 | 
> 23:48:46,500 | 127.0.0.1 |           1160
>                                Seeking to partition beginning in data file | 
> 23:48:46,500 | 127.0.0.1 |           1173
>                                                Key cache hit for sstable 2 | 
> 23:48:46,500 | 127.0.0.1 |           1889
>                                Seeking to partition beginning in data file | 
> 23:48:46,500 | 127.0.0.1 |           1901
>                                                Key cache hit for sstable 1 | 
> 23:48:46,501 | 127.0.0.1 |           2373
>                                Seeking to partition beginning in data file | 
> 23:48:46,501 | 127.0.0.1 |           2384
>  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
> 23:48:46,501 | 127.0.0.1 |           2768
>                                 Merging data from memtables and 3 sstables | 
> 23:48:46,501 | 127.0.0.1 |           2784
>                                         Read 2 live and 0 tombstoned cells | 
> 23:48:46,501 | 127.0.0.1 |           2976
>                                                           Request complete | 
> 23:48:46,501 | 127.0.0.1 |           3551
> {code}
> We can clearly see that C* hits 3 SSTables on disk instead of just one, 
> although it has the min/max column meta data to decide which SSTable contains 
> the most recent data.
> Funny enough, if we add a clause on the clustering column to the select, this 
> time C* optimizes the read path:
> {code}
> cqlsh:test> SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1;
>  activity                                                                  | 
> timestamp    | source    | source_elapsed
> ---------------------------------------------------------------------------+--------------+-----------+----------------
>                                                         execute_cql3_query | 
> 23:52:31,888 | 127.0.0.1 |              0
>                Parsing SELECT * FROM test WHERE id=1 AND col > 25 LIMIT 1; | 
> 23:52:31,888 | 127.0.0.1 |             60
>                                                        Preparing statement | 
> 23:52:31,888 | 127.0.0.1 |            277
>                                   Executing single-partition query on test | 
> 23:52:31,889 | 127.0.0.1 |            961
>                                               Acquiring sstable references | 
> 23:52:31,889 | 127.0.0.1 |            971
>                                                Merging memtable tombstones | 
> 23:52:31,889 | 127.0.0.1 |           1020
>                                                Key cache hit for sstable 3 | 
> 23:52:31,889 | 127.0.0.1 |           1108
>                                Seeking to partition beginning in data file | 
> 23:52:31,889 | 127.0.0.1 |           1117
>  Skipped 2/3 non-slice-intersecting sstables, included 0 due to tombstones | 
> 23:52:31,889 | 127.0.0.1 |           1611
>                                 Merging data from memtables and 1 sstables | 
> 23:52:31,890 | 127.0.0.1 |           1624
>                                         Read 1 live and 0 tombstoned cells | 
> 23:52:31,890 | 127.0.0.1 |           1700
>                                                           Request complete | 
> 23:52:31,890 | 127.0.0.1 |           2140
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used

Reply via email to