[ https://issues.apache.org/jira/browse/CASSANDRA-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044857#comment-14044857 ]

graham sanderson edited comment on CASSANDRA-7449 at 6/26/14 5:01 PM:
----------------------------------------------------------------------

Thanks for your comments.

Yeah, a more general solution would be nice... it is possible but ugly to do today, because with a static-only row you get this:

{code}
cqlsh:test> select id, series_order, writetime(large_data) from expiring_series;

 id  | series_order | writetime(large_data)
-----+--------------+-----------------------
 123 |         null |      1403801618700000
{code}

but of course with data you get multiple rows per id (note that I use writetime to avoid fetching large_data while still checking that it is present, and series_order nullness to detect the lack of regular rows)

so something like

{code}
select id, series_order, writetime(large_data) from expiring_series PLIMIT 1;
{code}

or whatever the final syntax ends up being would be fine
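The writetime/series_order trick above can also be applied client-side to the full query's results. A minimal sketch in plain Python (hypothetical helper; the tuples stand in for driver result rows, not any real driver API):

```python
from collections import defaultdict

def static_only_ids(rows):
    """Identify partitions that contain only static data.

    rows: iterable of (id, series_order, writetime_large_data) tuples,
    mimicking the result of
      SELECT id, series_order, writetime(large_data) FROM expiring_series;
    A partition whose regular rows have all expired comes back as a single
    row with a null series_order but a non-null writetime (the static
    large_data cell is still present).
    """
    by_id = defaultdict(list)
    for pk, series_order, wt in rows:
        by_id[pk].append((series_order, wt))
    return sorted(
        pk for pk, vals in by_id.items()
        if all(so is None for so, _ in vals)       # no regular rows left
        and any(wt is not None for _, wt in vals)  # static column present
    )

rows = [
    ('123', None, 1403801618700000),  # static-only partition
    ('456', 1, 1403801618700000),     # still has live regular rows
    ('456', 2, 1403801618700000),
]
print(static_only_ids(rows))  # ['123']
```

This is exactly why a server-side PLIMIT (or STATICONLY) would help: the client-side version still has to pull every row of every partition just to throw most of them away.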

Note however (and I'm sure you are already thinking about this) that CASSANDRA-5762 continues to be a pain for the efficiency of this query... my patch allows you to select just two rows, the static one and one other... it seems you are able to guarantee that the static column always appears first (because all-null clustering keys sort first in all data types, I guess), so here PLIMIT or whatever would also allow not fetching the entire row.
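For completeness, the clean-up half of the workaround can be sketched too. A toy illustration (hypothetical helper; it only builds CQL strings, and real code would bind parameters through the driver and re-check under the application's concurrency rules rather than interpolate):

```python
def cleanup_statements(static_only_ids):
    """Build a DELETE per partition reported as static-only.

    Illustrative only: inlines the key for readability; production code
    should use bound parameters via the driver instead of string
    interpolation.
    """
    return ["DELETE FROM expiring_series WHERE id = %r;" % pk
            for pk in static_only_ids]

print(cleanup_statements(['123']))
# ["DELETE FROM expiring_series WHERE id = '123';"]
```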




> Variation of SELECT DISTINCT to find clustering keys with only static columns
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7449
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7449
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: graham sanderson
>         Attachments: paging_broken_no_tests_v0.patch
>
>
> A possible use case for static columns involves (per partition) multiple small TTL time series data values combined with a potentially much larger static piece of data.
> While the TTL time series data will go away on its own, there is no way to TTL the static data (and keep its TTL updated) without re-inserting it every time to reset the TTL, which is undesirable since it is large and unchanged.
> The use case looks something like this:
> {code}
> CREATE KEYSPACE test WITH replication = {
>   'class': 'SimpleStrategy',
>   'replication_factor': '1'
> };
> USE test;
> CREATE TABLE expiring_series (
>       id text,
>       series_order int,
>       small_data text,
>       large_data text static,
>       PRIMARY KEY (id, series_order)
> );
> INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 1, 'antelope') USING TTL 120;
> // time passes (point A)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 2, 'gibbon') USING TTL 120;
> // time passes (point B)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 3, 'firebucket') USING TTL 120;
> // time passes (point C)
> // time passes (point C)
> // time passes and the first row expires (point D)
> // more time passes and eventually all the "rows" expire (point E)
> {code}
> Given the way the storage engine works, there is no trivial way to make the static column expire when the last row expires; however, if there were an easy way to find partitions with no regular rows (just static columns), that would make manual clean-up easy.
> The possible implementation of such a feature is very similar to SELECT 
> DISTINCT, so I'm suggesting SELECT STATICONLY
> Looking at the points again
> h4. Point A
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |             108
> (1 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point B
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              87
>  123 |            2 |     gibbon | this is large and should not be inserted every time |             111
> (2 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
> (2 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point C
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              67
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              91
>  123 |            3 | firebucket | this is large and should not be inserted every time |             110
> (3 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
>  123
> (3 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point D
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              22
>  123 |            3 | firebucket | this is large and should not be inserted every time |              41
> (2 rows)
> cqlsh:test> SELECT id FROM expiring_series;
>  id
> -----
>  123
>  123
> (2 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}
> h4. Point E
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |         null |       null | this is large and should not be inserted every time |            null
> (1 rows)
> cqlsh:test> SELECT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>  id
> -----
>  123
> (1 rows)
> {code}
> Notice that after the last row has expired, SELECT STATICONLY id returns the partition's id, and the partition can be deleted (under whatever concurrency rules the application needs).
> {code}
> cqlsh:test> DELETE FROM expiring_series where id = '123';
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) 
> from expiring_series;
> (0 rows)
> cqlsh:test> SELECT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
> (0 rows)
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
> (0 rows)
> {code}



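The lifecycle the quoted description walks through (regular cells expiring on their own while the static cell survives untouched) can be modeled in a few lines. A toy sketch only, not Cassandra internals; all names are illustrative:

```python
class Partition:
    """Toy model of one partition: TTL'd regular cells plus one static cell.

    Illustrates why the static value outlives the series: expiry is
    per-cell, and the static cell carries no TTL of its own.
    """
    def __init__(self, static_value):
        self.static_value = static_value  # no TTL: never expires
        self.cells = {}                   # series_order -> expiry time

    def insert(self, series_order, now, ttl):
        self.cells[series_order] = now + ttl

    def live_rows(self, now):
        return [so for so, exp in sorted(self.cells.items()) if exp > now]

    def static_only(self, now):
        """True once every regular row has expired but static data remains."""
        return not self.live_rows(now) and self.static_value is not None

p = Partition('this is large and should not be inserted every time')
p.insert(1, now=0, ttl=120)
p.insert(2, now=60, ttl=120)
print(p.static_only(now=100))   # False: rows 1 and 2 are still live
print(p.static_only(now=500))   # True: all regular rows expired, static remains
```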
--
This message was sent by Atlassian JIRA
(v6.2#6252)
