[ https://issues.apache.org/jira/browse/CASSANDRA-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14044857#comment-14044857 ]
graham sanderson edited comment on CASSANDRA-7449 at 6/26/14 5:00 PM:
----------------------------------------------------------------------
Yeah, a more general solution would be nice. It is possible, but ugly, to do today, because with a static-only row you get this:
{code}
cqlsh:test> select id, series_order, writetime(large_data) from expiring_series;

 id  | series_order | writetime(large_data)
-----+--------------+-----------------------
 123 |         null |      1403801618700000
{code}
but of course with regular data present you get multiple rows per id. (Note that I use writetime to avoid fetching large_data while still checking that it is present, and I check series_order for null to detect the lack of regular rows.) So something like
{code}
select id, series_order, writetime(large_data) from expiring_series PLIMIT 1;
{code}
or whatever syntax would be fine.

Note however that (and I'm sure you are thinking about this already) CASSANDRA-5762 continues to be a pain in terms of the efficiency of this query. My patch allows you to select just two rows: the static one and one other. It seems you are able to guarantee the static column always appears first (because all-null clustering keys sort first in all data types, I guess), so here PLIMIT or whatever would allow not fetching the entire partition.

> Variation of SELECT DISTINCT to find clustering keys with only static columns
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7449
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7449
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API, Core
>            Reporter: graham sanderson
>         Attachments: paging_broken_no_tests_v0.patch
>
> A possible use case for static columns involves (per partition) multiple small TTL time series data values combined with a potentially much larger static piece of data.
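The static-only detection trick described in the comment above (a live writetime(large_data) together with a null series_order) can be sketched in Python over the rows such a query returns. This is an illustrative simulation only; the row shapes and the function name are invented here, not part of any patch or driver API:

```python
# Hypothetical sketch: given rows as returned by
#   SELECT id, series_order, writetime(large_data) FROM expiring_series;
# a partition is "static only" when its static column is live
# (writetime non-null) but it has no live regular rows (series_order null).

def static_only_ids(rows):
    """Return partition ids with a live static column but no regular rows."""
    has_static = set()    # ids whose static column is present
    has_regular = set()   # ids with at least one live regular row
    for row in rows:
        if row["writetime"] is not None:
            has_static.add(row["id"])
        if row["series_order"] is not None:
            has_regular.add(row["id"])
    return sorted(has_static - has_regular)

# Point E from the description below: all regular rows expired, static remains.
rows_point_e = [
    {"id": "123", "series_order": None, "writetime": 1403801618700000},
]
# Point C: three live regular rows share the static column.
rows_point_c = [
    {"id": "123", "series_order": 1, "writetime": 1403801618700000},
    {"id": "123", "series_order": 2, "writetime": 1403801618700000},
    {"id": "123", "series_order": 3, "writetime": 1403801618700000},
]

print(static_only_ids(rows_point_e))  # ['123'] -> candidate for manual DELETE
print(static_only_ids(rows_point_c))  # []      -> partition still has live rows
```

The proposed SELECT STATICONLY would push exactly this filtering server-side, avoiding the full-partition fetch the comment complains about.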
> While the TTL time series data will go away on its own, there is no way to TTL the static data (and keep it updated with the latest TTL) without re-inserting it every time to reset the TTL, which is undesirable since it is large and unchanged.
>
> The use case looks something like this:
> {code}
> CREATE KEYSPACE test WITH replication = {
>     'class': 'SimpleStrategy',
>     'replication_factor': '1'
> };
>
> USE test;
>
> CREATE TABLE expiring_series (
>     id text,
>     series_order int,
>     small_data text,
>     large_data text static,
>     PRIMARY KEY (id, series_order)
> );
>
> INSERT INTO expiring_series (id, large_data) VALUES ('123', 'this is large and should not be inserted every time');
>
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 1, 'antelope') USING TTL 120;
> // time passes (point A)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 2, 'gibbon') USING TTL 120;
> // time passes (point B)
> INSERT INTO expiring_series (id, series_order, small_data) VALUES ('123', 3, 'firebucket') USING TTL 120;
> // time passes (point C)
> // time passes and the first row expires (point D)
> // more time passes and eventually all the "rows" expire (point E)
> {code}
> Given the way the storage engine works, there is no trivial way to make the static column expire when the last row expires. However, if there were an easy way to find partitions with no regular rows (just static columns), that would make manual clean-up easy.
>
> The possible implementation of such a feature is very similar to SELECT DISTINCT, so I'm suggesting SELECT STATICONLY.
>
> Looking at the points again:
> h4. Point A
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |             108
>
> (1 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
> (0 rows)
> {code}
> h4. Point B
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              87
>  123 |            2 |     gibbon | this is large and should not be inserted every time |             111
>
> (2 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
>  id
> -----
>  123
>  123
>
> (2 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
> (0 rows)
> {code}
> h4. Point C
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            1 |   antelope | this is large and should not be inserted every time |              67
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              91
>  123 |            3 | firebucket | this is large and should not be inserted every time |             110
>
> (3 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
>  id
> -----
>  123
>  123
>  123
>
> (3 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
> (0 rows)
> {code}
> h4. Point D
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |            2 |     gibbon | this is large and should not be inserted every time |              22
>  123 |            3 | firebucket | this is large and should not be inserted every time |              41
>
> (2 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
>  id
> -----
>  123
>  123
>
> (2 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
> (0 rows)
> {code}
> h4. Point E
> {code}
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
>  id  | series_order | small_data | large_data                                          | ttl(small_data)
> -----+--------------+------------+-----------------------------------------------------+-----------------
>  123 |         null |       null | this is large and should not be inserted every time |            null
>
> (1 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
> (0 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
>  id
> -----
>  123
>
> (1 rows)
> {code}
> Notice that after the last row has expired, SELECT STATICONLY id returns the partition, and it can be deleted (under whatever concurrency rules the application needs):
> {code}
> cqlsh:test> DELETE FROM expiring_series where id = '123';
> cqlsh:test> SELECT id, series_order, small_data, large_data, ttl(small_data) from expiring_series;
>
> (0 rows)
>
> cqlsh:test> SELECT id FROM expiring_series;
>
> (0 rows)
>
> cqlsh:test> SELECT DISTINCT id FROM expiring_series;
>
> (0 rows)
>
> cqlsh:test> SELECT STATICONLY id FROM expiring_series;
>
> (0 rows)
> {code}

-- This message was sent by Atlassian JIRA (v6.2#6252)