[ https://issues.apache.org/jira/browse/CASSANDRA-6588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147545#comment-15147545 ]
Gianluca Borello commented on CASSANDRA-6588: --------------------------------------------- Thanks for the comment. As you predicted, this makes a night and day difference. Running my benchmark script (the one I explained in the mailing list thread) on 3.3 gives: Response time for querying a single column on a large table (column size 10 MB): 10 columns: 236 ms 20 columns: 684 ms 30 columns: 1096 ms 40 columns: 1219 ms 50 columns: 1809 ms ... (heap failure after this) Running it on the latest trunk as of today: Response time for querying a single column on a large table (column size 10 MB): 10 columns: 52 ms 20 columns: 59 ms 30 columns: 72 ms 40 columns: 100 ms 50 columns: 109 ms 60 columns: 134 ms 70 columns: 155 ms 80 columns: 165 ms 90 columns: 178 ms 100 columns: 199 ms That's absolutely perfect, I just wish this was addressed in 2.1 or maybe even 2.2, moving my production environment to 3.4 is way too scary. > Add a 'NO EMPTY RESULTS' filter to SELECT > ----------------------------------------- > > Key: CASSANDRA-6588 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6588 > Project: Cassandra > Issue Type: Improvement > Reporter: Sylvain Lebresne > Priority: Minor > > It is the semantic of CQL that a (CQL) row exists as long as it has one > non-null column (including the PK columns, which, given that no PK columns > can be null, means that it's enough to have the PK set for a row to exist). > This does means that the result to > {noformat} > CREATE TABLE test (k int PRIMARY KEY, v1 int, v2 int); > INSERT INTO test(k, v1) VALUES (0, 4); > SELECT v2 FROM test; > {noformat} > must be (and is) > {noformat} > v2 > ------ > null > {noformat} > That fact does mean however that when we only select a few columns of a row, > we still need to find out rows that exist but have no values for the selected > columns. Long story short, given how the storage engine works, this means we > need to query full (CQL) rows even when only some of the columns are selected > because that's the only way to distinguish between "the row exists but have > no value for the selected columns" and "the row doesn't exist". I'll note in > particular that, due to CASSANDRA-5762, we can't unfortunately rely on the > row marker to optimize that out. > Now, when you selects only a subsets of the columns of a row, there is many > cases where you don't care about rows that exists but have no value for the > columns you requested and are happy to filter those out. So, for those cases, > we could provided a new SELECT filter. Outside the potential convenience (not > having to filter empty results client side), one interesting part is that > when this filter is provided, we could optimize a bit by only querying the > columns selected, since we wouldn't need to return rows that exists but have > no values for the selected columns. > For the exact syntax, there is probably a bunch of options. For instance: > * {{SELECT NON EMPTY(v2, v3) FROM test}}: the vague rational for putting it > in the SELECT part is that such filter is kind of in the spirit to DISTINCT. > Possibly a bit ugly outside of that. > * {{SELECT v2, v3 FROM test NO EMPTY RESULTS}} or {{SELECT v2, v3 FROM test > NO EMPTY ROWS}} or {{SELECT v2, v3 FROM test NO EMPTY}}: the last one is > shorter but maybe a bit less explicit. As for {{RESULTS}} versus {{ROWS}}, > the only small object to {{NO EMPTY ROWS}} could be that it might suggest it > is filtering non existing rows (I mean, the fact we never ever return non > existing rows should hint that it's not what it does but well...) while we're > just filtering empty "resultSet rows". > Of course, if there is a pre-existing SQL syntax for that, it's even better, > though a very quick search didn't turn anything. Other suggestions welcome > too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)