[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Sylvain Lebresne (JIRA) Tue, 06 Nov 2012 00:56:19 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491326#comment-13491326
 ]


Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

I agree that us doing a full scan for that kind of query is confusing. In fact, 
that's a break of our otherwise applied rule: we don't allow queries that are 
not indexed. In that case we don't have an index and fallback to a full scan 
which we never do otherwise.

So the "logical" thing would just be to refuse that type of query (but to be 
clear, I think "SELECT * FROM videos" should always be allowed, because there 
is no surprise here, you've asked everything). We talked about allowing 
indexing the component of the clustering key (and though it's not done yet, I 
see no reason not to do it eventually), and once that is done we will be able 
to do those queries efficiently and it's only then that we should, again in 
theory, allow them.

Now in practice there is the fact that those queries more or less correspond to 
range_slice_queries and there is a good chance people would complain if we 
disallow them. I do note that it's not fully equivalent to the thrift case 
however, in the sense that in the thrift case you're literally asking for some 
sub-slice of all rows (or at least a range of rows), and in the result you will 
get all the rows, but with an empty set of columns if the provided filter 
selected nothing. In CQL3, you select _only_ the rows _where_ some predicate is 
true, so you won't get all those internal rows that have nothing for you.

bq. force people to supply a LIMIT clause

I really don't think this is a LIMIT problem and thus I don't think forcing (or 
doing anything with) LIMIT is the solution. Namely, if you have billions of 
rows and none of them has {{videoname = 'My funny cat'}}, then whatever the 
limit you provide (even 1) this query will timeout. Now I have some things to 
say about LIMIT and I've created CASSANDRA-4918 for that, but this is a 
completely orthogonal problem imo.

So in terms of solutions, here are the ones I would suggest by order of 
preferences:
# we could add a new {{ALLOW FULL SCAN}} option to {{SELECT}} queries that 
would explicitly say "I allow the engine to do a full scan and thus I 
understand my query performance may suck immensely". We would then not allow 
queries like
{noformat}
SELECT * FROM videos WHERE videoname = 'My funny cat'
{noformat}
  until we support 2ndary indexing videoname, but we would allow
{noformat}
SELECT * FROM videos WHERE videoname = 'My funny cat' ALLOW FULL SCAN
{noformat}
  (alternative syntax could be 'ALLOW NON-INDEXED SCAN' or whatever). I think 
this would be in line with what we want for Cassandra: make the user explicitly 
conscious of the performance implications of its queries. We could even later 
extend the support of this 'ALLOW FULL SCAN' bits by bits to other type of 
queries we refuse today (though I'm certainly not implying this should be a 
priority).
# if others really don't like my previous idea, I do think that the logical 
next best thing is to refuse that type of queries pure and simple.
# as a last resort (though I don't really like it tbh), we could add some form 
a simple explain that would tell you whether a query is indexed or not (but I 
largely prefer the 'you have to explicitly say you're fine with non-indexed' 
solution).

                
> CQL should force limit when query samples data.
> -----------------------------------------------
>
>                 Key: CASSANDRA-4915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Edward Capriolo
>            Priority: Minor
>
> When issuing a query like:
> {noformat}
> CREATE TABLE videos (
>   videoid uuid,
>   videoname varchar,
>   username varchar,
>   description varchar,
>   tags varchar,
>   upload_date timestamp,
>   PRIMARY KEY (videoid,videoname)
> );
> SELECT * FROM videos WHERE videoname = 'My funny cat';
> {noformat}
> Cassandra samples some data using get_range_slice and then applies the query.
> This is very confusing to me, because as an end user am not sure if the query 
> is fast because Cassandra is performing an optimized query (over an index, or 
> using a slicePredicate) or if cassandra is simple sampling some random rows 
> and returning me some results. 
> My suggestions:
> 1) force people to supply a LIMIT clause on any query that is going to
> page over get_range_slice
> 2) having some type of explain support so I can establish if this
> query will work in the
> I will champion suggestion 1) because CQL has put itself in a rather unique 
> un-sql like position by applying an automatic limit clause without the user 
> asking for them. I also do not believe the CQL language should let the user 
> issue queries that will not work as intended with "larger-then-auto-limit" 
> size data sets.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (CASSANDRA-4915) CQL should force limit when query samples data.

Reply via email to