[jira] [Commented] (CASSANDRA-4915) CQL should prevent or warn about inefficient queries

Sylvain Lebresne (JIRA) Fri, 16 Nov 2012 11:55:13 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499052#comment-13499052 ]


Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. Is it a clause that must be applied to let this query run?

Yes, that's the idea.

bq. Whenever we iterate over more than MAX_EXAMINED we short circuit and return 
what we have

That's a good idea. However, I'd rather see it as a way to fine-tune the 
behavior of the {{FILTERING ALLOWED}} idea above (even if, in the end, we end up 
with pretty much the same thing as what you suggest). Let me explain.

What I'd like to do is:
# refuse queries as they are today when they might involve "filtering" data. By 
filtering here I mean that some records are read but discarded from the 
resultSet.
# add an {{ALLOW FILTERING}} syntax that "unlocks" those queries (as in, allows 
the query to run).
# when {{ALLOW FILTERING}} is used, allow specifying the maximum number of 
filtered records with, say, {{ALLOW FILTERING MAX 500}}.
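
To make the three steps concrete, here is a minimal Python sketch of the proposed semantics. It is a toy model and not Cassandra code: {{run_query}}, its parameters, and {{FilteringNotAllowed}} are hypothetical names introduced for illustration, and the choice to stop as soon as the cap on discarded records is reached is an assumption about how the MAX value would behave.

```python
class FilteringNotAllowed(Exception):
    """Raised when a query may filter data but did not opt in."""


def run_query(rows, predicate, needs_filtering,
              allow_filtering=False, max_filtered=None):
    """Toy model of the three-step proposal (all names are hypothetical).

    rows            -- records the storage layer would read, in order
    predicate       -- returns True for records that belong in the result
    needs_filtering -- True if the query may read records it then discards
    allow_filtering -- models the ALLOW FILTERING clause (step 2)
    max_filtered    -- models the value in ALLOW FILTERING MAX n (step 3)
    """
    # Step 1: refuse queries that might filter, unless they are unlocked.
    if needs_filtering and not allow_filtering:
        raise FilteringNotAllowed("query may filter data; add ALLOW FILTERING")

    results = []
    filtered = 0
    for row in rows:
        # Step 3 (assumed behavior): stop once the cap on discarded
        # records has been reached.
        if max_filtered is not None and filtered >= max_filtered:
            break
        if predicate(row):
            results.append(row)
        else:
            filtered += 1  # record was read but discarded
    return results


if __name__ == "__main__":
    names = ["cat", "dog", "dog", "cat"]
    is_cat = lambda name: name == "cat"
    print(run_query(names, is_cat, True, allow_filtering=True))
    print(run_query(names, is_cat, True, allow_filtering=True, max_filtered=2))
```

With step 2 alone the caller gets every matching record; adding the cap from step 3 can return fewer records for the same data.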

I believe we've reached consensus on 1., but basically the arguments are above.
Now 2. + 3. is pretty much the equivalent of Ed's idea (more precisely, using 
{{LIMIT X ALLOW FILTERING MAX Y}} would be the equivalent of {{LIMIT X 
MAX_EXAMINED X+Y}} if I understand Ed's proposal right). However, the reasons 
why I think we should allow 2. alone are that:
* I do think 2. is useful in its own right. Or rather, you may have cases 
where you want all results, period. Of course you could provide a very big 
value for the max filtered, but that's lame. Or another way to put it is that 
it's one thing to say "I understand this query may do some unknown amount of 
useless work underneath but go ahead" and a slightly different one to control 
exactly how much of that useless work you allow.
* Part 3. is a bit of a break of the API abstraction. What I mean here is that 
the actual behavior/result of a MAX_EXAMINED will depend on implementation 
details. Say tomorrow we somehow optimize how many records are actually 
examined to answer a query, then a query with MAX_EXAMINED may return a different 
result tomorrow even with the exact same settings. Part 2. doesn't have this 
problem, and so while I'm fine with having 3. because I see how it can be useful, 
I'd rather not have it alone.
* On the very practical side of things, part 3. is more complex to implement. 
I'm pretty sure it'll require some storage engine change, for instance. Also, I 
think there are points to clarify: if you shortcut the query, how does the user 
know whether the query was shortcut or not? We can probably add some flag to the 
ResultSet I suppose, or something else, but the point is that I'd rather take 
the time to do that part right. Meaning that I think shoving it in 1.2.0 at 
this point is imho a bad idea. So I'd rather do part 2. now, which I'm 
confident is well defined, and improve with part 3. later.
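
The last two points can be illustrated with another small Python sketch. Again this is a toy model under stated assumptions, not the storage engine: {{ResultSet}}, its {{truncated}} flag, and {{scan}} are hypothetical names, and the "optimized" engine is simulated simply by handing the scan fewer candidate records.

```python
from dataclasses import dataclass, field


@dataclass
class ResultSet:
    rows: list = field(default_factory=list)
    truncated: bool = False  # hypothetical flag: scan stopped at the cap


def scan(candidates, predicate, limit, max_examined):
    """Return up to `limit` matches, examining at most `max_examined` records."""
    result = ResultSet()
    examined = 0
    for row in candidates:
        if len(result.rows) >= limit:
            break  # LIMIT satisfied, nothing was cut short
        if examined >= max_examined:
            result.truncated = True  # more records existed but were not read
            break
        examined += 1
        if predicate(row):
            result.rows.append(row)
    return result


if __name__ == "__main__":
    is_cat = lambda name: name == "cat"

    # Today's engine (assumed) examines every record in order.
    today = scan(["dog", "dog", "cat", "dog", "cat"], is_cat,
                 limit=10, max_examined=3)

    # A hypothetical future engine prunes the two leading non-matches
    # before the scan, so the same cap now covers more useful records.
    tomorrow = scan(["cat", "dog", "cat"], is_cat,
                    limit=10, max_examined=3)

    print(today)
    print(tomorrow)
```

Same data, same query, same cap, yet the two engines return different rows, and only a flag like {{truncated}} tells the caller that the first answer is incomplete.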

                
> CQL should prevent or warn about inefficient queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-4915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Edward Capriolo
>            Priority: Minor
>
> When issuing a query like:
> {noformat}
> CREATE TABLE videos (
>   videoid uuid,
>   videoname varchar,
>   username varchar,
>   description varchar,
>   tags varchar,
>   upload_date timestamp,
>   PRIMARY KEY (videoid,videoname)
> );
> SELECT * FROM videos WHERE videoname = 'My funny cat';
> {noformat}
> Cassandra samples some data using get_range_slice and then applies the query.
> This is very confusing to me, because as an end user I am not sure if the query 
> is fast because Cassandra is performing an optimized query (over an index, or 
> using a slicePredicate) or if Cassandra is simply sampling some random rows 
> and returning me some results. 
> My suggestions:
> 1) force people to supply a LIMIT clause on any query that is going to
> page over get_range_slice
> 2) having some type of explain support so I can establish if this
> query will work in the
> I will champion suggestion 1) because CQL has put itself in a rather unique 
> un-SQL-like position by applying an automatic limit clause without the user 
> asking for it. I also do not believe the CQL language should let the user 
> issue queries that will not work as intended with "larger-than-auto-limit" 
> size data sets.

