[ https://issues.apache.org/jira/browse/CASSANDRA-4915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13499052#comment-13499052 ]
Sylvain Lebresne commented on CASSANDRA-4915:
---------------------------------------------

bq. Is it a clause that must be applied to let this query run?

Yes, that's the idea.

bq. Whenever we iterate over more than MAX_EXAMINED we short-circuit and return what we have

That's a good idea. However, I'd rather see it as a way to fine-tune the behavior of the {{FILTERING ALLOWED}} idea above (even if, in the end, we end up with pretty much the same thing as what you suggest). Let me explain.

What I'd like to do is:
# refuse queries, as we do today, when they might involve "filtering" data. By filtering here I mean that some records are read but discarded from the resultSet.
# add an {{ALLOW FILTERING}} syntax that "unlocks" those queries (as in, allows the query to run).
# when {{ALLOW FILTERING}} is used, allow specifying the maximum number of filtered records with, say, {{ALLOW FILTERING MAX 500}}.

I believe we've reached consensus on 1., but basically the arguments are above. Now 2. + 3. is pretty much the equivalent of Ed's idea (more precisely, using {{LIMIT X ALLOW FILTERING MAX Y}} would be the equivalent of {{LIMIT X MAX_EXAMINED X+Y}}, if I understand Ed's proposal right). However, the reasons why I think we should allow 2. alone are:
* I do think 2. is useful in its own right. Or rather, you may have cases where you want all results, period. Of course you could provide a very big value for the max filtered, but that's lame. Or another way to put it is that it's one thing to say "I understand this query may do some unknown amount of useless work underneath but go ahead" and a slightly different one to control exactly how much of that useless work you allow.
* Part 3. is a bit of a break of the API abstraction. What I mean here is that the actual behavior/result of a MAX_EXAMINED will depend on implementation details.
Say tomorrow we somehow optimize how many records are actually examined to answer a query; then a query with MAX_EXAMINED may return a different result tomorrow even with the exact same settings. Part 2. doesn't have this problem, and so while I'm fine with having 3., because I see how it can be useful, I'd rather not have it alone.
* On the very practical side of things, part 3. is more complex to implement. I'm pretty sure it'll require some storage engine change, for instance. Also, I think there are points to clarify: if you shortcut the query, how does the user know whether the query was shortcut or not? We can probably add some flag to the ResultSet I suppose, or something else, but the point is that I'd rather take the time to do that part right. Meaning that I think shoving it in 1.2.0 at this point is imho a bad idea.

So I'd rather do part 2. now, which I'm confident is well defined, and improve with part 3. later.

> CQL should prevent or warn about inefficient queries
> ----------------------------------------------------
>
>                 Key: CASSANDRA-4915
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4915
>             Project: Cassandra
>          Issue Type: Improvement
>    Affects Versions: 1.2.0 beta 1
>            Reporter: Edward Capriolo
>            Priority: Minor
>
> When issuing a query like:
> {noformat}
> CREATE TABLE videos (
>   videoid uuid,
>   videoname varchar,
>   username varchar,
>   description varchar,
>   tags varchar,
>   upload_date timestamp,
>   PRIMARY KEY (videoid,videoname)
> );
> SELECT * FROM videos WHERE videoname = 'My funny cat';
> {noformat}
> Cassandra samples some data using get_range_slice and then applies the query.
> This is very confusing to me, because as an end user I am not sure if the query
> is fast because Cassandra is performing an optimized query (over an index, or
> using a slicePredicate) or if Cassandra is simply sampling some random rows
> and returning me some results.
> My suggestions:
> 1) force people to supply a LIMIT clause on any query that is going to
> page over get_range_slice
> 2) having some type of explain support so I can establish if this
> query will work in the
> I will champion suggestion 1) because CQL has put itself in a rather unique
> un-SQL-like position by applying an automatic limit clause without the user
> asking for it. I also do not believe the CQL language should let the user
> issue queries that will not work as intended with "larger-than-auto-limit"
> size data sets.
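
For illustration only, the three-part proposal in the comment above (refuse filtering queries, unlock them with {{ALLOW FILTERING}}, optionally cap the discarded rows with {{MAX}}) could be modelled roughly as below. This is a toy in-memory sketch in Python, not Cassandra code: {{run_query}}, {{Result}}, {{FilteringNotAllowed}}, the {{needs_filtering}} argument and the {{truncated}} flag are all hypothetical names made up here, and the exact cut-off rule is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


class FilteringNotAllowed(Exception):
    """Hypothetical error: the query would filter rows but was not unlocked."""


@dataclass
class Result:
    rows: List[dict]
    # Hypothetical flag telling the caller whether the scan was cut short
    # because the cap on discarded rows was reached.
    truncated: bool


def run_query(
    rows: Iterable[dict],
    predicate: Callable[[dict], bool],
    limit: int,
    needs_filtering: bool,
    allow_filtering: bool = False,
    max_filtered: Optional[int] = None,
) -> Result:
    # Part 1: refuse a query that may read rows only to discard them.
    # Part 2: allow_filtering=True unlocks it, with no bound on wasted work.
    if needs_filtering and not allow_filtering:
        raise FilteringNotAllowed("query may filter rows; use ALLOW FILTERING")

    matched: List[dict] = []
    filtered = 0  # rows read but discarded so far
    for row in rows:
        if len(matched) >= limit:
            break  # LIMIT satisfied, normal termination
        if predicate(row):
            matched.append(row)
        else:
            # Part 3 (assumed rule): stop once the discard budget is spent.
            if max_filtered is not None and filtered >= max_filtered:
                return Result(matched, truncated=True)
            filtered += 1
    return Result(matched, truncated=False)


if __name__ == "__main__":
    videos = [{"videoname": n} for n in ["a", "My funny cat", "b", "c", "My funny cat"]]
    is_cat = lambda r: r["videoname"] == "My funny cat"
    # Roughly: SELECT ... LIMIT 10 ALLOW FILTERING MAX 2 (proposed syntax)
    print(run_query(videos, is_cat, limit=10, needs_filtering=True,
                    allow_filtering=True, max_filtered=2))
```

In this model a query with limit X and cap Y reads about X+Y rows before stopping, which is the rough correspondence with Ed's MAX_EXAMINED described above. The {{truncated}} flag stands in for the open question of how the user learns that the query was shortcut.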