[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-02-26 Thread Benjamin Lerer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169740#comment-15169740
 ] 

Benjamin Lerer edited comment on CASSANDRA-10707 at 2/26/16 8:27 PM:
-

I have pushed a set of commits to address the different review points.

I have not changed the CQL syntax so far.

I did not rename {{CqlGroupByLimits}} as it is only used for {{GROUP BY}} 
queries and not for aggregate queries (queries with aggregates but no group 
by). I also did not rename {{GroupMaker}} as the class is really about creating 
groups, but I am fully open to discussion.

I noticed that {{CQLLimits.forShortReadRetry()}} does not provide any limit on 
the number of rows either.




was (Author: blerer):
I have pushed a set of commits to address the different review points.

I have not changed the CQL syntax so far.

I did not rename {{CqlGroupByLimits}} as it is only used for {{GROUP BY}} 
queries and not for aggregate queries (queries with aggregates but no group 
by). I also did not rename {{GroupMaker}} as the class is really about creating 
groups, but I am fully open to discussion.

I noticed that {{{CQLLimits.forShortReadRetry()}} does not provide any limit on 
the number of rows either.



> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-02-25 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167513#comment-15167513
 ] 

Jon Haddad edited comment on CASSANDRA-10707 at 2/25/16 6:46 PM:
-

I don't think changing the order of ORDER BY and GROUP BY is self explanatory, 
so it doesn't really offer any benefit, imo.  If I was trying out the feature 
I'd mostly be annoyed by it's difference from something I've got muscle memory 
for.  

If you wanted to be technically accurate about it, SQL is declarative.  The 
order in which you specify the predicates, for instance, doesn't matter, it 
just happens to line up with how we mentally process it.  If you chance the 
order of predicates in your WHERE clause it doesn't matter, you'll still end up 
with the same query result.

Assuming I'm understanding the implementation correctly, what you're saying is 
that the query behaves more the following:

{code}
select * from 
 ( select * from table order by some_field limit 100)
group by x,y,z
{code}

Is this correct, or am I missing something?  If it's the case, I hope this 
doesn't box us in later down the line if we want to add support for other 
operations (like sub queries).  If we're going to introduce more 
inconsistencies with SQL (which may be totally fair, I'm just thinking out loud 
here), we would want to put the GROUP BY after the LIMIT, since it's being 
applied then.  I'm not sure what this does to CQL in general, as now we've 
implicitly made the decision to introduce clauses in an imperative fashion.  
I'd rather not see new clauses added piece by piece with different rules 
depending on the context, that definitely won't make things any easier.

So my question is, is CQL a declarative language or not?  Will this ever be 
something we intend to allow:

{code}
select username, score, state count(state) as c from top_scores where game_id=5 
limit 1000 group by state order by c desc limit 5;
{code}

I don't think the above query works at all.  The aggregation is clearly a 
declarative clause.

Now, if the behavior of limit before aggregation is the right decision, that I 
might have to argue with.


was (Author: rustyrazorblade):
I don't think changing the order of ORDER BY and GROUP BY is self explanatory, 
so it doesn't really offer any benefit, imo.  If I was trying out the feature 
I'd mostly be annoyed by it's difference from something I've got muscle memory 
for.  

If you wanted to be technically accurate about it, SQL is declarative.  The 
order in which you specify the clauses doesn't matter, it just happens to line 
up with how we mentally process it.  If you chance the order of predicates in 
your WHERE clause it doesn't matter, you'll still end up with the same query 
result.

Assuming I'm understanding the implementation correctly, what you're saying is 
that the query behaves more the following:

{code}
select * from 
 ( select * from table order by some_field limit 100)
group by x,y,z
{code}

Is this correct, or am I missing something?  If it's the case, I hope this 
doesn't box us in later down the line if we want to add support for other 
operations (like sub queries).  If we're going to introduce more 
inconsistencies with SQL (which may be totally fair, I'm just thinking out loud 
here), we would want to put the GROUP BY after the LIMIT, since it's being 
applied then.  I'm not sure what this does to CQL in general, as now we've 
implicitly made the decision to introduce clauses in an imperative fashion.  
I'd rather not see new clauses added piece by piece with different rules 
depending on the context, that definitely won't make things any easier.

So my question is, is CQL a declarative language or not?  Will this ever be 
something we intend to allow:

{code}
select username, score, state count(state) as c from top_scores where game_id=5 
limit 1000 group by state order by c desc limit 5;
{code}

I don't think the above query works at all.  The aggregation is clearly a 
declarative clause.

Now, if the behavior of limit before aggregation is the right decision, that I 
might have to argue with.

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey,

[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-02-25 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167398#comment-15167398
 ] 

Jon Haddad edited comment on CASSANDRA-10707 at 2/25/16 4:31 PM:
-

{bq}
I do not think that the order in which we perform the operations and the CQL 
syntax should be linked. If I am not mistaken, we filter the data once they 
have been sorted, in most of the cases, but the restrictions appear before the 
ORDER BY clause. In my opinion we should stick to the SQL syntax for that. As 
most of the C* users have a SQL background it will prevent some confusions.
{bq}

I agree with this, sticking with SQL syntax will trip people up a lot less.  


was (Author: rustyrazorblade):
{code}
I do not think that the order in which we perform the operations and the CQL 
syntax should be linked. If I am not mistaken, we filter the data once they 
have been sorted, in most of the cases, but the restrictions appear before the 
ORDER BY clause. In my opinion we should stick to the SQL syntax for that. As 
most of the C* users have a SQL background it will prevent some confusions.
{code}

I agree with this, sticking with SQL syntax will trip people up a lot less.  

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-02-25 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167398#comment-15167398
 ] 

Jon Haddad edited comment on CASSANDRA-10707 at 2/25/16 4:32 PM:
-

{quote}
I do not think that the order in which we perform the operations and the CQL 
syntax should be linked. If I am not mistaken, we filter the data once they 
have been sorted, in most of the cases, but the restrictions appear before the 
ORDER BY clause. In my opinion we should stick to the SQL syntax for that. As 
most of the C* users have a SQL background it will prevent some confusions.
{quote}

I agree with this, sticking with SQL syntax will trip people up a lot less.  


was (Author: rustyrazorblade):
{bq}
I do not think that the order in which we perform the operations and the CQL 
syntax should be linked. If I am not mistaken, we filter the data once they 
have been sorted, in most of the cases, but the restrictions appear before the 
ORDER BY clause. In my opinion we should stick to the SQL syntax for that. As 
most of the C* users have a SQL background it will prevent some confusions.
{bq}

I agree with this, sticking with SQL syntax will trip people up a lot less.  

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-01-20 Thread Brian Hess (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109181#comment-15109181
 ] 

 Brian Hess edited comment on CASSANDRA-10707 at 1/20/16 7:07 PM:
--

Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max( x ) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, x INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum( x ) FROM myData GROUP BY 
partitionKey, clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.  So, the 
assumption that the data will be sorted when it arrives to the coordinator 
might not be true in all cases.


was (Author: brianmhess):
Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max(y) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, y INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum(y) FROM myData GROUP BY partitionKey, 
clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.  So, the 
assumption that the data will be sorted when it arrives to the coordinator 
might not be true in all cases.

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-01-20 Thread Brian Hess (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109181#comment-15109181
 ] 

 Brian Hess edited comment on CASSANDRA-10707 at 1/20/16 7:06 PM:
--

Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max(y) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, y INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum(y) FROM myData GROUP BY partitionKey, 
clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.  So, the 
assumption that the data will be sorted when it arrives to the coordinator 
might not be true in all cases.


was (Author: brianmhess):
Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max(x) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, x INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum(x) FROM myData GROUP BY partitionKey, 
clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.  So, the 
assumption that the data will be sorted when it arrives to the coordinator 
might not be true in all cases.

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-01-20 Thread Brian Hess (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109181#comment-15109181
 ] 

 Brian Hess edited comment on CASSANDRA-10707 at 1/20/16 7:04 PM:
--

Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max(x) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, x INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum(x) FROM myData GROUP BY partitionKey, 
clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.  So, the 
assumption that the data will be sorted when it arrives to the coordinator 
might not be true in all cases.


was (Author: brianmhess):
Correct, what [~iamaleksey] said.  In fact, pushing the aggregate computation 
to the replicas is troublesome at an RF>1.

Quick follow up - will this ticket also cover:
SELECT clusterCol, Max(x) FROM myData GROUP BY clusterCol;

That is, you group on a clustering column, but not on a partition key?

Second question - consider a table with schema myData(partitionKey INT, 
clusteringCol1 INT, clusteringCol2 INT, x INT, PRIMARY KEY ((partitionKey), 
clusteringCol1, clusteringCol2).  Now, will the following query be supported:
SELECT partitionKey, clusteringCol2, Sum(x) FROM myData GROUP BY partitionKey, 
clusteringCol2;

The reason I ask is that the following is not supported:
SELECT partitionKey, clusteringCol2, x FROM myData WHERE partitionKey=5 ORDER 
BY clusteringCol2;
Because you cannot order by clusteringCol2, only clusteringCol1.

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10707) Add support for Group By to Select statement

2016-01-01 Thread Benjamin Lerer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15076378#comment-15076378
 ] 

Benjamin Lerer edited comment on CASSANDRA-10707 at 1/1/16 9:36 PM:


Both will be supported.
What will not be supported is a {{group by}} clause where only a part of the 
partition key will be specified. For example, if a table has a primary key like 
{{PRIMARY KEY((partitionKey1, partitionKey2) clustering1, clustering2)}}, the 
following query will not be supported:
{{SELECT partitionKey1, MAX(value) FROM myTable GROUP BY partitionKey1}}

As for the aggregates, the grouping will be performed on the coordinator node. 
By consequence, if the driver use the Token aware policy, a query containing a 
partition key predicate will be more efficient as the aggregates will be built 
on the node where the data are located.

>From the syntax point of view, the queries:
{{SELECT partitionKey, clusteringColumn1, Max(value) FROM myTable WHERE 
partitionKey=5 GROUP BY partitionKey, clusteringColumn1;}}
and  {{SELECT partitionKey, clusteringColumn1, Max(value) FROM myTable WHERE 
partitionKey=5 GROUP BY clusteringColumn1;}} will be both supported due to the 
fact that the {{partitionKey}} column is restricted by an {{=}} operator.


was (Author: blerer):
Both will be supported.
What will not be supported is a {{group by}} clause were only a part of the 
partition key will be specified. For example, if a table has a primary key like 
{{PRIMARY KEY((partitionKey1, partitionKey2) clustering1, clustering2)}}, the 
following query will not be supported:
{{SELECT partitionKey1, MAX(value) FROM myTable GROUP BY partitionKey1}}

As for the aggregates, the grouping will be performed on the coordinator node. 
By consequence, if the driver use the Token aware policy, a query containing a 
partition key predicate will be more efficient as the aggregates will be built 
on the node where the data are located.

>From the syntax point of view, the queries:
{{SELECT partitionKey, clusteringColumn1, Max(value) FROM myTable WHERE 
partitionKey=5 GROUP BY partitionKey, clusteringColumn1;}}
and  {{SELECT partitionKey, clusteringColumn1, Max(value) FROM myTable WHERE 
partitionKey=5 GROUP BY clusteringColumn1;}} will be both supported due to the 
fact that the {{partitionKey}} column is restricted by an {{=}} operator.

> Add support for Group By to Select statement
> 
>
> Key: CASSANDRA-10707
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
> Project: Cassandra
>  Issue Type: Improvement
>  Components: CQL
>Reporter: Benjamin Lerer
>Assignee: Benjamin Lerer
>
> Now that Cassandra support aggregate functions, it makes sense to support 
> {{GROUP BY}} on the {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable GROUP 
> BY partitionKey, clustering0, clustering1; 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)