[ 
https://issues.apache.org/jira/browse/CASSANDRA-10707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107583#comment-15107583
 ] 

 Brian Hess commented on CASSANDRA-10707:
-----------------------------------------

I think that supporting grouping by clustering column (or perhaps even a 
regular column) with a partition key predicate is a good idea.  

I think that supporting grouping by partition key (either in part, or in toto) 
is a bad idea.  In that query, all the data in the cluster would stream to the 
coordinator, which would then be responsible for doing a *lot* of processing.  
In other distributed systems that do GROUP BY queries, the groups end up being 
split up among the nodes in the system and each node is responsible for rolling 
up the data for the groups it was assigned.  This is a common way to get all 
the nodes in the system to help with a pretty significant computation, with the 
results then streamed out (potentially via a single node in the system) to the 
client.  However, in the proposed approach, all the data streams to a single 
node and that node does all the work, for all the groups.  This feels like it 
would either take a ton of work to orchestrate the computation (which would 
start to mimic other systems, e.g., Spark) or would do a lot of work and risk 
being very inefficient and slow.  I am also concerned about how this would 
behave in the face of QueryTimeoutException: would we really be able to do a 
GROUP BY partitionKey aggregate under the QTE limit?
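
To make the contrast concrete, here is a toy sketch in plain Python (not 
Cassandra code).  The node layout, the data, and the function names are all 
hypothetical, and it assumes each partition lives on exactly one node (no 
replication).  It only counts how many rows or partial results reach the 
coordinator under each strategy for a max(value) GROUP BY partitionKey.

```python
# Hypothetical toy data: each "node" holds (partition_key, value) rows.
# Assumption: a given partition key appears on only one node.
NODES = [
    [("a", 1), ("a", 7), ("b", 3)],
    [("c", 5), ("c", 2), ("c", 9)],
    [("d", 4), ("d", 8)],
]


def coordinator_only(nodes):
    """Ship every row to the coordinator, which does all the grouping."""
    shipped = 0
    result = {}
    for rows in nodes:
        for key, value in rows:
            shipped += 1  # one raw row crosses the network
            result[key] = max(value, result.get(key, value))
    return result, shipped


def partial_then_merge(nodes):
    """Each node rolls up its own groups; the coordinator only merges."""
    shipped = 0
    result = {}
    for rows in nodes:
        local = {}
        for key, value in rows:
            local[key] = max(value, local.get(key, value))
        shipped += len(local)  # one partial result per local group
        for key, value in local.items():
            result[key] = max(value, result.get(key, value))
    return result, shipped


if __name__ == "__main__":
    print(coordinator_only(NODES))
    print(partial_then_merge(NODES))
```

Both functions return the same groups; the difference is only in how much is 
shipped to the coordinator and where the rolling up happens.  The sketch says 
nothing about timeouts or about how such a plan would be orchestrated.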


> Add support for Group By to Select statement
> --------------------------------------------
>
>                 Key: CASSANDRA-10707
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10707
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: CQL
>            Reporter: Benjamin Lerer
>            Assignee: Benjamin Lerer
>
> Now that Cassandra supports aggregate functions, it makes sense to support 
> {{GROUP BY}} in {{SELECT}} statements.
> It should be possible to group either at the partition level or at the 
> clustering column level.
> {code}
> SELECT partitionKey, max(value) FROM myTable GROUP BY partitionKey;
> SELECT partitionKey, clustering0, clustering1, max(value) FROM myTable
> GROUP BY partitionKey, clustering0, clustering1;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
