[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324146#comment-14324146
 ] 

Cristian O commented on CASSANDRA-4914:
---------------------------------------

A couple of thoughts:

- Doing aggregations on the coordinator is clearly not feasible in the real 
world beyond some toy use cases. I don't know the internals, but it should be 
doable to push the aggregation function down to the partitions without 
requiring the data interface to understand CQL. Note that *all* of these 
aggregate functions are eminently parallelizable, including AVG, which can be 
computed from SUM and COUNT over the same elements. As someone pointed out 
before, these are all REDUCE-type functions (or monoids, if you like).

- Dealing with consistency is tricky, but Cassandra is eventually consistent 
by design, so why not have eventually consistent aggregations? Just pick a 
partition and aggregate on that. With large datasets, an average differing at 
the sixth decimal place won't really matter. Or, if you want to be really 
fancy, compute on every partition (or a quorum of them) and return results 
with a tolerance factor.
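
To make the two points above concrete, here is a minimal Python sketch. It is 
not Cassandra code: the Partial type, its methods, and avg_with_tolerance are 
hypothetical names made up for illustration. Each node folds its local values 
into a small mergeable state, the states are merged, and AVG is derived from 
SUM and COUNT at the end. The "tolerance" shown is only one possible reading 
of the idea (half the spread between per-replica averages).

```python
from dataclasses import dataclass
from functools import reduce
import math


@dataclass(frozen=True)
class Partial:
    """Mergeable aggregate state for one chunk of values (hypothetical type)."""
    count: int = 0
    total: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, value):
        # Fold a single value into the state.
        return Partial(self.count + 1, self.total + value,
                       min(self.lo, value), max(self.hi, value))

    @staticmethod
    def of(values):
        # Fold all of one node's local values into a single partial state.
        return reduce(Partial.add, values, Partial())

    def merge(self, other):
        # Associative, with Partial() as the identity: the monoid property.
        return Partial(self.count + other.count, self.total + other.total,
                       min(self.lo, other.lo), max(self.hi, other.hi))

    def avg(self):
        # AVG is never merged directly; it is derived from SUM and COUNT.
        return self.total / self.count if self.count else None


def avg_with_tolerance(replica_partials):
    """Return (midpoint, half-spread) of the per-replica averages.

    Assumes at least one replica returned a non-empty partial state.
    """
    avgs = [p.avg() for p in replica_partials if p.count]
    return (min(avgs) + max(avgs)) / 2, (max(avgs) - min(avgs)) / 2


# Salary values from the issue description, split across two made-up nodes.
node_a = Partial.of([10.1, 100.0])
node_b = Partial.of([1000.0])
merged = node_a.merge(node_b)
print(merged.count, merged.lo, merged.hi, merged.avg())
```

Because merge is associative, the partial states can be combined in any 
grouping, so only a few numbers per node would need to cross the network 
rather than the rows themselves.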

Maybe it's useful to target this feature at use cases that need fast, simple 
aggregates on large amounts of data, for example charts on time series.

For more complex analytics, Spark on top of Cassandra is already an excellent 
solution if it's set up correctly in terms of colocation. This feature would 
help in use cases where Spark is too much overhead.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (wide row of column 
> values of int, double, float, etc.),
> with some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc. (for 
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
