[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323798#comment-14323798
 ] 

Robert Stupp commented on CASSANDRA-4914:
-----------------------------------------

// flame off
You could use Spark instead of Hadoop ;)

You're right - aggregates are computed by the coordinator, which has to pull 
all rows and run the computation on them. That's (unfortunately) all we can do 
right now. If an aggregate is applied to several partitions, or to an even 
bigger data set, query time grows in direct proportion to the number of 
partitions involved (sounds better than _getting slower_).

I have been thinking about a way to let the other nodes ("owners of other 
partitions") take part in the aggregate calculation. But that implies that the 
other nodes _know_ about the aggregate - i.e. basically the actual CQL. That 
means the approach *could* be a two-stage aggregate, where the first stage runs 
on the partitions and a second (final) stage runs on the partial results from 
the first stage. But the current "storage protocol" does not allow us to do 
that - it only lets us fetch _raw data_. Such an approach might also improve 
edge cases that require ALLOW FILTERING, which basically do the same thing 
(pipe all data to the coordinator and filter there).
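
To make the two-stage idea a bit more concrete, here is a rough sketch in 
plain Python. It is not Cassandra code and does not reflect any existing 
protocol: the names PartialAvg, stage_one and stage_two, as well as the toy 
data, are made up for illustration. It only shows that AVG can be split into a 
per-partition partial state (sum and count) and a final merge step, so that 
partial results rather than raw rows would travel to the coordinator.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PartialAvg:
    """Partial AVG state (hypothetical): mergeable without the raw rows."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "PartialAvg") -> "PartialAvg":
        # Merging is associative, so partials can be combined in any order.
        return PartialAvg(self.total + other.total, self.count + other.count)

    def finish(self) -> float:
        if self.count == 0:
            raise ValueError("AVG over zero rows")
        return self.total / self.count


def stage_one(values: Iterable[float]) -> PartialAvg:
    """First stage (assumed to run on the node that owns the partition)."""
    state = PartialAvg()
    for value in values:
        state = state.merge(PartialAvg(value, 1))
    return state


def stage_two(partials: Iterable[PartialAvg]) -> float:
    """Final stage (assumed to run on the coordinator) over partials only."""
    state = PartialAvg()
    for partial in partials:
        state = state.merge(partial)
    return state.finish()


# Toy data: column values per partition, each held by a different node.
partitions = {
    "node_a": [10.0, 100.0, 1000.0],
    "node_b": [20.0, 30.0],
}

partials = [stage_one(values) for values in partitions.values()]
result = stage_two(partials)

# What the coordinator does today, conceptually: pull every row, then compute.
all_rows = [v for values in partitions.values() for v in values]
naive = sum(all_rows) / len(all_rows)
```

In this sketch the coordinator receives one small PartialAvg per partition 
instead of every row. SUM, COUNT, MIN and MAX would split the same way; an 
aggregate whose state cannot be merged like this would still need the raw data.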

Your approach looks interesting (although I'm not a statistics guru). However, 
I'm not sure what's meant by _first record_ or _smart sampling_, since there's 
no such thing as ordering by partition key. Don't get me wrong - I'm interested 
in that.

> Aggregation functions in CQL
> ----------------------------
>
>                 Key: CASSANDRA-4914
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Vijay
>            Assignee: Benjamin Lerer
>              Labels: cql, docs
>             Fix For: 3.0
>
>         Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
> CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt
>
>
> The requirement is to do aggregation of data in Cassandra (Wide row of column 
> values of int, double, float etc).
> With some basic aggregate functions like AVG, SUM, Mean, Min, Max, etc (for 
> the columns within a row).
> Example:
> SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;
>
>  empid | deptid | first_name | last_name | salary
> -------+--------+------------+-----------+--------
>    130 |      3 |     joe    |     doe   |   10.1
>    130 |      2 |     joe    |     doe   |    100
>    130 |      1 |     joe    |     doe   |  1e+03
>  
> SELECT sum(salary), empid FROM emp WHERE empID IN (130);
>
>  sum(salary) | empid
> -------------+--------
>    1110.1    |  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
