[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-18 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326112#comment-14326112
 ] 

Robert Stupp commented on CASSANDRA-4914:
-

I've opened CASSANDRA-8862 + CASSANDRA-8827. Discussion about improvements 
sbhould go there.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-18 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326012#comment-14326012
 ] 

Benedict commented on CASSANDRA-4914:
-

I'm with Cristian here, as I suggested at the NGCC last year. If we want 
efficient aggregations, they should absolutely be performed at the replicas. I 
realise we're not aiming for that first time around, but IMO it should be the 
long term goal. Shipping all of your data over the wire is a pretty significant 
cost and bottleneck, making the current implementation more of a convenience 
than an analytic tool.

It's possible to perform conflict resolution a few ways. Probably the best is 
to first let the user specify if they care (CL=ONE is not exactly an uncommon 
usecase, last I heard we reckon 30% of deployments use this. esp. for analytics 
queries slight staleness may not be important), and if they do perform a 
repair-aware read from each neighbour to ensure the replica is up-to-date. Or 
calculate the result optimistically, along with a checksum and perform the 
repair if either don't match. Or select your strategy based on if the data has 
been updated recently (say, last few minutes), and if it has be pessimistic, 
and otherwise be optimistic. This is largely what [~tjake]'s Repair Aware 
Consistency Levels (CASSANDRA-7168) is about.

Generally, analytics queries are intended to be run over large, _majority_ 
static datasets, so the computation should be optimised for this IMO. There is 
of course the complication of supporting deterministic aggregations over 
multiple partitions, which would probably have to fallback to coordinator level 
aggregation for operations that cannot be trivially composed exactly (e.g. 
median), but most aggregations can be composed from partial computations 
trivially.

The provision of a sampled approach seems like another excellent idea to me, 
but an orthogonal one. The calculation should probably still be offloaded to 
each node, then combined probabilistically. This would also support efficient 
multi-partition queries for all aggregations.

I'm not saying any of these are trivial undertakings, but they should be what 
we're aiming for AFAICT.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Cristian O (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324355#comment-14324355
 ] 

Cristian O commented on CASSANDRA-4914:
---

To be clear I'm not talking about distributed processing. I'm talking about 
online analitical queries, particularly time series. Incidentally, I don't know 
how many people realize this but Cass has the same distributed storage 
architecture that Vertica has. It's also possible to map a columnar schema on 
top of sstable. Of course native support for columnar storage would be 
immensely better.  See CASSANDRA-7447

There's a lot of opportunity in this space, just need a bit of vision.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324408#comment-14324408
 ] 

Tyler Hobbs commented on CASSANDRA-4914:


bq. I don't know the internals but it should be doable to push the aggregation 
function to the partitions without requiring the data interface to understand 
CQL.

The problem with pushing aggregate calculation down to the replicas is that 
there's no conflict resolution.  So the aggregation can be computed over stale 
or deleted data.  That may be acceptable if you're reading at consistency level 
ONE, but then we're dealing with a limited, special case.

bq. Note that all agg functions are eminently parallelizible

I don't believe this is true.  Off the top of my head, computing the median of 
a dataset is not really parallelizable (without some sort of internode 
communication).

bq. dealing with consistency is tricky but then Cassandra is by design 
eventually consistent so why not have eventually consistent aggregations. Just 
pick a partition and aggregate on that. With large datasets an average 
differing at the sixth decimal won't really matter.

That may be acceptable for aggregates like average, but other aggregates may 
require precision.

With all of that said, I wouldn't necessarily be opposed to supporting 
selecting a sampling of data from a table (and allowing an aggregate to be run 
over that), but I suggest opening a new ticket for that discussion.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Anton Slutsky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324266#comment-14324266
 ] 

Anton Slutsky commented on CASSANDRA-4914:
--

Robert Stupp, you are absolutely right!  Somehow, I forgot we are hashing!  
This is cool.  I'll look into the code more carefully, but if hash buckets are 
narrow enough, maybe true random sampling can be assumed.  If that's the case, 
then this is not that difficult to do.

Also, please dont add distributed calculation facilities to Cassandra! :)  It 
will turn it into a poor-man's hadoop and either you guys will have to support 
a real distributed processing system or everyone will be pissed off that 
Map/Reduce on Cassandra doesnt work like it does on Hadoop :-)

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323798#comment-14323798
 ] 

Robert Stupp commented on CASSANDRA-4914:
-

// flame off
You could use Spark instead of Hadoop ;)

You're right - computation of aggregates is done by the coordinator that has to 
pull all rows and do computation on it. That's (unfortunately) what we can do 
now. If aggregates are applied to some partitions or on an even bigger data 
set, performance is directly proportional to the number of involved partitions 
(sounds better than _getting slower_).

I have been thinking about a method to let the other nodes (owners of other 
partitions) take part in aggregate calculation. But that implies that the 
other nodes _know_ about the aggregate - i.e. basically the actual CQL. Means: 
the approach *could* be a two-stage aggregate, where the first stage runs on 
the partitions and a second (final) stage runs on the partial results from the 
first stage. But the current storage protocol does not allow us to do that - 
it just allows to grab _raw data_. Such an approach might also improve edge 
cases that require ALLOW FILTERING, which basically do the same (pipe all data 
to the coordinator and filter in the coordinator).

Your approach looks interesting (although I'm not a statistics guru). Although 
I'm not sure what's meant by _first record_ or _smart sampling_ since there's 
nothing like ordering by partition key. Don't get me wrong - I'm interested in 
that.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Ahmet AKYOL (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323933#comment-14323933
 ] 

Ahmet AKYOL commented on CASSANDRA-4914:


As [~jbellis] said it's a different yet interesting case. IMHO, some 
probabilistic data structures can be implemented like bloom filter.  A library 
like [stream-lib|https://github.com/addthis/stream-lib] can be used (by the 
way, they say that, [~jbellis]'s blog post about bloom filters has inspired 
them ). Hyperloglog can also be useful,just look [Redis's 
implementation|http://redis.io/commands#hyperloglog].

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-17 Thread Cristian O (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324146#comment-14324146
 ] 

Cristian O commented on CASSANDRA-4914:
---

A couple of thoughts:

- doing aggregations on the coordinator is clearly not feasible in the real 
world beyond some toy use cases. I don't know the internals but it should be 
doable to push the aggregation function to the partitions without requiring the 
data interface to understand CQL. Note that *all* agg functions are eminently 
parallelizible including AVG which obviously can be computed from SUM/COUNT on 
the same elements. As someone pointed out before these are all REDUCE type 
functions (or monoids if you like)

- dealing with consistency is tricky but then Cassandra is by design eventually 
consistent so why not have eventually consistent aggregations. Just pick a 
partition and aggregate on that. With large datasets an average differing at 
the sixth decimal won't really matter. Or if you want to be really fancy 
compute on every (or quorum) partitions and return results with a tolerance 
factor. 

Maybe it's useful to target this feature at use cases that need fast simple 
aggregates on large amounts of data like for example charts on time series.

For more complex analytics Spark on top of Cass is actually an excellent 
solution already if it's setup correctly in terms of colocation. This would 
help use cases when Spark is too much of an overhead. 

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-16 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323676#comment-14323676
 ] 

Jonathan Ellis commented on CASSANDRA-4914:
---

That's really a different use case.  It could be an interesting one, but I'm 
not sure the infrastructure is there yet to support it.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-16 Thread Anton Slutsky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323716#comment-14323716
 ] 

Anton Slutsky commented on CASSANDRA-4914:
--

I think, it may not be all that complicated, at least in some cases.  If we 
consider the avg function for example, any record in the resultset of interest 
has a non-zero probability of being exactly the average value, kind of by 
definition :-), and nothing prevents us from grabbing the very first record and 
looking at it from that point of view.  The key here is, of course, to figure 
out what that non-zero probability is, but that can also be approximated with 
some accuracy by sampling a little bit beyond the first record.  If we are 
smart about how we sample and if we have an idea as to how big the actual 
resultset is, reasonably close approximation of the average value can be 
achieved and the probability of it being the true average can be computed with 
common techniques. Along the same lines, sum can be thought of as an integral 
over the shape approximated by the avg, which can also be approximated with 
some probability of being correct.

Of course, there are many problems with the above from the statistical point of 
view.  For one, resultsets are often ordered in some way, so sampling cannot be 
assumed to be random, which is not good.  

Anyway, I dont know if this is the right use case, but I really need aggregate 
functions for what I'm trying to do and right now I have to fire up a hadoop 
cluster to get simple aggregates computed, which is a major pain and takes 
forever.

I'll give it a shot in my own code and see if I can come up with a reasonable 
approach.  Perhaps others will see this discussion and suggest some ideas.

Thanks,
Anton

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2015-02-16 Thread Anton Slutsky (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323669#comment-14323669
 ] 

Anton Slutsky commented on CASSANDRA-4914:
--

Hello all,

I noticed that some of the aggregate functions discussed on this thread made it 
into the trunk.  I'm a little concerned with the implementation.  It looks like 
aggregates, such as sum, avg, etc. are implemented in code by basically looping 
through the result set pages and computing the desired aggregates in code.  I'm 
worried that, since Cassandra is meant for large volumes of data, this is not 
at all a feasible implementation for real world cases.  I tried using avg on a 
more or less sizable dataset and observed two things -- first, my select 
statement would time out even with bumped up read timeout setting and second, 
CPU that's running the average computation is quite busy.

Obviously, there's only so much that can be done in terms of computing these 
aggregates without resorting to some sort of distributed computation framework, 
but I'd like to suggest a slightly different approach.  I wonder if we can just 
rethink how we think about aggregate functions in context of large data.  
Perhaps, what we could do is consider a probabilistic aggregates instead of raw 
computable ones?  That is, instead of striving to compute an aggregate on an 
entire resultset, maybe we can compute the aggregate with a stated probability 
of that aggregate being true.

For example:

select probabilistic_avg(my_col) from my_table;

would return something like a map:

{avg:101.1, prob:0.78}

where avg is our probabilistic avg and prob is the probability of it being 
what we say it is.

Of course, that wont be as good as the real thing, but it still has value in 
many cases, I think.  And it can be implemented in a scalable way with some 
scratch system tables.

I'm happy to give it a stab if this is of interest to anyone.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914-V5.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-10-06 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160217#comment-14160217
 ] 

Sylvain Lebresne commented on CASSANDRA-4914:
-

The last patch looks almost good to go (provided it's rebased) for me but there 
seems to be some confusion around the {{reset}} method for {{Selection}}. 
You've removed the abstract one in {{Selection}} itself, but both 
{{SimpleSelection}} and {{SelectionWithFunctions}} still have a {{reset}} 
method which, as far as I can tell, is never called. And while the one in 
{{SimpleSelection}} does nothing, the one in {{SelectionWithFunctions}} seems 
to do something. I think what you'd want is add back the {{reset}} method to 
{{Selection}} but actually call it in {{ResultSetBuilder.newRow/build}} (as a 
nit, you can then move the {{current = null}} in {{SimpleSelection}} in the 
{{reset}} method for consistency).

bq.  Postgres's approach to user-defined aggregation functions is almost 
exactly this: http://www.postgresql.org/docs/8.3/static/sql-createaggregate.html

I agree that it's where we want to go (there was some discussion around this at 
the bootcamp that ended up with the same conclusion but it sound like I forgot 
to update JIRA with said conclusions). That said, this ticket is mostly about 
the internal interface for aggregation functions and as far as I can tell, 
nothing in the attached patch limit us from going with a reduce-style 
approach for UDF (it will be in fact rather simple). So let's maybe leave the 
discussion on UDF in CASSANDRA-8053.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-10-03 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158400#comment-14158400
 ] 

Tyler Hobbs commented on CASSANDRA-4914:


MySQL and SqlServer also return the same type as the argument and suggest 
casting to handle overflows (although casting is limited in mysql). I think 
casting is simple enough and returning the same type as the argument is the 
least surprising behavior, so my vote is for sticking with that.  I agree that 
the behavior needs to be well-documented, though.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-10-03 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158569#comment-14158569
 ] 

Tyler Hobbs commented on CASSANDRA-4914:


I'm thinking a bit about making this compatible with with UDFs.  The problem 
with this approach is that it relies on state that's not visible to the 
aggregation functions.

An alternative that would be (more easily) compatible with UDFs is a 
reduce-style aggregation.  The reducer function takes two inputs: the current 
state and the next value.  You can optionally provide an initial state and a 
finalizer function that is called with the final state after reducing. UDTs, 
tuples, and collections should be sufficiently powerful to represent anything 
that's needed for state.

In fact, Postgres's approach to user-defined aggregation functions is almost 
exactly this: 
http://www.postgresql.org/docs/8.3/static/sql-createaggregate.html.  I think we 
could slightly simplify their approach by inferring the data types.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-10-03 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14158595#comment-14158595
 ] 

Robert Stupp commented on CASSANDRA-4914:
-

bq. UDTs, tuples, and collections should be sufficiently powerful to represent 
anything that's needed for state

Nice idea just to use existing (complex) types for state :)
Cool - then we could simply use two UDAF function bodies. One for row 
processing and one for result. Think that's easy to understand for everybody :)
Something like
{code}
CREATE AGGREGATION FUNCTION superDuperSum ( input double )
RETURNS double
[STATE fooBarUDT]
LANGUAGE javascript
REDUCER $$...$$
[INITIALIZER $$...$$]
[FINALIZER $$...$$]
{code}
(optionals enclosed with square brackets)

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-09-28 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151074#comment-14151074
 ] 

Robert Stupp commented on CASSANDRA-4914:
-

bq. aggregate functions do only return their input type ...

Hm - I still think that we should be able to handle integer overflows - if for 
example people get a negative result if summing just positive values, they'll 
complain about it.
Postgres for example [returns a bigint for int 
sums|http://www.postgresql.org/docs/9.1/static/functions-aggregate.html] - 
Oracle doesn't have any integer data type in tables (only that NUMBER data 
type).
(But if that input-type==output-type behavior is clearly documented and people 
are able to do e.g. a {{SELECT sum( (varint) myIntCol ) FROM ...}} then that's 
fine for me - but I'm not sold on this.)

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914-V4.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-09-14 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133128#comment-14133128
 ] 

Robert Stupp commented on CASSANDRA-4914:
-

Nice work, [~blerer] :)

* Regarding {{COUNT(\*)}}/{{COUNT(1)}} - can you change/extend that to also 
allow {{COUNT(any_column)}} or even {{COUNT(func_call)}}? The v3 patch does not 
handle this.
* Can you add a test to check whether an aggregate works against a UDF and vice 
versa? E.g. some sin function: {{SELECT sin(max(a)) FROM foo}} or {{SELECT 
max(sin(a)) FROM foo}}
* The aggregate functions do only return their input type. Means: overflows are 
not handled. It's not a problem for the {{BigDecimal}}/{{BigInteger}} 
implementations but for all the others. Maybe it's better to do all sum+avg on 
the {{BigDecimal}}/{{BigInteger}} types and always return them - although it's 
slower, the result will be correct. As a nice side effect it could be possible 
to reduce the number of avg/sum implementations and let them extend some few 
impls.
* The implementations of the aggregate functions are sometimes declared using 
{{public static final}} and sometimes using {{public static}} - nice to have: 
make them look equal


 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-09-14 Thread Benjamin Lerer (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14133395#comment-14133395
 ] 

Benjamin Lerer commented on CASSANDRA-4914:
---

bq. Nice work, Benjamin Lerer :-)

Thanks Robert.
 
bq.  Regarding COUNT(*)/COUNT(1) - can you change/extend that to also allow 
COUNT(any_column) or even COUNT(func_call)? The v3 patch does not handle this.

It is already there. See last line of AggregationTest.testFunctions()

bq.Can you add a test to check whether an aggregate works against a UDF and 
vice versa? E.g. some sin function: SELECT sin(max(a)) FROM foo or SELECT 
max(sin(a)) FROM foo

The first form is already there. See AggregationTest.testNestedFunctions(). I 
will add the second form.

bq. The aggregate functions do only return their input type. Means: overflows 
are not handled. It's not a problem for the BigDecimal/BigInteger 
implementations but for all the others. Maybe it's better to do all sum+avg on 
the BigDecimal/BigInteger types and always return them - although it's slower, 
the result will be correct. As a nice side effect it could be possible to 
reduce the number of avg/sum implementations and let them extend some few impls.

See first feed back from Sylvain

bq.The implementations of the aggregate functions are sometimes declared 
using public static final and sometimes using public static - nice to have: 
make them look equal

I will fix it.


 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914-V3.txt, 
 CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-09-12 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14132148#comment-14132148
 ] 

Sylvain Lebresne commented on CASSANDRA-4914:
-

The general approach lgtm but a bunch of remarks:
* I'm not sure about the sum and avg return type. For sum, using {{bigint}} 
whatever the input type is is wrong for at least {{varint}}. We could use 
{{varint}} as a return type instead, but I wonder if it's really that 
convenient. Typically, in the java driver, {{varint}} is mapped to 
{{BigInteger}} and I'm not sure always returning that even when summing {{int}} 
is what people wants in general. What I suggest is to have {{sum}} and {{avg}} 
method always return the same type than their argument, and that we allow users 
to type-cast the argument if they want a greater precision that the input type 
(from a quick look at their doc, that seems to be what oracle does, haven't 
checked other).
* I'd rather get rid of the specific code for {{COUNT(*)}} in SelectStatement. 
It sounds to me that we could just have a specific no-arg function for that (we 
could still have some special casing in the parser).
* In {{ScalarFunctionSelector.isAggregate}}, we shouldn't assume that the 
function has one argument, it can have none (this is currently breaking a 
UFTest unit test). Also, I believe something like:
  {noformat}
  SELECT max(a), add(max(b), c)  FROM ..
  {noformat}
  is not rejected (because {{ScalarFunctionSelector}} only even check it's 
first argument). I'll note though that it could make sense to allow:
  {noformat}
  SELECT max(a), log(max(b), 2)  FROM ..
  {noformat}
  so maybe we should (not that it's a big deal if we don't, but unless I'm 
wrong, it's reasonably trivial to so ...).  As an aside, a comment on why 
{{ScalarFunctionSector}} don't just return true for {{isAggregate}} could be 
useful here.
* Since Selector has a reset() method, why not use that to reset the 
{{current}} value for {{SimpleSelection}} and others?
* I'd have a preference for making ScalarFunction/AggregateFunction be abstract 
classes and have them implement isAggregate (instead of having the 
NativeScalarFunction/NativeAggretationFunction).  
* Needs to switch a 'assertInvalid' to 'assertInvalidSyntax' in 
{{AggregationTest.testInvalidCalls}}.

A couple of nitpicks too:
* There is a typo in the error message in 
{{AggregateFunctionSelector.validateArguments}}.
* Regarding code style, we don't put @Override when implementing an interface 
(AggregateFcts is full of them).
* I believe {{Selection.reset()}} is unused. Maybe we can remove it for now 
until we actually need it.
* I'd shorten {{SelectionWithFunctions.validateSelectors}} to something like
  {noformat}
  int aggregations = 0;
  for (Select s : selectors)
  if (s.isAggregate())
  ++aggregations;

  if (aggregations != 0  aggregations != selectors.size())
  throw new InvalidRequestException(the select clause must either contains 
only aggregations or none);
  {noformat}
  Feels like too many ifs for what it does currently.


 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914-V2.txt, CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-08-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112098#comment-14112098
 ] 

Sylvain Lebresne commented on CASSANDRA-4914:
-

I have not yet reviewed this, but at very quick glance, I see nothing that 
handles the paging of aggregations: we don't want to load everything in memory 
with a single query. I'll note that we already support one hard-coded 
aggregation function, COUNT, and you can look at how it's handled by 
SelectStatement to see what I mean by paging internally. Ideally the patch 
for this should replace this currently hardcoded COUNT to use the general 
mechanism for aggregation. Also, we also want to support syntax like {{SELECT 
AVG(*) FROM ...}} which I don't think this patch handles (since it doesn't 
modify the parser and it would need to).

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-08-27 Thread Brian Hess (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112192#comment-14112192
 ] 

 Brian Hess commented on CASSANDRA-4914:


[~slebresne] - what does SELECT AVG(*) FROM supposed to mean?  If I have a 
table called MYDATA with schema (pkey INT, ccol INT, x INT, y INT, PRIMARY KEY 
((pkey), ccol)), then what would SELECT AVG(*) FROM mydata WHERE pkey=1 AND 
ccol=1 return?  I think you need to include a column to AVG, like SELECT 
AVG(x) FROM mydata WHERE pkey=1 AND ccol=1 or SELECT AVG(x), AVG(y) FROM 
mydata WHERE pkey=1 AND ccol=1.

COUNT() is a special case.  COUNT(*) really is the same as COUNT(1), because 
you are just counting the number of rows.  Note, though, that COUNT(x) is not 
the same as COUNT(*) since COUNT() will not count things that are NULL, so 
COUNT(*) will count all rows but COUNT(x) will only count rows where X is not 
NULL.  At least, according to SQL.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-08-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14112210#comment-14112210
 ] 

Sylvain Lebresne commented on CASSANDRA-4914:
-

bq.  what does SELECT AVG(\*) FROM supposed to mean?

Nothing, that was a brain fart, forget about that part.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql, docs
 Fix For: 3.0

 Attachments: CASSANDRA-4914.txt


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (CASSANDRA-4914) Aggregation functions in CQL

2014-08-18 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100672#comment-14100672
 ] 

Sylvain Lebresne commented on CASSANDRA-4914:
-

Updating the title to refect that for this ticket we'll stick to aggregation 
functions. We've already introduced some UDFs in CASSANDRA-7395 so we'll want 
to try to reuse as much as possible here in term of syntax, though obviously 
the interface for aggregation functions will be slightly different.

 Aggregation functions in CQL
 

 Key: CASSANDRA-4914
 URL: https://issues.apache.org/jira/browse/CASSANDRA-4914
 Project: Cassandra
  Issue Type: New Feature
Reporter: Vijay
Assignee: Benjamin Lerer
  Labels: cql
 Fix For: 3.0


 The requirement is to do aggregation of data in Cassandra (Wide row of column 
 values of int, double, float etc).
 With some basic agree gate functions like AVG, SUM, Mean, Min, Max, etc (for 
 the columns within a row).
 Example:
 SELECT * FROM emp WHERE empID IN (130) ORDER BY deptID DESC;  
   
  empid | deptid | first_name | last_name | salary
 ---+++---+
130 |  3 | joe| doe   |   10.1
130 |  2 | joe| doe   |100
130 |  1 | joe| doe   |  1e+03
  
 SELECT sum(salary), empid FROM emp WHERE empID IN (130);  
   
  sum(salary) | empid
 -+
1110.1|  130



--
This message was sent by Atlassian JIRA
(v6.2#6252)