[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

Jira Fri, 18 Nov 2022 15:09:06 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636066#comment-17636066
 ]


Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:08 PM:
--------------------------------------------------------------------------

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that if we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.


was (Author: adelapena):
Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.

> Add aggregation scalar functions on collections
> -----------------------------------------------
>
>                 Key: CASSANDRA-18060
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: CQL/Semantics
>            Reporter: Andres de la Peña
>            Assignee: Andres de la Peña
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list<int>, m map<int, int>);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> --------------------+----------------------
>           {1, 2, 3} |         [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> ----------------------------+----------------------------
>                           3 |                          3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | system.collection_max(l)
> --------------------------+--------------------------
>                         1 |                        3
> > SELECT collection_sum(l), collection_avg(l) FROM t;
>  system.collection_sum(l) | system.collection_avg(l)
> --------------------------+--------------------------
>                         6 |                        2
> {code}
> Note that this type of aggregation is different from the kind of aggregation 
> provided by {{min}}, {{max}}, {{sum}} and {{avg}}, which aggregate entire 
> collections across rows. Here we only aggregate the items of a collection row 
> per row.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

Reply via email to