[ https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17636066#comment-17636066 ]
Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:08 PM:
--------------------------------------------------------------------------

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782] [j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Unlike [the prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1] mentioned during CASSANDRA-17811, the proposed PR uses the existing aggregation functions available at {{AggregateFcts}} as the underlying implementation of {{collection_min}}, {{collection_max}}, {{collection_sum}} and {{collection_avg}}. That way we avoid code duplication and make sure that the functions are consistent.

However, that consistency means that we inherit the design decisions taken for those functions. The most notable ones, IMO, are:
* {{sum}} and {{collection_sum}} return a value of the same type as the added values, so any numeric type except {{decimal}} and {{varint}} can overflow.
* {{avg}} and {{collection_avg}} return a value of the same type as the input values, so for example the average of the integers 1 and 2 is 1 instead of 1.5.
* {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead of the more correct {{NaN}}.

I guess that if we are unhappy with those behaviours we could have followup tickets to try to improve them in both the across-rows and the across-collection-items functions.
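To make the three inherited behaviours concrete, here is a minimal standalone Java sketch. It is not the code in {{AggregateFcts}} or in the PR; the class {{CollectionAggSketch}} and its methods are hypothetical names that only mimic the semantics described in the comment for a collection of {{int}} values.

```java
import java.util.List;

// Hypothetical illustration only: mimics the described semantics for int
// collections, it is NOT Cassandra's actual implementation.
public class CollectionAggSketch {

    // The sum keeps the element type (int), so it silently wraps on overflow.
    static int sum(List<Integer> items) {
        int total = 0;
        for (int item : items) {
            total += item; // int arithmetic wraps around on overflow
        }
        return total;
    }

    // The average also keeps the element type, so the fraction is truncated.
    // An empty collection yields 0 rather than NaN.
    static int avg(List<Integer> items) {
        if (items.isEmpty()) {
            return 0;
        }
        long total = 0; // wider accumulator so only the result is truncated
        for (int item : items) {
            total += item;
        }
        return (int) (total / items.size());
    }

    public static void main(String[] args) {
        System.out.println(sum(List.of(1, 2, 3)));              // 6
        System.out.println(sum(List.of(Integer.MAX_VALUE, 1))); // -2147483648
        System.out.println(avg(List.of(1, 2)));                 // 1
        System.out.println(avg(List.of()));                     // 0
    }
}
```

A followup that changed these semantics would presumably need a wider or floating-point return type, which is why the comment suggests handling it in separate tickets for both families of functions.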
> Add aggregation scalar functions on collections
> -----------------------------------------------
>
>                 Key: CASSANDRA-18060
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: CQL/Semantics
>            Reporter: Andres de la Peña
>            Assignee: Andres de la Peña
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by
> CASSANDRA-17811 can be used to provide within-collection aggregation
> functions. We can use that mechanism to add new CQL functions to get:
> * The number of items in a collection.
> * The max/min items of a collection.
> * The sum/avg of the items of a numeric collection.
> * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list<int>, m map<int, int>);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
>
> SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> --------------------+----------------------
>           {1, 2, 3} |         [10, 20, 30]
>
> SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> ----------------------------+----------------------------
>                           3 |                          3
>
> SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | system.collection_max(l)
> --------------------------+--------------------------
>                         1 |                        3
>
> SELECT collection_sum(l), collection_avg(l) FROM t;
>  system.collection_sum(l) | system.collection_avg(l)
> --------------------------+--------------------------
>                         6 |                        2
> {code}
> Note that this type of aggregation is different from the kind of aggregation
> provided by {{min}}, {{max}}, {{sum}} and {{avg}}, which aggregate entire
> collections across rows. Here we only aggregate the items of a collection row
> per row.