[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

2022-11-22 Thread Jira


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637255#comment-17637255
 ] 

Andres de la Peña edited comment on CASSANDRA-18060 at 11/22/22 1:22 PM:
-

Thanks for the review.

The test failure [can be reproduced on 
trunk|https://app.circleci.com/pipelines/github/adelapena/cassandra/2511/workflows/7aba8baa-0a6d-404a-b08b-c6a8078caca3/jobs/24706/tests]
 with the multiplexer. Just opened CASSANDRA-18065 for it.

I have added the suggested JavaDoc and documentation for the already existing 
{{sum}} and {{avg}} functions, and for the new {{collection_sum}} and 
{{{}collection_avg{}}}.

Regarding followup tickets, I can add one for {{avg}} returning {{NaN}} instead 
of zero, but I don't know if it's too late for that since the current behaviour 
has been around for ages.

As for overflows and truncated decimals, we'll definitively need a followup 
ticket. The {{sum}} and {{avg}} functions can walk around this by using type 
casting, for example:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT sum(cast(v AS varint)) FROM t;
SELECT avg(cast(v AS float)) FROM t;
{code}
However, we don't have such type of casting for collections, so we could have a 
followup ticket for adding that feature, so we can do something like:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, v list);
SELECT collection_sum(cast(v AS list)) FROM t;
SELECT collection_avg(cast(v AS lsit)) FROM t;
{code}
CC [~blerer]


was (Author: adelapena):
Thanks for the review.

The test failure [can be reproduced on 
trunk|https://app.circleci.com/pipelines/github/adelapena/cassandra/2511/workflows/7aba8baa-0a6d-404a-b08b-c6a8078caca3/jobs/24706/tests]
 with the multiplexer. Just opened CASSANDRA-18065 for it.

I' have added the suggested JavaDoc and documentation for the already existing 
{{sum}} and {{avg}} functions, and for the new {{collection_sum}} and 
{{{}collection_avg{}}}.

Regarding followup tickets, I can add one for {{avg}} returning {{NaN}} instead 
of zero, but I don't know if it's too late for that since the current behaviour 
has been around for ages.

As for overflows and truncated decimals, we'll definitively need a followup 
ticket. The {{sum}} and {{avg}} functions can walk around this by using type 
casting, for example:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, v int);
SELECT sum(cast(v AS varint)) FROM t;
SELECT avg(cast(v AS float)) FROM t;
{code}
However, we don't have such type of casting for collections, so we could have a 
followup ticket for adding that feature, so we can do something like:
{code:java}
CREATE TABLE t (k int PRIMARY KEY, v list);
SELECT collection_sum(cast(v AS list)) FROM t;
SELECT collection_avg(cast(v AS lsit)) FROM t;
{code}
CC [~blerer]

> Add aggregation scalar functions on collections
> ---
>
> Key: CASSANDRA-18060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL/Semantics
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list, m map);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> +--
>   {1, 2, 3} | [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> +
>   3 |  3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | system.collection_max(l)
> --+--
> 1 |3
> > SELECT collection_sum(l), collection_avg(l) FROM t;
>  system.collection_sum(l) | system.collection_avg(l)
> --+--
> 6 |2
> {code}
> Note that this type of aggregation is different from the kind of aggregation 
> provided by {{min}}, {{max}}, {{sum}} and {{avg}}, which aggregate entire 
> collections across rows. Here we only aggregate the items of a 

[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

2022-11-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636066#comment-17636066
 ] 

Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:10 PM:
--

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that if we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collection-items 
functions.


was (Author: adelapena):
Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that if we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.

> Add aggregation scalar functions on collections
> ---
>
> Key: CASSANDRA-18060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL/Semantics
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list, m map);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> +--
>   {1, 2, 3} | [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> +
>   3 |  3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | 

[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

2022-11-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636066#comment-17636066
 ] 

Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:08 PM:
--

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that if we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.


was (Author: adelapena):
Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.

> Add aggregation scalar functions on collections
> ---
>
> Key: CASSANDRA-18060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL/Semantics
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list, m map);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> +--
>   {1, 2, 3} | [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> +
>   3 |  3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | 

[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

2022-11-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636066#comment-17636066
 ] 

Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:06 PM:
--

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
 * {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.
 * {{avg}} and {{collection_avg}} return a value of the same type as the input 
value, so for example the average of integers 1 and 2 is 1, instead of 1.5.
 * {{avg}} and {{collection_avg}} return 0 for an empty list of values, instead 
of the more correct {{{}NaN{}}}.

I guess that we are unhappy with those behaviours we could have followup 
tickets to try to improve them in both across-rows and across-collections-items 
functions.


was (Author: adelapena):
Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
* {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.

> Add aggregation scalar functions on collections
> ---
>
> Key: CASSANDRA-18060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL/Semantics
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list, m map);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> +--
>   {1, 2, 3} | [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> +
>   3 |  3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | system.collection_max(l)
> --+--
> 1 |3
> > SELECT collection_sum(l), collection_avg(l) FROM t;
>  system.collection_sum(l) | system.collection_avg(l)
> --+--
> 6 |2
> {code}
> Note that this type of aggregation is different from the kind of aggregation 
> 

[jira] [Comment Edited] (CASSANDRA-18060) Add aggregation scalar functions on collections

2022-11-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636066#comment-17636066
 ] 

Andres de la Peña edited comment on CASSANDRA-18060 at 11/18/22 11:00 PM:
--

Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent. However, that 
consistency means that we inherit the design decisions taken for those 
functions. The more remarkable ones IMO are
* {{sum}} and {{collection_sum}} return a value of the same type as the added 
values, so any numeric value but {{decimal}} and {{varint}} can overflow.


was (Author: adelapena):
Here is the patch, and CI is running:
||PR||CI||
|[trunk|https://github.com/apache/cassandra/pull/2024]|[j8|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/92f054d7-9386-498f-9ba4-330181cd4782]
 
[j11|https://app.circleci.com/pipelines/github/adelapena/cassandra/2508/workflows/8a0838e8-ffbb-424d-a572-3770f9a41632]|

Differently to [the 
prototype|https://github.com/apache/cassandra/compare/trunk...adelapena:cassandra:17811-trunk-collections?expand=1]
 mentioned during CASSANDRA-17811, the proposed PR uses the existing 
aggregation functions available at {{AggregateFcts}} as the underlying 
implementation of {{{}collection_min{}}}, {{{}collection_max{}}}, 
{{collection_sum}} and {{{}collection_avg{}}}. That way we avoid code 
duplication and make sure that the functions are consistent.

> Add aggregation scalar functions on collections
> ---
>
> Key: CASSANDRA-18060
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18060
> Project: Cassandra
>  Issue Type: New Feature
>  Components: CQL/Semantics
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The new mechanism for dynamically building native functions introduced by 
> CASSANDRA-17811 can be used to provide within-collection aggregation 
> functions. We can use that mechanism to add new CQL functions to get:
>  * The number of items in a collection.
>  * The max/min items of a collection.
>  * The sum/avg of the items of a numeric collection.
>  * The keys or the values of a map.
> For example:
> {code:java}
> CREATE TABLE k.t (k int PRIMARY KEY, l list, m map);
> INSERT INTO t(k, l, m) VALUES (0, [1, 2, 3], {1:10, 2:20, 3:30});
> > SELECT map_keys(m), map_values(m) FROM t;
>  system.map_keys(m) | system.map_values(m)
> +--
>   {1, 2, 3} | [10, 20, 30]
> > SELECT collection_count(m), collection_count(l) FROM t;
>  system.collection_count(m) | system.collection_count(l)
> +
>   3 |  3
> > SELECT collection_min(l), collection_max(l) FROM t;
>  system.collection_min(l) | system.collection_max(l)
> --+--
> 1 |3
> > SELECT collection_sum(l), collection_avg(l) FROM t;
>  system.collection_sum(l) | system.collection_avg(l)
> --+--
> 6 |2
> {code}
> Note that this type of aggregation is different from the kind of aggregation 
> provided by {{min}}, {{max}}, {{sum}} and {{avg}}, which aggregate entire 
> collections across rows. Here we only aggregate the items of a collection row 
> per row.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org