Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Josh McKenzie
ta type is consistent, easy to do and easy to 
>>>>>>>>>>>>>>>> understand. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don't see the need to add rules explicitly forbidding some 
>>>>>>>>>>>>>>>> data types on MIN/MAX functions just because we can't easily 
>>>>>>>>>>>>>>>> figure out a use case for their ordering. Especially when we 
>>>>>>>>>>>>>>>> are exposing that sa

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Benedict
r 
>>>>>>>>>>>>>>>> collections we could change it to return the aggregated max 
>>>>>>>>>>>>>>>> across all rows. Or we just leave it as an error and if 
>>>>>>>>>>>>>>>> someone wants the max across all rows they would ask for 
>>>>>>>>>>>>>>>> MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX 
>>>>>>>>>>>>>>>> may be a bad name.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> > On Dec 6, 2022, at 11:55 AM, Benedict  
>>>>>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>>>>> > 
>>>>>>>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Andrés de la Peña
in storage 
>>>>>>>> format
>>>>>>>> would be affected - which we can probably treat as zero.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan <
>>>>>>>> jeremiah.jor...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> 
>>>>>>>> >>>
>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that
>>>>>>>> operates over rows containing collections, returning the collection 
>>>>>>>> with
>>>>>>>> the most elements. This is just a nonsensical operation to support 
>>>>>>>> IMO. We
>>>>>>>> should decide as a community whether we “fix” this aggregation, or 
>>>>>>>> remove
>>>>>>>> it.
>>>>>>>> >>
>>>>>>>> >> The current MAX function does not work this way afaik?  It
>>>>>>>> returns the row with the column that has the highest value in 
>>>>>>>> clustering
>>>>>>>> order sense, like if the collection was used as a clustering key.  
>>>>>>>> While
>>>>>>>> that also may have limited use, I don’t think it worth while to 
>>>>>>>> deprecate
>>>>>>>> such use and all the headache that comes with doing so.
>>>>>>>> >>
>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for
>>>>>>>> discovery, and all-else equal it would be better to use MAX,MIN, etc, 
>>>>>>>> same
>>>>>>>> as for aggregations.
>>>>>>>> >>
>>>>>>>> >> If we actually wanted to move towards using the existing names
>>>>>>>> with new meanings, then I think that would take us multiple major
>>>>>>>> releases.  First deprecate existing use in current releases.  Then 
>>>>>>>> make it
>>>>>>>> an error in the next major release X.  Then change the behavior in 
>>>>>>>> major
>>>>>>>> release X+1.  Just switching the behavior without having a major where 
>>>>>>>> such
>>>>>>>> queries error out would make a bunch of user queries start returning
>>>>>>>> “wrong” data.
>>>>>>>> >> Also I don’t think those functions being cross row aggregations
>>>>>>>> for some column types, but within row collection operations for other
>>>>>>>> types, is any more intuitive, and actually would be more confusing.  
>>>>>>>> So I
>>>>>>>> am -1 on using the same names.
>>>>>>>> >>
>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_
>>>>>>>> to operate over non-collection types when they are explicitly 
>>>>>>>> collection
>>>>>>>> variants.
>>>>>>>> >>
>>>>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Benedict
gt;>> > 
>>>>>>>>>>>>>> > As far as I am aware it has never worked in a release, and so 
>>>>>>>>>>>>>> > deprecating it is probably not as challenging as you think. 
>>>>>>>>>>>>>> > Only folk that have been able to parse the raw bytes of the 
>>>>>>>>>>>>>> > collection in storage format would be affected - which we can 
>>>>>>>>>>>>>> > probably treat as zero.
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan 
>>>>>>>>>>>>>> >>  wrote:
>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that 
>>>>>>>>>>>>>> >>> operates over rows containing collections, returning the 
>>>>>>>>>>>>>> >>> collection with the most elements. This is just a 
>>>>>>>>>>>>>> >>> nonsensical operation to support IMO. We should decide as a 
>>>>>>>>>>>>>> >>> community whether we “fix” this aggregation, or remove it.
>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>> >> The current MAX function does not work this way afaik?  It 
>>>>>>>>>>>>>> >> returns the row with the column that has the highest value in 
>>>>>>>>>>>>>> >> clustering order sense, like if the collection was used as a 
>>>>>>>>>>>>>> >> clustering key.  While that also may have limited use, I 
>>>>>>>>>>>>>> >> don’t think it worth while to deprecate such use and all the 
>>>>>>>>>>>>>> >> headache that comes with doing so.
>>>>>&g

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Benedict
nsensical 
>>>>>>>>>>>>> >>> operation to support IMO. We should decide as a community 
>>>>>>>>>>>>> >>> whether we “fix” this aggregation, or remove it.
>>>>>>>>>>>>> >> 
>>>>>>>>>>>>> >> The current MAX function does not work this way afaik?  It 
>>>>>>>>>>>>> >> returns the row with the column that has the highest value in 
>>>>>>>>>>>>> >> clustering order sense, like if the collection was used as a 
>>>>>>>>>>>>> >> clustering key.  While that also may have limited use, I don’t 
>>>>>>>>>>>>> >> think it worth while to deprecate such use and all the 
>>>>>>>>>>>>> >> headache that comes with doing so.
>>>>>>>>>>>>> >> 
>>>>>>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive 
>>>>>>>>>>>>> >>> for discovery, and all-else equal it would be better to use 
>>>>>>>>>>>>> >>> MAX,MIN, etc, same as for aggregations.
>>>>>>>>>>>>> >> 
>>>>>>>>>>>>> >> If we actually wanted to move towards using the existing names 
>>>>>>>>>>>>> >> with new meanings, then I think that would take us multiple 
>>>>>>>>>>>>> >> major releases.  First deprecate existing use in current 
>>>>>>>>>>>>> >> releases.  Then make it an error in the next major release X.  
>>>>>>>>>>>>> >> Then change the behavior in major release X+1.  Just switching 
>>>>>>>>>>>>> >> the behavior without having a major where such queries error 
>>>>>>>>>>>>> >> out would make a bunch of user queries st

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Claude Warren, Jr via dev
decide as a community whether we “fix” this aggregation, or remove it.
>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for
>>>>>>> discovery, and all-else equal it would be better to use MAX,MIN, etc, 
>>>>>>> same
>>>>>>> as for aggregations.
>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to
>>>>>>> operate over non-collection types when they are explicitly collection
>>>>>>> variants.
>>>>>>> >>>
>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains
>>>>>>> ambiguous, but this could be solved by either providing a separate 
>>>>>>> method
>>>>>>> for collections (e.g. SIZE) which seems fine to me, or by offering a
>>>>>>> precedence order for matching and a keyword for overriding the 
>>>>>>> precedence
>>>>>>> order (e.g. COUNT(collection AS COLLECTION)).
>>>>>>> >>>
>>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this
>>>>>>> can be solved several ways.
>>>>>>> >>> - We could permit explicit casts to collection types, that for a
>>>>>>> collection type would be a no-op, and for a single value would create a
>>>>>>> collection
>>>>>>> >>> - With precedence orders, by always selecting the scalar
>>>>>>> function last
>>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce
>>>>>>> function to resolve multiple values
>>>>>>> >>>
>>>>>>> >>> These decisions all imply trade-offs on each other, and affect
>>>>>>> the evolution of CQL, so I think community input would be helpful.
>>>>>>> >>>
>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <
>>>>>>> adelap...@apache.org> wrote:
>>>>>>> >>>>
>>>>>>> >>>> 
>>>>>>> >>>> This will require some long introduction for context:
>>>>>>> >>>>
>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with
>>>>>>> min/max column value according to their comparator. For collections, the
>>>>>>> comparison is on the lexicographical order of the collection elements.
>>>>>>> That's the very same comparator that is used when collections are used 
>>>>>>> as
>>>>>>> clustering keys and for ORDER BY.
>>>>>>> >>>>
>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make
>>>>>>> that the results were presented in their unserialized form, although the
>>>>>>> row selection was correct. That bug was recently solved by 
>>>>>>> CASSANDRA-17811.
>>>>>>> During that ticket it was also considered the option of simply disabling
>>>>>>> MIN/MAX on collection since applying those functions to collections, 
>>>>>>> since
>>>>>>> they don't seem super useful. However, that option was quickly discarded
>>>>>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>>>>>> every data type.
>>>>>>> >>>>
>>>>>>> >>>> As a byproduct of the internal improvements of that fix,
>>>>>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>>>>>> aggregations of the elements of a collection. Those where named 
>>>>>>> "map_keys",
>>>>>>> "map_values", "collection_min", "collection_max", "collection_sum", and
>>>>>>> "collection_count". Those are the names mentioned on the mail list 
>>>>>>> thread
>>>>>>> about function naming conventions. Despite doing a kind of
>>>>>>> within-collection aggregation, these functions are not what we usually 
>>>>>>> call
>>>>>>> aggregate functions, since they don't aggregate multiple rows together.
>>>>>>> >>>>
>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>>>>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>>>>>> However, the new collection functions can be used in combination with 
>>>>>>> the
>>>>>>> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
>>>>>>> ttl of a multi-cell column. Since the new functions give a generic way 
>>>>>>> of
>>>>>>> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
>>>>>>> proposed to remove that MAXWRITETIME function.
>>>>>>> >>>>
>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to
>>>>>>> remove that function in favour of the new generic collection functions.
>>>>>>> However, the MAXWRITETIME function can work on both single-cell and
>>>>>>> multi-cell columns, whereas "COLLECTION_MAX(WRITETIME(column))" would 
>>>>>>> only
>>>>>>> work on multi-cell columns, That's because MAXWRITETIME of a 
>>>>>>> not-multicell
>>>>>>> column doesn't return a collection, and one should simply use
>>>>>>> "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037 that
>>>>>>> collections functions applied to a not-collection value consider that 
>>>>>>> value
>>>>>>> as the only element of a singleton collection. So, for example,
>>>>>>> COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has already 
>>>>>>> been
>>>>>>> reviewed and it's mostly ready to commit.
>>>>>>> >>>>
>>>>>>> >>>> Now we can go straight to the point:
>>>>>>> >>>>
>>>>>>> >>>> Recently Benedict brought back the idea of deprecating
>>>>>>> aggregate functions applied to collections, the very same idea that was
>>>>>>> mentioned on CASSANDRA-17811 description almost four months ago. That 
>>>>>>> way
>>>>>>> we could rename the new collection functions MIN/MAX/SUM/AVG, same as 
>>>>>>> the
>>>>>>> classic aggregate functions. That way MIN/MAX/SUM/AVG would be an 
>>>>>>> aggregate
>>>>>>> function when applied to not-collection columns, and a scalar function 
>>>>>>> when
>>>>>>> applied to collection. We can't do that with COUNT because there would 
>>>>>>> be
>>>>>>> an ambiguity, so the proposal for that case is renaming 
>>>>>>> COLLECTION_COUNT to
>>>>>>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>>>>>>> proposal.
>>>>>>> >>>>
>>>>>>> >>>> I however would prefer to keep aggregate functions working on
>>>>>>> collections, and keep the names of the new collection functions as
>>>>>>> "COLLECTION_*". Reasons are:
>>>>>>> >>>>
>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be
>>>>>>> cosidered as breaking backward compatibility and require a deprecation
>>>>>>> plan.
>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might
>>>>>>> not look superuseful, but they make the set of aggregate functions
>>>>>>> consistent and applicable to every column type.
>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>>>>>> establishes a clear distinction between row aggregations and collection
>>>>>>> aggregations, while at the same time exposing the analogy between each 
>>>>>>> pair
>>>>>>> of functions.
>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>>>>>> workarounds such as using the column type when possible, or trying to
>>>>>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works 
>>>>>>> for
>>>>>>> this case, future functions can find more trouble when trying to figure 
>>>>>>> out
>>>>>>> workarounds to avoid clashing with existing function names. For 
>>>>>>> example, we
>>>>>>> might want to add a SIZE function that gets the size in bytes of any
>>>>>>> column, or we might want to add a MAX function that gets the maximum of 
>>>>>>> a
>>>>>>> set of columns, etc. And example of the synonym-based approach that 
>>>>>>> comes
>>>>>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>>>>>>> aggregation and GREATEST is for column aggregation.
>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type,
>>>>>>> we can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>>>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>>>>>> collection functions with time functions.
>>>>>>> >>>>
>>>>>>> >>>> What do others think? What should we do with aggregate
>>>>>>> functions on collections, collection functions and MAXWRITETIME?
>>>>>>> >>
>>>>>>>
>>>>>>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-09 Thread Andrés de la Peña
;>>> >> Also I don’t think those functions being cross row aggregations
>>>>>> for some column types, but within row collection operations for other
>>>>>> types, is any more intuitive, and actually would be more confusing.  So I
>>>>>> am -1 on using the same names.
>>>>>> >>
>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to
>>>>>> operate over non-collection types when they are explicitly collection
>>>>>> variants.
>>>>>> >>
>>>>>> >> While I could see some point to this, I do not think it would be
>>>>>> confusing for something named collection_XXX to treat a non-collection 
>>>>>> as a
>>>>>> collection of 1.  But maybe there is a better name for these function.
>>>>>> Rather than seeing them as collection variants, we should see them as
>>>>>> variants that operate on the data in a single row, rather than 
>>>>>> aggregating
>>>>>> across multiple rows.  But even with that perspective I don’t know what 
>>>>>> the
>>>>>> best name would be.
>>>>>> >>
>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict 
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Thanks Andres, I think community input on direction here will be
>>>>>> invaluable. There’s a bunch of interrelated tickets, and my opinions are 
>>>>>> as
>>>>>> follows:
>>>>>> >>>
>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates
>>>>>> over rows containing collections, returning the collection with the most
>>>>>> elements. This is just a nonsensical operation to support IMO. We should
>>>>>> decide as a community whether we “fix” this aggregation, or remove it.
>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for
>>>>>> discovery, and all-else equal it would be better to use MAX,MIN, etc, 
>>>>>> same
>>>>>> as for aggregations.
>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to
>>>>>> operate over non-collection types when they are explicitly collection
>>>>>> variants.
>>>>>> >>>
>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains
>>>>>> ambiguous, but this could be solved by either providing a separate method
>>>>>> for collections (e.g. SIZE) which seems fine to me, or by offering a
>>>>>> precedence order for matching and a keyword for overriding the precedence
>>>>>> order (e.g. COUNT(collection AS COLLECTION)).
>>>>>> >>>
>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this
>>>>>> can be solved several ways.
>>>>>> >>> - We could permit explicit casts to collection types, that for a
>>>>>> collection type would be a no-op, and for a single value would create a
>>>>>> collection
>>>>>> >>> - With precedence orders, by always selecting the scalar function
>>>>>> last
>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce
>>>>>> function to resolve multiple values
>>>>>> >>>
>>>>>> >>> These decisions all imply trade-offs on each other, and affect
>>>>>> the evolution of CQL, so I think community input would be helpful.
>>>>>> >>>
>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <
>>>>>> adelap...@apache.org> wrote:
>>>>>> >>>>
>>>>>> >>>> 
>>>>>> >>>> This will require some long introduction for context:
>>>>>> >>>>
>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max
>>>>>> column value according to their comparator. For collections, the 
>>>>>> comparison
>>>>>> is on the lexicographical order of the collection elements. That's the 
>>>>>> very
>>>>>> same comparator that is used when collections are used as clustering keys
>>>>>> and for ORDER BY.
>>>>>> >>>>
>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make
>>>>>> that the results were presented in their unserialized form, although the
>>>>>> row selection was correct. That bug was recently solved by 
>>>>>> CASSANDRA-17811.
>>>>>> During that ticket it was also considered the option of simply disabling
>>>>>> MIN/MAX on collection since applying those functions to collections, 
>>>>>> since
>>>>>> they don't seem super useful. However, that option was quickly discarded
>>>>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>>>>> every data type.
>>>>>> >>>>
>>>>>> >>>> As a byproduct of the internal improvements of that fix,
>>>>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>>>>> aggregations of the elements of a collection. Those where named 
>>>>>> "map_keys",
>>>>>> "map_values", "collection_min", "collection_max", "collection_sum", and
>>>>>> "collection_count". Those are the names mentioned on the mail list thread
>>>>>> about function naming conventions. Despite doing a kind of
>>>>>> within-collection aggregation, these functions are not what we usually 
>>>>>> call
>>>>>> aggregate functions, since they don't aggregate multiple rows together.
>>>>>> >>>>
>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>>>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>>>>> However, the new collection functions can be used in combination with the
>>>>>> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
>>>>>> ttl of a multi-cell column. Since the new functions give a generic way of
>>>>>> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
>>>>>> proposed to remove that MAXWRITETIME function.
>>>>>> >>>>
>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>>>>> that function in favour of the new generic collection functions. However,
>>>>>> the MAXWRITETIME function can work on both single-cell and multi-cell
>>>>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work on
>>>>>> multi-cell columns, That's because MAXWRITETIME of a not-multicell column
>>>>>> doesn't return a collection, and one should simply use 
>>>>>> "WRITETIME(column)"
>>>>>> instead. So it was proposed in CASSANDRA-18037 that collections functions
>>>>>> applied to a not-collection value consider that value as the only element
>>>>>> of a singleton collection. So, for example, COLLECTION_MAX(7) =
>>>>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
>>>>>> mostly ready to commit.
>>>>>> >>>>
>>>>>> >>>> Now we can go straight to the point:
>>>>>> >>>>
>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>>>>>> functions applied to collections, the very same idea that was mentioned 
>>>>>> on
>>>>>> CASSANDRA-17811 description almost four months ago. That way we could
>>>>>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
>>>>>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
>>>>>> function when applied to not-collection columns, and a scalar function 
>>>>>> when
>>>>>> applied to collection. We can't do that with COUNT because there would be
>>>>>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT 
>>>>>> to
>>>>>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>>>>>> proposal.
>>>>>> >>>>
>>>>>> >>>> I however would prefer to keep aggregate functions working on
>>>>>> collections, and keep the names of the new collection functions as
>>>>>> "COLLECTION_*". Reasons are:
>>>>>> >>>>
>>>>>> >>>> 1 - Making aggregate functions not work on collections might be
>>>>>> cosidered as breaking backward compatibility and require a deprecation
>>>>>> plan.
>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not
>>>>>> look superuseful, but they make the set of aggregate functions consistent
>>>>>> and applicable to every column type.
>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>>>>> establishes a clear distinction between row aggregations and collection
>>>>>> aggregations, while at the same time exposing the analogy between each 
>>>>>> pair
>>>>>> of functions.
>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>>>>> workarounds such as using the column type when possible, or trying to
>>>>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works 
>>>>>> for
>>>>>> this case, future functions can find more trouble when trying to figure 
>>>>>> out
>>>>>> workarounds to avoid clashing with existing function names. For example, 
>>>>>> we
>>>>>> might want to add a SIZE function that gets the size in bytes of any
>>>>>> column, or we might want to add a MAX function that gets the maximum of a
>>>>>> set of columns, etc. And example of the synonym-based approach that comes
>>>>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>>>>>> aggregation and GREATEST is for column aggregation.
>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type,
>>>>>> we can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>>>>> collection functions with time functions.
>>>>>> >>>>
>>>>>> >>>> What do others think? What should we do with aggregate functions
>>>>>> on collections, collection functions and MAXWRITETIME?
>>>>>> >>
>>>>>>
>>>>>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread J. D. Jordan
 because there would be an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not correctly exposing the proposal.
>>>> 
>>>> I however would prefer to keep aggregate functions working on collections, and keep the names of the new collection functions as "COLLECTION_*". Reasons are:
>>>> 
>>>> 1 - Making aggregate functions not work on collections might be cosidered as breaking backward compatibility and require a deprecation plan. 
>>>> 2 - Keeping aggregate functions working on collections might not look superuseful, but they make the set of aggregate functions consistent and applicable to every column type.
>>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes a clear distinction between row aggregations and collection aggregations, while at the same time exposing the analogy between each pair of functions.
>>>> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds such as using the column type when possible, or trying to figure out synonyms like in the case of COUNT/SIZE. Even if that works for this case, future functions can find more trouble when trying to figure out workarounds to avoid clashing with existing function names. For example, we might want to add a SIZE function that gets the size in bytes of any column, or we might want to add a MAX function that gets the maximum of a set of columns, etc. And example of the synonym-based approach that comes to mind is MySQL's MAX and GREATEST functions, where MAX is for row aggregation and GREATEST is for column aggregation.
>>>> 5 - If MIN/MAX function selection is based on the column type, we can't implement Yifan's proposal of making COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7, which would be very useful for combining collection functions with time functions.
>>>> 
>>>> What do others think? What should we do with aggregate functions on collections, collection functions and MAXWRITETIME?
>> 







Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
; operate over non-collection types when they are explicitly collection
>>>>> variants.
>>>>> >>>
>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains
>>>>> ambiguous, but this could be solved by either providing a separate method
>>>>> for collections (e.g. SIZE) which seems fine to me, or by offering a
>>>>> precedence order for matching and a keyword for overriding the precedence
>>>>> order (e.g. COUNT(collection AS COLLECTION)).
>>>>> >>>
>>>>> >>> Given (2), (3) is a little more difficult. However, I think this
>>>>> can be solved several ways.
>>>>> >>> - We could permit explicit casts to collection types, that for a
>>>>> collection type would be a no-op, and for a single value would create a
>>>>> collection
>>>>> >>> - With precedence orders, by always selecting the scalar function
>>>>> last
>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce
>>>>> function to resolve multiple values
>>>>> >>>
>>>>> >>> These decisions all imply trade-offs on each other, and affect the
>>>>> evolution of CQL, so I think community input would be helpful.
>>>>> >>>
>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> 
>>>>> >>>> This will require some long introduction for context:
>>>>> >>>>
>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max
>>>>> column value according to their comparator. For collections, the 
>>>>> comparison
>>>>> is on the lexicographical order of the collection elements. That's the 
>>>>> very
>>>>> same comparator that is used when collections are used as clustering keys
>>>>> and for ORDER BY.
>>>>> >>>>
>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make
>>>>> that the results were presented in their unserialized form, although the
>>>>> row selection was correct. That bug was recently solved by 
>>>>> CASSANDRA-17811.
>>>>> During that ticket it was also considered the option of simply disabling
>>>>> MIN/MAX on collection since applying those functions to collections, since
>>>>> they don't seem super useful. However, that option was quickly discarded
>>>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>>>> every data type.
>>>>> >>>>
>>>>> >>>> As a byproduct of the internal improvements of that fix,
>>>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>>>> aggregations of the elements of a collection. Those where named 
>>>>> "map_keys",
>>>>> "map_values", "collection_min", "collection_max", "collection_sum", and
>>>>> "collection_count". Those are the names mentioned on the mail list thread
>>>>> about function naming conventions. Despite doing a kind of
>>>>> within-collection aggregation, these functions are not what we usually 
>>>>> call
>>>>> aggregate functions, since they don't aggregate multiple rows together.
>>>>> >>>>
>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>>>> However, the new collection functions can be used in combination with the
>>>>> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
>>>>> ttl of a multi-cell column. Since the new functions give a generic way of
>>>>> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
>>>>> proposed to remove that MAXWRITETIME function.
>>>>> >>>>
>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>>>> that function in favour of the new generic collection functions. However,
>>>>> the MAXWRITETIME function can work on both single-cell and multi-cell
>>>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work on
>>>>> multi-cell columns, That's because MAXWRITETIME of a not-multicell column
>>>>> doesn't return a collection, and one should simply use "WRITETIME(column)"
>>>>> instead. So it was proposed in CASSANDRA-18037 that collections functions
>>>>> applied to a not-collection value consider that value as the only element
>>>>> of a singleton collection. So, for example, COLLECTION_MAX(7) =
>>>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
>>>>> mostly ready to commit.
>>>>> >>>>
>>>>> >>>> Now we can go straight to the point:
>>>>> >>>>
>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>>>>> functions applied to collections, the very same idea that was mentioned on
>>>>> CASSANDRA-17811 description almost four months ago. That way we could
>>>>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
>>>>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
>>>>> function when applied to not-collection columns, and a scalar function 
>>>>> when
>>>>> applied to collection. We can't do that with COUNT because there would be
>>>>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT 
>>>>> to
>>>>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>>>>> proposal.
>>>>> >>>>
>>>>> >>>> I however would prefer to keep aggregate functions working on
>>>>> collections, and keep the names of the new collection functions as
>>>>> "COLLECTION_*". Reasons are:
>>>>> >>>>
>>>>> >>>> 1 - Making aggregate functions not work on collections might be
>>>>> cosidered as breaking backward compatibility and require a deprecation
>>>>> plan.
>>>>> >>>> 2 - Keeping aggregate functions working on collections might not
>>>>> look superuseful, but they make the set of aggregate functions consistent
>>>>> and applicable to every column type.
>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>>>> establishes a clear distinction between row aggregations and collection
>>>>> aggregations, while at the same time exposing the analogy between each 
>>>>> pair
>>>>> of functions.
>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>>>> workarounds such as using the column type when possible, or trying to
>>>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works for
>>>>> this case, future functions can find more trouble when trying to figure 
>>>>> out
>>>>> workarounds to avoid clashing with existing function names. For example, 
>>>>> we
>>>>> might want to add a SIZE function that gets the size in bytes of any
>>>>> column, or we might want to add a MAX function that gets the maximum of a
>>>>> set of columns, etc. And example of the synonym-based approach that comes
>>>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>>>>> aggregation and GREATEST is for column aggregation.
>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we
>>>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>>>> collection functions with time functions.
>>>>> >>>>
>>>>> >>>> What do others think? What should we do with aggregate functions
>>>>> on collections, collection functions and MAXWRITETIME?
>>>>> >>
>>>>>
>>>>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
.  So I am -1 on using the same names.
>>>>>>>>> >> 
>>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>>>>>>>> >>> operate over non-collection types when they are explicitly 
>>>>>>>>> >>> collection variants.
>>>>>>>>> >> 
>>>>>>>>> >> While I could see some point to this, I do not think it would be 
>>>>>>>>> >> confusing for something named collection_XXX to treat a 
>>>>>>>>> >> non-collection as a collection of 1.  But maybe there is a better 
>>>>>>>>> >> name for these function.  Rather than seeing them as collection 
>>>>>>>>> >> variants, we should see them as variants that operate on the data 
>>>>>>>>> >> in a single row, rather than aggregating across multiple rows.  
>>>>>>>>> >> But even with that perspective I don’t know what the best name 
>>>>>>>>> >> would be.
>>>>>>>>> >> 
>>>>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict  wrote:
>>>>>>>>> >>> 
>>>>>>>>> >>> Thanks Andres, I think community input on direction here will be 
>>>>>>>>> >>> invaluable. There’s a bunch of interrelated tickets, and my 
>>>>>>>>> >>> opinions are as follows:
>>>>>>>>> >>> 
>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates 
>>>>>>>>> >>> over rows containing collections, returning the collection with 
>>>>>>>>> >>> the most elements. This is just a nonsensical operation to 
>>>>>>>>> >>> support IMO. We should decide as a community whether we “fix” 
>>>>>>>>> >>> this aggregation, or remove it.
>>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, 
>>>>>>>>> >>> etc, same as for aggregations.
>>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>>>>>>>> >>> operate over non-collection types when they are explicitly 
>>>>>>>>> >>> collection variants.
>>>>>>>>> >>> 
>>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains 
>>>>>>>>> >>> ambiguous, but this could be solved by either providing a 
>>>>>>>>> >>> separate method for collections (e.g. SIZE) which seems fine to 
>>>>>>>>> >>> me, or by offering a precedence order for matching and a keyword 
>>>>>>>>> >>> for overriding the precedence order (e.g. COUNT(collection AS 
>>>>>>>>> >>> COLLECTION)).
>>>>>>>>> >>> 
>>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this 
>>>>>>>>> >>> can be solved several ways. 
>>>>>>>>> >>> - We could permit explicit casts to collection types, that for a 
>>>>>>>>> >>> collection type would be a no-op, and for a single value would 
>>>>>>>>> >>> create a collection
>>>>>>>>> >>> - With precedence orders, by always selecting the scalar function 
>>>>>>>>> >>> last
>>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce 
>>>>>>>>> >>> function to resolve multiple values
>>>>>>>>> >>> 
>>>>>>>>> >>> These decisions all imply trade-offs on each other, and affect 
>>>>>>>>> >>> the evolution of CQL, so I think community input would be helpful.
>>>>>>>>> >>> 
>>>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
>>>>>>>>> >>>>>  wrote:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> 
>>>>>>>>> >>>> This will require some long introduction for context:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max 
>>>>>>>>> >>>> column value according to their comparator. For collections, the 
>>>>>>>>> >>>> comparison is on the lexicographical order of the collection 
>>>>>>>>> >>>> elements. That's the very same comparator that is used when 
>>>>>>>>> >>>> collections are used as clustering keys and for ORDER BY.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make 
>>>>>>>>> >>>> that the results were presented in their unserialized form, 
>>>>>>>>> >>>> although the row selection was correct. That bug was recently 
>>>>>>>>> >>>> solved by CASSANDRA-17811. During that ticket it was also 
>>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on collection 
>>>>>>>>> >>>> since applying those functions to collections, since they don't 
>>>>>>>>> >>>> seem super useful. However, that option was quickly discarded 
>>>>>>>>> >>>> and the operation was fixed so the MIN/MAX functions correctly 
>>>>>>>>> >>>> work for every data type.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can 
>>>>>>>>> >>>> perform aggregations of the elements of a collection. Those 
>>>>>>>>> >>>> where named "map_keys", "map_values", "collection_min", 
>>>>>>>>> >>>> "collection_max", "collection_sum", and "collection_count". 
>>>>>>>>> >>>> Those are the names mentioned on the mail list thread about 
>>>>>>>>> >>>> function naming conventions. Despite doing a kind of 
>>>>>>>>> >>>> within-collection aggregation, these functions are not what we 
>>>>>>>>> >>>> usually call aggregate functions, since they don't aggregate 
>>>>>>>>> >>>> multiple rows together.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>>>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell 
>>>>>>>>> >>>> column. However, the new collection functions can be used in 
>>>>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the 
>>>>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since 
>>>>>>>>> >>>> the new functions give a generic way of aggreagting timestamps 
>>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to 
>>>>>>>>> >>>> remove that MAXWRITETIME function.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove 
>>>>>>>>> >>>> that function in favour of the new generic collection functions. 
>>>>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell 
>>>>>>>>> >>>> and multi-cell columns, whereas 
>>>>>>>>> >>>> "COLLECTION_MAX(WRITETIME(column))" would only work on 
>>>>>>>>> >>>> multi-cell columns, That's because MAXWRITETIME of a 
>>>>>>>>> >>>> not-multicell column doesn't return a collection, and one should 
>>>>>>>>> >>>> simply use "WRITETIME(column)" instead. So it was proposed in 
>>>>>>>>> >>>> CASSANDRA-18037 that collections functions applied to a 
>>>>>>>>> >>>> not-collection value consider that value as the only element of 
>>>>>>>>> >>>> a singleton collection. So, for example, COLLECTION_MAX(7) = 
>>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed 
>>>>>>>>> >>>> and it's mostly ready to commit.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Now we can go straight to the point:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>>>>>>>> >>>> functions applied to collections, the very same idea that was 
>>>>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. 
>>>>>>>>> >>>> That way we could rename the new collection functions 
>>>>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That 
>>>>>>>>> >>>> way MIN/MAX/SUM/AVG would be an aggregate function when applied 
>>>>>>>>> >>>> to not-collection columns, and a scalar function when applied to 
>>>>>>>>> >>>> collection. We can't do that with COUNT because there would be 
>>>>>>>>> >>>> an ambiguity, so the proposal for that case is renaming 
>>>>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not 
>>>>>>>>> >>>> correctly exposing the proposal.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> I however would prefer to keep aggregate functions working on 
>>>>>>>>> >>>> collections, and keep the names of the new collection functions 
>>>>>>>>> >>>> as "COLLECTION_*". Reasons are:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>>>>>>>> >>>> cosidered as breaking backward compatibility and require a 
>>>>>>>>> >>>> deprecation plan. 
>>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not 
>>>>>>>>> >>>> look superuseful, but they make the set of aggregate functions 
>>>>>>>>> >>>> consistent and applicable to every column type.
>>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions 
>>>>>>>>> >>>> establishes a clear distinction between row aggregations and 
>>>>>>>>> >>>> collection aggregations, while at the same time exposing the 
>>>>>>>>> >>>> analogy between each pair of functions.
>>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>>>>>>>> >>>> workarounds such as using the column type when possible, or 
>>>>>>>>> >>>> trying to figure out synonyms like in the case of COUNT/SIZE. 
>>>>>>>>> >>>> Even if that works for this case, future functions can find more 
>>>>>>>>> >>>> trouble when trying to figure out workarounds to avoid clashing 
>>>>>>>>> >>>> with existing function names. For example, we might want to add 
>>>>>>>>> >>>> a SIZE function that gets the size in bytes of any column, or we 
>>>>>>>>> >>>> might want to add a MAX function that gets the maximum of a set 
>>>>>>>>> >>>> of columns, etc. And example of the synonym-based approach that 
>>>>>>>>> >>>> comes to mind is MySQL's MAX and GREATEST functions, where MAX 
>>>>>>>>> >>>> is for row aggregation and GREATEST is for column aggregation.
>>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, 
>>>>>>>>> >>>> we can't implement Yifan's proposal of making COLLECTION_MAX(7) 
>>>>>>>>> >>>> = COLLECTION_MAX([7]) = 7, which would be very useful for 
>>>>>>>>> >>>> combining collection functions with time functions.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> What do others think? What should we do with aggregate functions 
>>>>>>>>> >>>> on collections, collection functions and MAXWRITETIME?
>>>>>>>>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
tions, returning the collection with 
>>>>>>>> >>> the most elements. This is just a nonsensical operation to support 
>>>>>>>> >>> IMO. We should decide as a community whether we “fix” this 
>>>>>>>> >>> aggregation, or remove it.
>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, 
>>>>>>>> >>> etc, same as for aggregations.
>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>>>>>>> >>> operate over non-collection types when they are explicitly 
>>>>>>>> >>> collection variants.
>>>>>>>> >>> 
>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains 
>>>>>>>> >>> ambiguous, but this could be solved by either providing a separate 
>>>>>>>> >>> method for collections (e.g. SIZE) which seems fine to me, or by 
>>>>>>>> >>> offering a precedence order for matching and a keyword for 
>>>>>>>> >>> overriding the precedence order (e.g. COUNT(collection AS 
>>>>>>>> >>> COLLECTION)).
>>>>>>>> >>> 
>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this 
>>>>>>>> >>> can be solved several ways. 
>>>>>>>> >>> - We could permit explicit casts to collection types, that for a 
>>>>>>>> >>> collection type would be a no-op, and for a single value would 
>>>>>>>> >>> create a collection
>>>>>>>> >>> - With precedence orders, by always selecting the scalar function 
>>>>>>>> >>> last
>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce 
>>>>>>>> >>> function to resolve multiple values
>>>>>>>> >>> 
>>>>>>>> >>> These decisions all imply trade-offs on each other, and affect the 
>>>>>>>> >>> evolution of CQL, so I think community input would be helpful.
>>>>>>>> >>> 
>>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
>>>>>>>> >>>>>  wrote:
>>>>>>>> >>>> 
>>>>>>>> >>>> 
>>>>>>>> >>>> This will require some long introduction for context:
>>>>>>>> >>>> 
>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max 
>>>>>>>> >>>> column value according to their comparator. For collections, the 
>>>>>>>> >>>> comparison is on the lexicographical order of the collection 
>>>>>>>> >>>> elements. That's the very same comparator that is used when 
>>>>>>>> >>>> collections are used as clustering keys and for ORDER BY.
>>>>>>>> >>>> 
>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make 
>>>>>>>> >>>> that the results were presented in their unserialized form, 
>>>>>>>> >>>> although the row selection was correct. That bug was recently 
>>>>>>>> >>>> solved by CASSANDRA-17811. During that ticket it was also 
>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on collection 
>>>>>>>> >>>> since applying those functions to collections, since they don't 
>>>>>>>> >>>> seem super useful. However, that option was quickly discarded and 
>>>>>>>> >>>> the operation was fixed so the MIN/MAX functions correctly work 
>>>>>>>> >>>> for every data type.
>>>>>>>> >>>> 
>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can perform 
>>>>>>>> >>>> aggregations of the elements of a collection. Those where named 
>>>>>>>> >>>> "map_keys", "map_values", "collection_min", "collection_max", 
>>>>>>>> >>>> "collection_sum", and "collection_count". Those are the names 
>>>>>>>> >>>> mentioned on the mail list thread about function naming 
>>>>>>>> >>>> conventions. Despite doing a kind of within-collection 
>>>>>>>> >>>> aggregation, these functions are not what we usually call 
>>>>>>>> >>>> aggregate functions, since they don't aggregate multiple rows 
>>>>>>>> >>>> together.
>>>>>>>> >>>> 
>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell 
>>>>>>>> >>>> column. However, the new collection functions can be used in 
>>>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the 
>>>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since 
>>>>>>>> >>>> the new functions give a generic way of aggreagting timestamps 
>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to 
>>>>>>>> >>>> remove that MAXWRITETIME function.
>>>>>>>> >>>> 
>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove 
>>>>>>>> >>>> that function in favour of the new generic collection functions. 
>>>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell 
>>>>>>>> >>>> and multi-cell columns, whereas 
>>>>>>>> >>>> "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell 
>>>>>>>> >>>> columns, That's because MAXWRITETIME of a not-multicell column 
>>>>>>>> >>>> doesn't return a collection, and one should simply use 
>>>>>>>> >>>> "WRITETIME(column)" instead. So it was proposed in 
>>>>>>>> >>>> CASSANDRA-18037 that collections functions applied to a 
>>>>>>>> >>>> not-collection value consider that value as the only element of a 
>>>>>>>> >>>> singleton collection. So, for example, COLLECTION_MAX(7) = 
>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed 
>>>>>>>> >>>> and it's mostly ready to commit.
>>>>>>>> >>>> 
>>>>>>>> >>>> Now we can go straight to the point:
>>>>>>>> >>>> 
>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>>>>>>> >>>> functions applied to collections, the very same idea that was 
>>>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. 
>>>>>>>> >>>> That way we could rename the new collection functions 
>>>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That 
>>>>>>>> >>>> way MIN/MAX/SUM/AVG would be an aggregate function when applied 
>>>>>>>> >>>> to not-collection columns, and a scalar function when applied to 
>>>>>>>> >>>> collection. We can't do that with COUNT because there would be an 
>>>>>>>> >>>> ambiguity, so the proposal for that case is renaming 
>>>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not 
>>>>>>>> >>>> correctly exposing the proposal.
>>>>>>>> >>>> 
>>>>>>>> >>>> I however would prefer to keep aggregate functions working on 
>>>>>>>> >>>> collections, and keep the names of the new collection functions 
>>>>>>>> >>>> as "COLLECTION_*". Reasons are:
>>>>>>>> >>>> 
>>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>>>>>>> >>>> cosidered as breaking backward compatibility and require a 
>>>>>>>> >>>> deprecation plan. 
>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not 
>>>>>>>> >>>> look superuseful, but they make the set of aggregate functions 
>>>>>>>> >>>> consistent and applicable to every column type.
>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions 
>>>>>>>> >>>> establishes a clear distinction between row aggregations and 
>>>>>>>> >>>> collection aggregations, while at the same time exposing the 
>>>>>>>> >>>> analogy between each pair of functions.
>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>>>>>>> >>>> workarounds such as using the column type when possible, or 
>>>>>>>> >>>> trying to figure out synonyms like in the case of COUNT/SIZE. 
>>>>>>>> >>>> Even if that works for this case, future functions can find more 
>>>>>>>> >>>> trouble when trying to figure out workarounds to avoid clashing 
>>>>>>>> >>>> with existing function names. For example, we might want to add a 
>>>>>>>> >>>> SIZE function that gets the size in bytes of any column, or we 
>>>>>>>> >>>> might want to add a MAX function that gets the maximum of a set 
>>>>>>>> >>>> of columns, etc. And example of the synonym-based approach that 
>>>>>>>> >>>> comes to mind is MySQL's MAX and GREATEST functions, where MAX is 
>>>>>>>> >>>> for row aggregation and GREATEST is for column aggregation.
>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we 
>>>>>>>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>>>>>>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>>>>>>> >>>> collection functions with time functions.
>>>>>>>> >>>> 
>>>>>>>> >>>> What do others think? What should we do with aggregate functions 
>>>>>>>> >>>> on collections, collection functions and MAXWRITETIME?
>>>>>>>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
t;>>>>>>> >>>> This will require some long introduction for context:
>>>>>>>> >>>> 
>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max 
>>>>>>>> >>>> column value according to their comparator. For collections, the 
>>>>>>>> >>>> comparison is on the lexicographical order of the collection 
>>>>>>>> >>>> elements. That's the very same comparator that is used when 
>>>>>>>> >>>> collections are used as clustering keys and for ORDER BY.
>>>>>>>> >>>> 
>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make 
>>>>>>>> >>>> that the results were presented in their unserialized form, 
>>>>>>>> >>>> although the row selection was correct. That bug was recently 
>>>>>>>> >>>> solved by CASSANDRA-17811. During that ticket it was also 
>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on collection 
>>>>>>>> >>>> since applying those functions to collections, since they don't 
>>>>>>>> >>>> seem super useful. However, that option was quickly discarded and 
>>>>>>>> >>>> the operation was fixed so the MIN/MAX functions correctly work 
>>>>>>>> >>>> for every data type.
>>>>>>>> >>>> 
>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can perform 
>>>>>>>> >>>> aggregations of the elements of a collection. Those where named 
>>>>>>>> >>>> "map_keys", "map_values", "collection_min", "collection_max", 
>>>>>>>> >>>> "collection_sum", and "collection_count". Those are the names 
>>>>>>>> >>>> mentioned on the mail list thread about function naming 
>>>>>>>> >>>> conventions. Despite doing a kind of within-collection 
>>>>>>>> >>>> aggregation, these functions are not what we usually call 
>>>>>>>> >>>> aggregate functions, since they don't aggregate multiple rows 
>>>>>>>> >>>> together.
>>>>>>>> >>>> 
>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell 
>>>>>>>> >>>> column. However, the new collection functions can be used in 
>>>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the 
>>>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since 
>>>>>>>> >>>> the new functions give a generic way of aggreagting timestamps 
>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to 
>>>>>>>> >>>> remove that MAXWRITETIME function.
>>>>>>>> >>>> 
>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove 
>>>>>>>> >>>> that function in favour of the new generic collection functions. 
>>>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell 
>>>>>>>> >>>> and multi-cell columns, whereas 
>>>>>>>> >>>> "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell 
>>>>>>>> >>>> columns, That's because MAXWRITETIME of a not-multicell column 
>>>>>>>> >>>> doesn't return a collection, and one should simply use 
>>>>>>>> >>>> "WRITETIME(column)" instead. So it was proposed in 
>>>>>>>> >>>> CASSANDRA-18037 that collections functions applied to a 
>>>>>>>> >>>> not-collection value consider that value as the only element of a 
>>>>>>>> >>>> singleton collection. So, for example, COLLECTION_MAX(7) = 
>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed 
>>>>>>>> >>>> and it's mostly ready to commit.
>>>>>>>> >>>> 
>>>>>>>> >>>> Now we can go straight to the point:
>>>>>>>> >>>> 
>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>>>>>>> >>>> functions applied to collections, the very same idea that was 
>>>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. 
>>>>>>>> >>>> That way we could rename the new collection functions 
>>>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That 
>>>>>>>> >>>> way MIN/MAX/SUM/AVG would be an aggregate function when applied 
>>>>>>>> >>>> to not-collection columns, and a scalar function when applied to 
>>>>>>>> >>>> collection. We can't do that with COUNT because there would be an 
>>>>>>>> >>>> ambiguity, so the proposal for that case is renaming 
>>>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not 
>>>>>>>> >>>> correctly exposing the proposal.
>>>>>>>> >>>> 
>>>>>>>> >>>> I however would prefer to keep aggregate functions working on 
>>>>>>>> >>>> collections, and keep the names of the new collection functions 
>>>>>>>> >>>> as "COLLECTION_*". Reasons are:
>>>>>>>> >>>> 
>>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>>>>>>> >>>> cosidered as breaking backward compatibility and require a 
>>>>>>>> >>>> deprecation plan. 
>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not 
>>>>>>>> >>>> look superuseful, but they make the set of aggregate functions 
>>>>>>>> >>>> consistent and applicable to every column type.
>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions 
>>>>>>>> >>>> establishes a clear distinction between row aggregations and 
>>>>>>>> >>>> collection aggregations, while at the same time exposing the 
>>>>>>>> >>>> analogy between each pair of functions.
>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>>>>>>> >>>> workarounds such as using the column type when possible, or 
>>>>>>>> >>>> trying to figure out synonyms like in the case of COUNT/SIZE. 
>>>>>>>> >>>> Even if that works for this case, future functions can find more 
>>>>>>>> >>>> trouble when trying to figure out workarounds to avoid clashing 
>>>>>>>> >>>> with existing function names. For example, we might want to add a 
>>>>>>>> >>>> SIZE function that gets the size in bytes of any column, or we 
>>>>>>>> >>>> might want to add a MAX function that gets the maximum of a set 
>>>>>>>> >>>> of columns, etc. And example of the synonym-based approach that 
>>>>>>>> >>>> comes to mind is MySQL's MAX and GREATEST functions, where MAX is 
>>>>>>>> >>>> for row aggregation and GREATEST is for column aggregation.
>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we 
>>>>>>>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>>>>>>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>>>>>>> >>>> collection functions with time functions.
>>>>>>>> >>>> 
>>>>>>>> >>>> What do others think? What should we do with aggregate functions 
>>>>>>>> >>>> on collections, collection functions and MAXWRITETIME?
>>>>>>>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
s to collections, since
>>>> they don't seem super useful. However, that option was quickly discarded
>>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>>> every data type.
>>>> >>>>
>>>> >>>> As a byproduct of the internal improvements of that fix,
>>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>>> aggregations of the elements of a collection. Those where named "map_keys",
>>>> "map_values", "collection_min", "collection_max", "collection_sum", and
>>>> "collection_count". Those are the names mentioned on the mail list thread
>>>> about function naming conventions. Despite doing a kind of
>>>> within-collection aggregation, these functions are not what we usually call
>>>> aggregate functions, since they don't aggregate multiple rows together.
>>>> >>>>
>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>>> However, the new collection functions can be used in combination with the
>>>> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
>>>> ttl of a multi-cell column. Since the new functions give a generic way of
>>>> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
>>>> proposed to remove that MAXWRITETIME function.
>>>> >>>>
>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>>> that function in favour of the new generic collection functions. However,
>>>> the MAXWRITETIME function can work on both single-cell and multi-cell
>>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work on
>>>> multi-cell columns, That's because MAXWRITETIME of a not-multicell column
>>>> doesn't return a collection, and one should simply use "WRITETIME(column)"
>>>> instead. So it was proposed in CASSANDRA-18037 that collections functions
>>>> applied to a not-collection value consider that value as the only element
>>>> of a singleton collection. So, for example, COLLECTION_MAX(7) =
>>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
>>>> mostly ready to commit.
>>>> >>>>
>>>> >>>> Now we can go straight to the point:
>>>> >>>>
>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>>>> functions applied to collections, the very same idea that was mentioned on
>>>> CASSANDRA-17811 description almost four months ago. That way we could
>>>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
>>>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
>>>> function when applied to not-collection columns, and a scalar function when
>>>> applied to collection. We can't do that with COUNT because there would be
>>>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
>>>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>>>> proposal.
>>>> >>>>
>>>> >>>> I however would prefer to keep aggregate functions working on
>>>> collections, and keep the names of the new collection functions as
>>>> "COLLECTION_*". Reasons are:
>>>> >>>>
>>>> >>>> 1 - Making aggregate functions not work on collections might be
>>>> cosidered as breaking backward compatibility and require a deprecation
>>>> plan.
>>>> >>>> 2 - Keeping aggregate functions working on collections might not
>>>> look superuseful, but they make the set of aggregate functions consistent
>>>> and applicable to every column type.
>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>>> establishes a clear distinction between row aggregations and collection
>>>> aggregations, while at the same time exposing the analogy between each pair
>>>> of functions.
>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>>> workarounds such as using the column type when possible, or trying to
>>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works for
>>>> this case, future functions can find more trouble when trying to figure out
>>>> workarounds to avoid clashing with existing function names. For example, we
>>>> might want to add a SIZE function that gets the size in bytes of any
>>>> column, or we might want to add a MAX function that gets the maximum of a
>>>> set of columns, etc. And example of the synonym-based approach that comes
>>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>>>> aggregation and GREATEST is for column aggregation.
>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we
>>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>>> collection functions with time functions.
>>>> >>>>
>>>> >>>> What do others think? What should we do with aggregate functions
>>>> on collections, collection functions and MAXWRITETIME?
>>>> >>
>>>>
>>>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can perform 
>>>>>> >>>> aggregations of the elements of a collection. Those where named 
>>>>>> >>>> "map_keys", "map_values", "collection_min", "collection_max", 
>>>>>> >>>> "collection_sum", and "collection_count". Those are the names 
>>>>>> >>>> mentioned on the mail list thread about function naming 
>>>>>> >>>> conventions. Despite doing a kind of within-collection aggregation, 
>>>>>> >>>> these functions are not what we usually call aggregate functions, 
>>>>>> >>>> since they don't aggregate multiple rows together.
>>>>>> >>>> 
>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell 
>>>>>> >>>> column. However, the new collection functions can be used in 
>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the 
>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since the 
>>>>>> >>>> new functions give a generic way of aggreagting timestamps ant TTLs 
>>>>>> >>>> of multi-cell columns, CASSANDRA-18078 proposed to remove that 
>>>>>> >>>> MAXWRITETIME function.
>>>>>> >>>> 
>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove 
>>>>>> >>>> that function in favour of the new generic collection functions. 
>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell and 
>>>>>> >>>> multi-cell columns, whereas "COLLECTION_MAX(WRITETIME(column))" 
>>>>>> >>>> would only work on multi-cell columns, That's because MAXWRITETIME 
>>>>>> >>>> of a not-multicell column doesn't return a collection, and one 
>>>>>> >>>> should simply use "WRITETIME(column)" instead. So it was proposed 
>>>>>> >>>> in CASSANDRA-18037 that collections functions applied to a 
>>>>>> >>>> not-collection value consider that value as the only element of a 
>>>>>> >>>> singleton collection. So, for example, COLLECTION_MAX(7) = 
>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and 
>>>>>> >>>> it's mostly ready to commit.
>>>>>> >>>> 
>>>>>> >>>> Now we can go straight to the point:
>>>>>> >>>> 
>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>>>>> >>>> functions applied to collections, the very same idea that was 
>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. 
>>>>>> >>>> That way we could rename the new collection functions 
>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That way 
>>>>>> >>>> MIN/MAX/SUM/AVG would be an aggregate function when applied to 
>>>>>> >>>> not-collection columns, and a scalar function when applied to 
>>>>>> >>>> collection. We can't do that with COUNT because there would be an 
>>>>>> >>>> ambiguity, so the proposal for that case is renaming 
>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not 
>>>>>> >>>> correctly exposing the proposal.
>>>>>> >>>> 
>>>>>> >>>> I however would prefer to keep aggregate functions working on 
>>>>>> >>>> collections, and keep the names of the new collection functions as 
>>>>>> >>>> "COLLECTION_*". Reasons are:
>>>>>> >>>> 
>>>>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>>>>> >>>> cosidered as breaking backward compatibility and require a 
>>>>>> >>>> deprecation plan. 
>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not 
>>>>>> >>>> look superuseful, but they make the set of aggregate functions 
>>>>>> >>>> consistent and applicable to every column type.
>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions 
>>>>>> >>>> establishes a clear distinction between row aggregations and 
>>>>>> >>>> collection aggregations, while at the same time exposing the 
>>>>>> >>>> analogy between each pair of functions.
>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>>>>> >>>> workarounds such as using the column type when possible, or trying 
>>>>>> >>>> to figure out synonyms like in the case of COUNT/SIZE. Even if that 
>>>>>> >>>> works for this case, future functions can find more trouble when 
>>>>>> >>>> trying to figure out workarounds to avoid clashing with existing 
>>>>>> >>>> function names. For example, we might want to add a SIZE function 
>>>>>> >>>> that gets the size in bytes of any column, or we might want to add 
>>>>>> >>>> a MAX function that gets the maximum of a set of columns, etc. And 
>>>>>> >>>> example of the synonym-based approach that comes to mind is MySQL's 
>>>>>> >>>> MAX and GREATEST functions, where MAX is for row aggregation and 
>>>>>> >>>> GREATEST is for column aggregation.
>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we 
>>>>>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>>>>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>>>>> >>>> collection functions with time functions.
>>>>>> >>>> 
>>>>>> >>>> What do others think? What should we do with aggregate functions on 
>>>>>> >>>> collections, collection functions and MAXWRITETIME?
>>>>>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Andrés de la Peña
t;>> the results were presented in their unserialized form, although the row
>>> selection was correct. That bug was recently solved by CASSANDRA-17811.
>>> During that ticket it was also considered the option of simply disabling
>>> MIN/MAX on collection since applying those functions to collections, since
>>> they don't seem super useful. However, that option was quickly discarded
>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>> every data type.
>>> >>>>
>>> >>>> As a byproduct of the internal improvements of that fix,
>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>> aggregations of the elements of a collection. Those where named "map_keys",
>>> "map_values", "collection_min", "collection_max", "collection_sum", and
>>> "collection_count". Those are the names mentioned on the mail list thread
>>> about function naming conventions. Despite doing a kind of
>>> within-collection aggregation, these functions are not what we usually call
>>> aggregate functions, since they don't aggregate multiple rows together.
>>> >>>>
>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>> However, the new collection functions can be used in combination with the
>>> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
>>> ttl of a multi-cell column. Since the new functions give a generic way of
>>> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
>>> proposed to remove that MAXWRITETIME function.
>>> >>>>
>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>> that function in favour of the new generic collection functions. However,
>>> the MAXWRITETIME function can work on both single-cell and multi-cell
>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work on
>>> multi-cell columns, That's because MAXWRITETIME of a not-multicell column
>>> doesn't return a collection, and one should simply use "WRITETIME(column)"
>>> instead. So it was proposed in CASSANDRA-18037 that collections functions
>>> applied to a not-collection value consider that value as the only element
>>> of a singleton collection. So, for example, COLLECTION_MAX(7) =
>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
>>> mostly ready to commit.
>>> >>>>
>>> >>>> Now we can go straight to the point:
>>> >>>>
>>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>>> functions applied to collections, the very same idea that was mentioned on
>>> CASSANDRA-17811 description almost four months ago. That way we could
>>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
>>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
>>> function when applied to not-collection columns, and a scalar function when
>>> applied to collection. We can't do that with COUNT because there would be
>>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
>>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>>> proposal.
>>> >>>>
>>> >>>> I however would prefer to keep aggregate functions working on
>>> collections, and keep the names of the new collection functions as
>>> "COLLECTION_*". Reasons are:
>>> >>>>
>>> >>>> 1 - Making aggregate functions not work on collections might be
>>> cosidered as breaking backward compatibility and require a deprecation
>>> plan.
>>> >>>> 2 - Keeping aggregate functions working on collections might not
>>> look superuseful, but they make the set of aggregate functions consistent
>>> and applicable to every column type.
>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>> establishes a clear distinction between row aggregations and collection
>>> aggregations, while at the same time exposing the analogy between each pair
>>> of functions.
>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>> workarounds such as using the column type when possible, or trying to
>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works for
>>> this case, future functions can find more trouble when trying to figure out
>>> workarounds to avoid clashing with existing function names. For example, we
>>> might want to add a SIZE function that gets the size in bytes of any
>>> column, or we might want to add a MAX function that gets the maximum of a
>>> set of columns, etc. And example of the synonym-based approach that comes
>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>>> aggregation and GREATEST is for column aggregation.
>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we
>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>> collection functions with time functions.
>>> >>>>
>>> >>>> What do others think? What should we do with aggregate functions on
>>> collections, collection functions and MAXWRITETIME?
>>> >>
>>>
>>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Claude Warren, Jr via dev
c collection functions. However,
>> the MAXWRITETIME function can work on both single-cell and multi-cell
>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work on
>> multi-cell columns, That's because MAXWRITETIME of a not-multicell column
>> doesn't return a collection, and one should simply use "WRITETIME(column)"
>> instead. So it was proposed in CASSANDRA-18037 that collections functions
>> applied to a not-collection value consider that value as the only element
>> of a singleton collection. So, for example, COLLECTION_MAX(7) =
>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
>> mostly ready to commit.
>> >>>>
>> >>>> Now we can go straight to the point:
>> >>>>
>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>> functions applied to collections, the very same idea that was mentioned on
>> CASSANDRA-17811 description almost four months ago. That way we could
>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
>> function when applied to not-collection columns, and a scalar function when
>> applied to collection. We can't do that with COUNT because there would be
>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
>> SIZE. Benedict, please correct me if I'm not correctly exposing the
>> proposal.
>> >>>>
>> >>>> I however would prefer to keep aggregate functions working on
>> collections, and keep the names of the new collection functions as
>> "COLLECTION_*". Reasons are:
>> >>>>
>> >>>> 1 - Making aggregate functions not work on collections might be
>> cosidered as breaking backward compatibility and require a deprecation
>> plan.
>> >>>> 2 - Keeping aggregate functions working on collections might not
>> look superuseful, but they make the set of aggregate functions consistent
>> and applicable to every column type.
>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>> establishes a clear distinction between row aggregations and collection
>> aggregations, while at the same time exposing the analogy between each pair
>> of functions.
>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>> workarounds such as using the column type when possible, or trying to
>> figure out synonyms like in the case of COUNT/SIZE. Even if that works for
>> this case, future functions can find more trouble when trying to figure out
>> workarounds to avoid clashing with existing function names. For example, we
>> might want to add a SIZE function that gets the size in bytes of any
>> column, or we might want to add a MAX function that gets the maximum of a
>> set of columns, etc. And example of the synonym-based approach that comes
>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
>> aggregation and GREATEST is for column aggregation.
>> >>>> 5 - If MIN/MAX function selection is based on the column type, we
>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>> collection functions with time functions.
>> >>>>
>> >>>> What do others think? What should we do with aggregate functions on
>> collections, collection functions and MAXWRITETIME?
>> >>
>>
>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
work, CASSANDRA-17425 added to trunk a 
>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell column. 
>>> >>>> However, the new collection functions can be used in combination with 
>>> >>>> the WRITETIME and TTL functions to retrieve the min/max/sum/avg 
>>> >>>> timestamp or ttl of a multi-cell column. Since the new functions give 
>>> >>>> a generic way of aggreagting timestamps ant TTLs of multi-cell 
>>> >>>> columns, CASSANDRA-18078 proposed to remove that MAXWRITETIME function.
>>> >>>> 
>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that 
>>> >>>> function in favour of the new generic collection functions. However, 
>>> >>>> the MAXWRITETIME function can work on both single-cell and multi-cell 
>>> >>>> columns, whereas "COLLECTION_MAX(WRITETIME(column))" would only work 
>>> >>>> on multi-cell columns, That's because MAXWRITETIME of a not-multicell 
>>> >>>> column doesn't return a collection, and one should simply use 
>>> >>>> "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037 
>>> >>>> that collections functions applied to a not-collection value consider 
>>> >>>> that value as the only element of a singleton collection. So, for 
>>> >>>> example, COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has 
>>> >>>> already been reviewed and it's mostly ready to commit.
>>> >>>> 
>>> >>>> Now we can go straight to the point:
>>> >>>> 
>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>> >>>> functions applied to collections, the very same idea that was 
>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. That 
>>> >>>> way we could rename the new collection functions MIN/MAX/SUM/AVG, same 
>>> >>>> as the classic aggregate functions. That way MIN/MAX/SUM/AVG would be 
>>> >>>> an aggregate function when applied to not-collection columns, and a 
>>> >>>> scalar function when applied to collection. We can't do that with 
>>> >>>> COUNT because there would be an ambiguity, so the proposal for that 
>>> >>>> case is renaming COLLECTION_COUNT to SIZE. Benedict, please correct me 
>>> >>>> if I'm not correctly exposing the proposal.
>>> >>>> 
>>> >>>> I however would prefer to keep aggregate functions working on 
>>> >>>> collections, and keep the names of the new collection functions as 
>>> >>>> "COLLECTION_*". Reasons are:
>>> >>>> 
>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>> >>>> cosidered as breaking backward compatibility and require a deprecation 
>>> >>>> plan. 
>>> >>>> 2 - Keeping aggregate functions working on collections might not look 
>>> >>>> superuseful, but they make the set of aggregate functions consistent 
>>> >>>> and applicable to every column type.
>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes 
>>> >>>> a clear distinction between row aggregations and collection 
>>> >>>> aggregations, while at the same time exposing the analogy between each 
>>> >>>> pair of functions.
>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>> >>>> workarounds such as using the column type when possible, or trying to 
>>> >>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works 
>>> >>>> for this case, future functions can find more trouble when trying to 
>>> >>>> figure out workarounds to avoid clashing with existing function names. 
>>> >>>> For example, we might want to add a SIZE function that gets the size 
>>> >>>> in bytes of any column, or we might want to add a MAX function that 
>>> >>>> gets the maximum of a set of columns, etc. And example of the 
>>> >>>> synonym-based approach that comes to mind is MySQL's MAX and GREATEST 
>>> >>>> functions, where MAX is for row aggregation and GREATEST is for column 
>>> >>>> aggregation.
>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we 
>>> >>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>> >>>> collection functions with time functions.
>>> >>>> 
>>> >>>> What do others think? What should we do with aggregate functions on 
>>> >>>> collections, collection functions and MAXWRITETIME?
>>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Avi Kivity via dev
MAXWRITETIME function, agreed to remove
> > that function in favour of the new generic collection functions.
> > However, the MAXWRITETIME function can work on both single-cell and
> > multi-cell columns, whereas "COLLECTION_MAX(WRITETIME(column))"
> > would only work on multi-cell columns, That's because MAXWRITETIME
> > of a not-multicell column doesn't return a collection, and one
> > should simply use "WRITETIME(column)" instead. So it was proposed
> > in CASSANDRA-18037 that collections functions applied to a not-
> > collection value consider that value as the only element of a
> > singleton collection. So, for example, COLLECTION_MAX(7) =
> > COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and
> > it's mostly ready to commit.
> > 
> > Now we can go straight to the point:
> > 
> > Recently Benedict brought back the idea of deprecating aggregate
> > functions applied to collections, the very same idea that was
> > mentioned on CASSANDRA-17811 description almost four months ago.
> > That way we could rename the new collection functions
> > MIN/MAX/SUM/AVG, same as the classic aggregate functions. That way
> > MIN/MAX/SUM/AVG would be an aggregate function when applied to not-
> > collection columns, and a scalar function when applied to
> > collection. We can't do that with COUNT because there would be an
> > ambiguity, so the proposal for that case is renaming
> > COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not
> > correctly exposing the proposal.
> > 
> > I however would prefer to keep aggregate functions working on
> > collections, and keep the names of the new collection functions as
> > "COLLECTION_*". Reasons are:
> > 
> > 1 - Making aggregate functions not work on collections might be
> > cosidered as breaking backward compatibility and require a
> > deprecation plan. 
> > 2 - Keeping aggregate functions working on collections might not
> > look superuseful, but they make the set of aggregate functions
> > consistent and applicable to every column type.
> > 3 - Using the "COLLECTION_" prefix on collection functions
> > establishes a clear distinction between row aggregations and
> > collection aggregations, while at the same time exposing the
> > analogy between each pair of functions.
> > 4 - Not using the "COLLECTION_" prefix forces us to search for
> > workarounds such as using the column type when possible, or trying
> > to figure out synonyms like in the case of COUNT/SIZE. Even if that
> > works for this case, future functions can find more trouble when
> > trying to figure out workarounds to avoid clashing with existing
> > function names. For example, we might want to add a SIZE function
> > that gets the size in bytes of any column, or we might want to add
> > a MAX function that gets the maximum of a set of columns, etc. And
> > example of the synonym-based approach that comes to mind is MySQL's
> > MAX and GREATEST functions, where MAX is for row aggregation and
> > GREATEST is for column aggregation.
> > 5 - If MIN/MAX function selection is based on the column type, we
> > can't implement Yifan's proposal of making COLLECTION_MAX(7) =
> > COLLECTION_MAX([7]) = 7, which would be very useful for combining
> > collection functions with time functions.
> > 
> > What do others think? What should we do with aggregate functions on
> > collections, collection functions and MAXWRITETIME?



Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-08 Thread Benedict
n't return a collection, and one should simply use 
>> >>>> "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037 that 
>> >>>> collections functions applied to a not-collection value consider that 
>> >>>> value as the only element of a singleton collection. So, for example, 
>> >>>> COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has already 
>> >>>> been reviewed and it's mostly ready to commit.
>> >>>> 
>> >>>> Now we can go straight to the point:
>> >>>> 
>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>> >>>> functions applied to collections, the very same idea that was mentioned 
>> >>>> on CASSANDRA-17811 description almost four months ago. That way we 
>> >>>> could rename the new collection functions MIN/MAX/SUM/AVG, same as the 
>> >>>> classic aggregate functions. That way MIN/MAX/SUM/AVG would be an 
>> >>>> aggregate function when applied to not-collection columns, and a scalar 
>> >>>> function when applied to collection. We can't do that with COUNT 
>> >>>> because there would be an ambiguity, so the proposal for that case is 
>> >>>> renaming COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm 
>> >>>> not correctly exposing the proposal.
>> >>>> 
>> >>>> I however would prefer to keep aggregate functions working on 
>> >>>> collections, and keep the names of the new collection functions as 
>> >>>> "COLLECTION_*". Reasons are:
>> >>>> 
>> >>>> 1 - Making aggregate functions not work on collections might be 
>> >>>> cosidered as breaking backward compatibility and require a deprecation 
>> >>>> plan. 
>> >>>> 2 - Keeping aggregate functions working on collections might not look 
>> >>>> superuseful, but they make the set of aggregate functions consistent 
>> >>>> and applicable to every column type.
>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes 
>> >>>> a clear distinction between row aggregations and collection 
>> >>>> aggregations, while at the same time exposing the analogy between each 
>> >>>> pair of functions.
>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>> >>>> workarounds such as using the column type when possible, or trying to 
>> >>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works 
>> >>>> for this case, future functions can find more trouble when trying to 
>> >>>> figure out workarounds to avoid clashing with existing function names. 
>> >>>> For example, we might want to add a SIZE function that gets the size in 
>> >>>> bytes of any column, or we might want to add a MAX function that gets 
>> >>>> the maximum of a set of columns, etc. And example of the synonym-based 
>> >>>> approach that comes to mind is MySQL's MAX and GREATEST functions, 
>> >>>> where MAX is for row aggregation and GREATEST is for column aggregation.
>> >>>> 5 - If MIN/MAX function selection is based on the column type, we can't 
>> >>>> implement Yifan's proposal of making COLLECTION_MAX(7) = 
>> >>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>> >>>> collection functions with time functions.
>> >>>> 
>> >>>> What do others think? What should we do with aggregate functions on 
>> >>>> collections, collection functions and MAXWRITETIME?
>> >> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-07 Thread Andrés de la Peña
XWRITETIME function to get the max timestamp of a multi-cell column.
> However, the new collection functions can be used in combination with the
> WRITETIME and TTL functions to retrieve the min/max/sum/avg timestamp or
> ttl of a multi-cell column. Since the new functions give a generic way of
> aggreagting timestamps ant TTLs of multi-cell columns, CASSANDRA-18078
> proposed to remove that MAXWRITETIME function.
> >>>>
> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that
> function in favour of the new generic collection functions. However, the
> MAXWRITETIME function can work on both single-cell and multi-cell columns,
> whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell
> columns, That's because MAXWRITETIME of a not-multicell column doesn't
> return a collection, and one should simply use "WRITETIME(column)" instead.
> So it was proposed in CASSANDRA-18037 that collections functions applied to
> a not-collection value consider that value as the only element of a
> singleton collection. So, for example, COLLECTION_MAX(7) =
> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
> mostly ready to commit.
> >>>>
> >>>> Now we can go straight to the point:
> >>>>
> >>>> Recently Benedict brought back the idea of deprecating aggregate
> functions applied to collections, the very same idea that was mentioned on
> CASSANDRA-17811 description almost four months ago. That way we could
> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
> function when applied to not-collection columns, and a scalar function when
> applied to collection. We can't do that with COUNT because there would be
> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
> SIZE. Benedict, please correct me if I'm not correctly exposing the
> proposal.
> >>>>
> >>>> I however would prefer to keep aggregate functions working on
> collections, and keep the names of the new collection functions as
> "COLLECTION_*". Reasons are:
> >>>>
> >>>> 1 - Making aggregate functions not work on collections might be
> cosidered as breaking backward compatibility and require a deprecation
> plan.
> >>>> 2 - Keeping aggregate functions working on collections might not look
> superuseful, but they make the set of aggregate functions consistent and
> applicable to every column type.
> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
> establishes a clear distinction between row aggregations and collection
> aggregations, while at the same time exposing the analogy between each pair
> of functions.
> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
> workarounds such as using the column type when possible, or trying to
> figure out synonyms like in the case of COUNT/SIZE. Even if that works for
> this case, future functions can find more trouble when trying to figure out
> workarounds to avoid clashing with existing function names. For example, we
> might want to add a SIZE function that gets the size in bytes of any
> column, or we might want to add a MAX function that gets the maximum of a
> set of columns, etc. And example of the synonym-based approach that comes
> to mind is MySQL's MAX and GREATEST functions, where MAX is for row
> aggregation and GREATEST is for column aggregation.
> >>>> 5 - If MIN/MAX function selection is based on the column type, we
> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
> COLLECTION_MAX([7]) = 7, which would be very useful for combining
> collection functions with time functions.
> >>>>
> >>>> What do others think? What should we do with aggregate functions on
> collections, collection functions and MAXWRITETIME?
> >>
>


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread J. D. Jordan
MIN/MAX/SUM/AVG would be an aggregate 
>>>> function when applied to not-collection columns, and a scalar function 
>>>> when applied to collection. We can't do that with COUNT because there 
>>>> would be an ambiguity, so the proposal for that case is renaming 
>>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not correctly 
>>>> exposing the proposal.
>>>> 
>>>> I however would prefer to keep aggregate functions working on collections, 
>>>> and keep the names of the new collection functions as "COLLECTION_*". 
>>>> Reasons are:
>>>> 
>>>> 1 - Making aggregate functions not work on collections might be cosidered 
>>>> as breaking backward compatibility and require a deprecation plan. 
>>>> 2 - Keeping aggregate functions working on collections might not look 
>>>> superuseful, but they make the set of aggregate functions consistent and 
>>>> applicable to every column type.
>>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes a 
>>>> clear distinction between row aggregations and collection aggregations, 
>>>> while at the same time exposing the analogy between each pair of functions.
>>>> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds 
>>>> such as using the column type when possible, or trying to figure out 
>>>> synonyms like in the case of COUNT/SIZE. Even if that works for this case, 
>>>> future functions can find more trouble when trying to figure out 
>>>> workarounds to avoid clashing with existing function names. For example, 
>>>> we might want to add a SIZE function that gets the size in bytes of any 
>>>> column, or we might want to add a MAX function that gets the maximum of a 
>>>> set of columns, etc. And example of the synonym-based approach that comes 
>>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row 
>>>> aggregation and GREATEST is for column aggregation.
>>>> 5 - If MIN/MAX function selection is based on the column type, we can't 
>>>> implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>>> collection functions with time functions.
>>>> 
>>>> What do others think? What should we do with aggregate functions on 
>>>> collections, collection functions and MAXWRITETIME?
>> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread Benedict
cographical order of the collection elements. That's the very same 
>>> comparator that is used when collections are used as clustering keys and 
>>> for ORDER BY.
>>> 
>>> However, a bug in the MIN/MAX aggregate functions used to make that the 
>>> results were presented in their unserialized form, although the row 
>>> selection was correct. That bug was recently solved by CASSANDRA-17811. 
>>> During that ticket it was also considered the option of simply disabling 
>>> MIN/MAX on collection since applying those functions to collections, since 
>>> they don't seem super useful. However, that option was quickly discarded 
>>> and the operation was fixed so the MIN/MAX functions correctly work for 
>>> every data type.
>>> 
>>> As a byproduct of the internal improvements of that fix, CASSANDRA-8877 
>>> introduced a new set of functions that can perform aggregations of the 
>>> elements of a collection. Those where named "map_keys", "map_values", 
>>> "collection_min", "collection_max", "collection_sum", and 
>>> "collection_count". Those are the names mentioned on the mail list thread 
>>> about function naming conventions. Despite doing a kind of 
>>> within-collection aggregation, these functions are not what we usually call 
>>> aggregate functions, since they don't aggregate multiple rows together.
>>> 
>>> On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME 
>>> function to get the max timestamp of a multi-cell column. However, the new 
>>> collection functions can be used in combination with the WRITETIME and TTL 
>>> functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell 
>>> column. Since the new functions give a generic way of aggreagting 
>>> timestamps ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to 
>>> remove that MAXWRITETIME function.
>>> 
>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that 
>>> function in favour of the new generic collection functions. However, the 
>>> MAXWRITETIME function can work on both single-cell and multi-cell columns, 
>>> whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell 
>>> columns, That's because MAXWRITETIME of a not-multicell column doesn't 
>>> return a collection, and one should simply use "WRITETIME(column)" instead. 
>>> So it was proposed in CASSANDRA-18037 that collections functions applied to 
>>> a not-collection value consider that value as the only element of a 
>>> singleton collection. So, for example, COLLECTION_MAX(7) = 
>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's 
>>> mostly ready to commit.
>>> 
>>> Now we can go straight to the point:
>>> 
>>> Recently Benedict brought back the idea of deprecating aggregate functions 
>>> applied to collections, the very same idea that was mentioned on 
>>> CASSANDRA-17811 description almost four months ago. That way we could 
>>> rename the new collection functions MIN/MAX/SUM/AVG, same as the classic 
>>> aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate 
>>> function when applied to not-collection columns, and a scalar function when 
>>> applied to collection. We can't do that with COUNT because there would be 
>>> an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to 
>>> SIZE. Benedict, please correct me if I'm not correctly exposing the 
>>> proposal.
>>> 
>>> I however would prefer to keep aggregate functions working on collections, 
>>> and keep the names of the new collection functions as "COLLECTION_*". 
>>> Reasons are:
>>> 
>>> 1 - Making aggregate functions not work on collections might be cosidered 
>>> as breaking backward compatibility and require a deprecation plan. 
>>> 2 - Keeping aggregate functions working on collections might not look 
>>> superuseful, but they make the set of aggregate functions consistent and 
>>> applicable to every column type.
>>> 3 - Using the "COLLECTION_" prefix on collection functions establishes a 
>>> clear distinction between row aggregations and collection aggregations, 
>>> while at the same time exposing the analogy between each pair of functions.
>>> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds 
>>> such as using the column type when possible, or trying to figure out 
>>> synonyms like in the case of COUNT/SIZE. Even if that works for this case, 
>>> future functions can find more trouble when trying to figure out 
>>> workarounds to avoid clashing with existing function names. For example, we 
>>> might want to add a SIZE function that gets the size in bytes of any 
>>> column, or we might want to add a MAX function that gets the maximum of a 
>>> set of columns, etc. And example of the synonym-based approach that comes 
>>> to mind is MySQL's MAX and GREATEST functions, where MAX is for row 
>>> aggregation and GREATEST is for column aggregation.
>>> 5 - If MIN/MAX function selection is based on the column type, we can't 
>>> implement Yifan's proposal of making COLLECTION_MAX(7) = 
>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining 
>>> collection functions with time functions.
>>> 
>>> What do others think? What should we do with aggregate functions on 
>>> collections, collection functions and MAXWRITETIME?
> 


Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread Jeremiah D Jordan
;> the operation was fixed so the MIN/MAX functions correctly work for every 
>> data type.
>> 
>> As a byproduct of the internal improvements of that fix, CASSANDRA-8877 
>> introduced a new set of functions that can perform aggregations of the 
>> elements of a collection. Those where named "map_keys", "map_values", 
>> "collection_min", "collection_max", "collection_sum", and 
>> "collection_count". Those are the names mentioned on the mail list thread 
>> about function naming conventions. Despite doing a kind of within-collection 
>> aggregation, these functions are not what we usually call aggregate 
>> functions, since they don't aggregate multiple rows together.
>> 
>> On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME 
>> function to get the max timestamp of a multi-cell column. However, the new 
>> collection functions can be used in combination with the WRITETIME and TTL 
>> functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell 
>> column. Since the new functions give a generic way of aggreagting timestamps 
>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to remove that 
>> MAXWRITETIME function.
>> 
>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove that 
>> function in favour of the new generic collection functions. However, the 
>> MAXWRITETIME function can work on both single-cell and multi-cell columns, 
>> whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell 
>> columns, That's because MAXWRITETIME of a not-multicell column doesn't 
>> return a collection, and one should simply use "WRITETIME(column)" instead. 
>> So it was proposed in CASSANDRA-18037 that collections functions applied to 
>> a not-collection value consider that value as the only element of a 
>> singleton collection. So, for example, COLLECTION_MAX(7) = 
>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's 
>> mostly ready to commit.
>> 
>> Now we can go straight to the point:
>> 
>> Recently Benedict brought back the idea of deprecating aggregate functions 
>> applied to collections, the very same idea that was mentioned on 
>> CASSANDRA-17811 description almost four months ago. That way we could rename 
>> the new collection functions MIN/MAX/SUM/AVG, same as the classic aggregate 
>> functions. That way MIN/MAX/SUM/AVG would be an aggregate function when 
>> applied to not-collection columns, and a scalar function when applied to 
>> collection. We can't do that with COUNT because there would be an ambiguity, 
>> so the proposal for that case is renaming COLLECTION_COUNT to SIZE. 
>> Benedict, please correct me if I'm not correctly exposing the proposal.
>> 
>> I however would prefer to keep aggregate functions working on collections, 
>> and keep the names of the new collection functions as "COLLECTION_*". 
>> Reasons are:
>> 
>> 1 - Making aggregate functions not work on collections might be cosidered as 
>> breaking backward compatibility and require a deprecation plan. 
>> 2 - Keeping aggregate functions working on collections might not look 
>> superuseful, but they make the set of aggregate functions consistent and 
>> applicable to every column type.
>> 3 - Using the "COLLECTION_" prefix on collection functions establishes a 
>> clear distinction between row aggregations and collection aggregations, 
>> while at the same time exposing the analogy between each pair of functions.
>> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds 
>> such as using the column type when possible, or trying to figure out 
>> synonyms like in the case of COUNT/SIZE. Even if that works for this case, 
>> future functions can find more trouble when trying to figure out workarounds 
>> to avoid clashing with existing function names. For example, we might want 
>> to add a SIZE function that gets the size in bytes of any column, or we 
>> might want to add a MAX function that gets the maximum of a set of columns, 
>> etc. And example of the synonym-based approach that comes to mind is MySQL's 
>> MAX and GREATEST functions, where MAX is for row aggregation and GREATEST is 
>> for column aggregation.
>> 5 - If MIN/MAX function selection is based on the column type, we can't 
>> implement Yifan's proposal of making COLLECTION_MAX(7) = COLLECTION_MAX([7]) 
>> = 7, which would be very useful for combining collection functions with time 
>> functions.
>> 
>> What do others think? What should we do with aggregate functions on 
>> collections, collection functions and MAXWRITETIME?



Re: Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread Benedict
w collection functions MIN/MAX/SUM/AVG, same as the classic aggregate 
> functions. That way MIN/MAX/SUM/AVG would be an aggregate function when 
> applied to not-collection columns, and a scalar function when applied to 
> collection. We can't do that with COUNT because there would be an ambiguity, 
> so the proposal for that case is renaming COLLECTION_COUNT to SIZE. Benedict, 
> please correct me if I'm not correctly exposing the proposal.
> 
> I however would prefer to keep aggregate functions working on collections, 
> and keep the names of the new collection functions as "COLLECTION_*". Reasons 
> are:
> 
> 1 - Making aggregate functions not work on collections might be cosidered as 
> breaking backward compatibility and require a deprecation plan. 
> 2 - Keeping aggregate functions working on collections might not look 
> superuseful, but they make the set of aggregate functions consistent and 
> applicable to every column type.
> 3 - Using the "COLLECTION_" prefix on collection functions establishes a 
> clear distinction between row aggregations and collection aggregations, while 
> at the same time exposing the analogy between each pair of functions.
> 4 - Not using the "COLLECTION_" prefix forces us to search for workarounds 
> such as using the column type when possible, or trying to figure out synonyms 
> like in the case of COUNT/SIZE. Even if that works for this case, future 
> functions can find more trouble when trying to figure out workarounds to 
> avoid clashing with existing function names. For example, we might want to 
> add a SIZE function that gets the size in bytes of any column, or we might 
> want to add a MAX function that gets the maximum of a set of columns, etc. 
> And example of the synonym-based approach that comes to mind is MySQL's MAX 
> and GREATEST functions, where MAX is for row aggregation and GREATEST is for 
> column aggregation.
> 5 - If MIN/MAX function selection is based on the column type, we can't 
> implement Yifan's proposal of making COLLECTION_MAX(7) = COLLECTION_MAX([7]) 
> = 7, which would be very useful for combining collection functions with time 
> functions.
> 
> What do others think? What should we do with aggregate functions on 
> collections, collection functions and MAXWRITETIME?


Aggregate functions on collections, collection functions and MAXWRITETIME

2022-12-06 Thread Andrés de la Peña
X([7]) = 7, which would be very useful for combining
collection functions with time functions.

What do others think? What should we do with aggregate functions on
collections, collection functions and MAXWRITETIME?