"ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in a list
are not an uncommon practice. You can find them in SparkSQL, Amazon
Timestream, Teradata, etc. Since we have what we call collections instead
or arrays, it makes sense to call the analogous functions "COLLECTION_MAX",
"COLLECTION_MIN", etc.

As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and "MAP_VALUES"
functions to get the keys or the values of a map, so one can feed them to
"MAX", "COLLECTION_MAX", etc. That isn't anything too original either;
you can find identical functions in Spark SQL, for example.
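
For example (again with a made-up schema), the map functions compose
the same way:

```sql
-- Hypothetical schema, just for illustration.
CREATE TABLE events (id int PRIMARY KEY, counts map<text, int>);

-- Within-row: the largest value stored in each row's map.
SELECT COLLECTION_MAX(MAP_VALUES(counts)) FROM events;

-- Across rows: the largest value in any row's map.
SELECT MAX(COLLECTION_MAX(MAP_VALUES(counts))) FROM events;
```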

I find simple utility functions easier to use than subqueries. But we don't
have to choose. We can also have subqueries if someone finds the time to
work on them.

On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev <
dev@cassandra.apache.org> wrote:

>  I think the semantics of the situation is important here.
>
>
> Let’s take MAX as our example aggregate function.
>
>
> We all expect that in a DB context MAX(column) will return the value of
> the column with the maximum value. That is the expected semantics of MAX.
>
>
> The question here is that there are data types that are multi-valued, and
> there is a desire to apply MAX to the values within the column. I would
> expect that this would return the maximum value within the column for
> every row in the DB.
>
>
> Suppose there were a keyword that operated like the Java BiFunction class,
> where the Function would apply a second function to the column data. For
> the purposes of this discussion let’s call this Function APPLY.
>
>
> So APPLY( MAX, column ) would return the maximum value from the column for
> each row in the DB.
>
>
> MAX(APPLY(MAX,column)) would get the maximum value from the column across
> all the rows.
>
>
> Similarly, APPLY could be used with other functions: MAX(APPLY(MIN,column))
> would return the largest minimum value from the column across all rows.
>
>
> These statements make clear semantically what is being asked for.
>
> On Thu, Dec 8, 2022 at 10:57 AM Benedict <bened...@apache.org> wrote:
>
>> I meant unnest, not unwrap.
>>
>> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote:
>>
>>
>> I do not think we should have functions that aggregate across rows and
>> functions that operate within a row use the same name.
>>
>>
>> I’m sympathetic to that view for sure. I wouldn’t be too disappointed by
>> that outcome, and SQL engines seem to take a similar approach; however,
>> they mostly rely on subqueries to get around this problem, and the SQL
>> standard introduces UNWRAP for operating on arrays (by translating them
>> into a table), permitting subqueries to aggregate them. It seems to me we
>> have four options:
>>
>> 1) introduce functionality similar to UNWRAP and subqueries
>> 2) introduce new syntax to permit operating on collections with the same
>> functions
>> 3) permit the same functions to operate on both, with a precedence order,
>> and introduce syntax to permit breaking the precedence order
>> 4) introduce new functions
>>
>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS
>> max_item FROM table
>>
>> (2) and (3) might look something like:
>>
>> SELECT MAX(list AS COLLECTION) or
>> SELECT MAX(list AS ROWS)
>>
>> (4) might look something like we have already, but perhaps with different
>> names
>>
>> The comparator for collections is the lexicographical compare on the
>> collection items
>>
>>
>> This is a fair point, I mistakenly thought it sorted first on size. Even
>> this definition is a little funkier for Map types, where the values of a
>> key may cause something to sort earlier than a map whose next key sorts
>> first. There are multiple potential lexicographical sorts for Maps (i.e.,
>> by keys first, then values, or by (key, value) pairs), so this is
>> particularly poorly defined IMO.
>>
>> The maximum of a blob type is pretty well defined I think, as are
>> boolean, inetaddress, etc. However, even for List or Set collections
>> there are multiple reasonable functions one could define for maximum, so
>> it would make more sense to me to permit the user to define the
>> comparison as part of the MAX function if we are to offer it. However,
>> with the lexicographical definition we have I am somewhat less concerned
>> for Set and List. Map seems like a real problem though, if we support
>> these operators (which perhaps we do not).
>>
>>
>> On 7 Dec 2022, at 12:13, Andrés de la Peña <adelap...@apache.org> wrote:
>>
>> 
>> The comparator for collections is the lexicographical compare on the
>> collection items. That might not be the most useful thing, but it's not
>> impossible to imagine cases where that ordering can be useful. To make a
>> random example, you can use a list column to store the name and surnames
>> of a person, considering that some people can have multiple surnames. You
>> can then sort rows based on that list column, to get the names in
>> alphabetical order, or to get the first or last person according to that
>> order. I'm sure we can think of more cases where the lexicographical
>> order of a list can be useful, although I agree that it's not the
>> greatest feature ever.
>>
>> It's worth mentioning that collections are not the only data types where
>> the MIN/MAX functions are of dubious utility. For example, blob columns can
>> also be used with MIN/MAX. Same as with collections, the min/max blobs are
>> selected according to the comparator for the data type. That comparator is
>> the lexicographic compare on the unsigned values of the byte contents.
>>
>> The utility of MIN/MAX on inet and boolean columns isn't very clear
>> either, although one can always imagine use cases. For example, MAX of a
>> boolean column can be used as a logical disjunction.
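>>
>> As a sketch (with a made-up table "t" and boolean column "b"), that
>> disjunction reading would be:
>>
>> ```sql
>> -- true if any selected row has b = true, i.e. OR across rows;
>> -- assumes booleans compare with false < true
>> SELECT MAX(b) FROM t;
>> -- MIN(b) would likewise act as a conjunction (AND) across rows
>> SELECT MIN(b) FROM t;
>> ```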
>>
>> If we were to special-case MIN/MAX functions to reject collections, we
>> should also reject other data types such as, at least, blobs. That would
>> require a deprecation plan.
>>
>> Also, it's not that the comparator used by MIN/MAX is some obscure
>> internal thing. The action of that comparator is very visible when any of
>> those data
>> types is used in a clustering column, and it's used as the basis for "ORDER
>> BY" clauses. Should we also reject blobs, collections, tuples and UDTs on
>> "ORDER BY"? I don't think so.
>>
>> I rather think that basing MIN/MAX on the regular order of the column
>> data type is consistent, easy to do and easy to understand.
>>
>> I don't see the need to add rules explicitly forbidding some data types
>> on MIN/MAX functions just because we can't easily figure out a use case for
>> their ordering. Especially when we are exposing that same ordering on
>> clusterings and "ORDER BY".
>>
>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com>
>> wrote:
>>
>>> If the functionality truly has never actually worked, then throwing an
>>> error that MAX is not supported for collections seems reasonable.
>>>
>>> But we should throw an error, I do not think we should have functions
>>> that aggregate across rows and functions that operate within a row use the
>>> same name.
>>>
>>> My expectation as a user would be that MAX either always aggregates
>>> across rows, so results in a single row of output or always operates within
>>> a row, so returns the full set of rows matching the query.
>>>
>>> So if we want a max that aggregates across rows that works for
>>> collections we could change it to return the aggregated max across all
>>> rows. Or we just leave it as an error and if someone wants the max across
>>> all rows they would ask for MAX(COLLECTION_MAX(column)). Yes I still agree
>>> COLLECTION_MAX may be a bad name.
>>>
>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote:
>>> >
>>> > As far as I am aware it has never worked in a release, and so
>>> deprecating it is probably not as challenging as you think. Only folk that
>>> have been able to parse the raw bytes of the collection in storage format
>>> would be affected - which we can probably treat as zero.
>>> >
>>> >
>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan <jeremiah.jor...@gmail.com>
>>> wrote:
>>> >>
>>> >> 
>>> >>>
>>> >>> 1. I think it is a mistake to offer a function MAX that operates
>>> over rows containing collections, returning the collection with the most
>>> elements. This is just a nonsensical operation to support IMO. We should
>>> decide as a community whether we “fix” this aggregation, or remove it.
>>> >>
>>> >> The current MAX function does not work this way afaik?  It returns
>>> the row whose column has the highest value in the clustering-order
>>> sense, as if the collection were used as a clustering key.  While that
>>> also may have limited use, I don’t think it worthwhile to deprecate such
>>> use, with all the headache that comes with doing so.
>>> >>
>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for
>>> discovery, and all-else equal it would be better to use MAX,MIN, etc, same
>>> as for aggregations.
>>> >>
>>> >> If we actually wanted to move towards using the existing names with
>>> new meanings, then I think that would take us multiple major releases.
>>> First deprecate existing use in current releases.  Then make it an error in
>>> the next major release X.  Then change the behavior in major release X+1.
>>> Just switching the behavior without having a major where such queries error
>>> out would make a bunch of user queries start returning “wrong” data.
>>> >> Also I don’t think those functions being cross row aggregations for
>>> some column types, but within row collection operations for other types, is
>>> any more intuitive, and actually would be more confusing.  So I am -1 on
>>> using the same names.
>>> >>
>>> >>> 3. I think it is peculiar to permit methods named collection_ to
>>> operate over non-collection types when they are explicitly collection
>>> variants.
>>> >>
>>> >> While I could see some point to this, I do not think it would be
>>> confusing for something named collection_XXX to treat a non-collection
>>> as a collection of 1.  But maybe there is a better name for these
>>> functions.  Rather than seeing them as collection variants, we should
>>> see them as variants that operate on the data in a single row, rather
>>> than aggregating across multiple rows.  But even with that perspective I
>>> don’t know what the best name would be.
>>> >>
>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote:
>>> >>>
>>> >>> Thanks Andres, I think community input on direction here will be
>>> invaluable. There’s a bunch of interrelated tickets, and my opinions are as
>>> follows:
>>> >>>
>>> >>> 1. I think it is a mistake to offer a function MAX that operates
>>> over rows containing collections, returning the collection with the most
>>> elements. This is just a nonsensical operation to support IMO. We should
>>> decide as a community whether we “fix” this aggregation, or remove it.
>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for
>>> discovery, and all-else equal it would be better to use MAX,MIN, etc, same
>>> as for aggregations.
>>> >>> 3. I think it is peculiar to permit methods named collection_ to
>>> operate over non-collection types when they are explicitly collection
>>> variants.
>>> >>>
>>> >>> Given (1), (2) becomes simple except for COUNT which remains
>>> ambiguous, but this could be solved by either providing a separate method
>>> for collections (e.g. SIZE) which seems fine to me, or by offering a
>>> precedence order for matching and a keyword for overriding the precedence
>>> order (e.g. COUNT(collection AS COLLECTION)).
>>> >>>
>>> >>> Given (2), (3) is a little more difficult. However, I think this can
>>> be solved several ways.
>>> >>> - We could permit explicit casts to collection types, that for a
>>> collection type would be a no-op, and for a single value would create a
>>> collection
>>> >>> - With precedence orders, by always selecting the scalar function
>>> last
>>> >>> - By permitting WRITETIME to accept a binary operator reduce
>>> function to resolve multiple values
>>> >>>
>>> >>> These decisions all imply trade-offs on each other, and affect the
>>> evolution of CQL, so I think community input would be helpful.
>>> >>>
>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña <adelap...@apache.org>
>>> wrote:
>>> >>>>
>>> >>>> 
>>> >>>> This will require some long introduction for context:
>>> >>>>
>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max
>>> column value according to their comparator. For collections, the comparison
>>> is on the lexicographical order of the collection elements. That's the very
>>> same comparator that is used when collections are used as clustering keys
>>> and for ORDER BY.
>>> >>>>
>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make the
>>> results be presented in their raw serialized form, although the row
>>> selection was correct. That bug was recently solved by CASSANDRA-17811.
>>> During that ticket the option of simply disabling MIN/MAX on collections
>>> was also considered, since applying those functions to collections
>>> doesn't seem super useful. However, that option was quickly discarded
>>> and the operation was fixed so the MIN/MAX functions correctly work for
>>> every data type.
>>> >>>>
>>> >>>> As a byproduct of the internal improvements of that fix,
>>> CASSANDRA-8877 introduced a new set of functions that can perform
>>> aggregations of the elements of a collection. Those were named
>>> "map_keys", "map_values", "collection_min", "collection_max",
>>> "collection_sum", and "collection_count". Those are the names mentioned
>>> on the mailing list thread about function naming conventions. Despite
>>> doing a kind of within-collection aggregation, these functions are not
>>> what we usually call aggregate functions, since they don't aggregate
>>> multiple rows together.
>>> >>>>
>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a
>>> MAXWRITETIME function to get the max timestamp of a multi-cell column.
>>> However, the new collection functions can be used in combination with
>>> the WRITETIME and TTL functions to retrieve the min/max/sum/avg
>>> timestamp or ttl of a multi-cell column. Since the new functions give a
>>> generic way of aggregating the timestamps and TTLs of multi-cell
>>> columns, CASSANDRA-18078 proposed to remove that MAXWRITETIME function.
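>>> >>>>
>>> >>>> For instance (sketching with a hypothetical table "t" and a
>>> multi-cell set<text> column "s"), the generic combination that would
>>> replace MAXWRITETIME could look like:
>>> >>>>
>>> ```sql
>>> -- WRITETIME(s) yields the per-cell timestamps of the multi-cell
>>> -- column; COLLECTION_MAX then picks the largest, i.e. the most
>>> -- recent write time.
>>> SELECT COLLECTION_MAX(WRITETIME(s)) FROM t;
>>> ```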
>>> >>>>
>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove
>>> that function in favour of the new generic collection functions.
>>> However, the MAXWRITETIME function can work on both single-cell and
>>> multi-cell columns, whereas "COLLECTION_MAX(WRITETIME(column))" would
>>> only work on multi-cell columns. That's because MAXWRITETIME of a
>>> non-multi-cell column doesn't return a collection, and one should simply
>>> use "WRITETIME(column)" instead. So it was proposed in CASSANDRA-18037
>>> that collection functions applied to a non-collection value treat that
>>> value as the only element of a singleton collection. So, for example,
>>> COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7. That ticket has already
>>> been reviewed and it's mostly ready to commit.
>>> >>>>
>>> >>>> Now we can go straight to the point:
>>> >>>>
>>> >>>> Recently Benedict brought back the idea of deprecating aggregate
>>> functions applied to collections, the very same idea that was mentioned
>>> in the CASSANDRA-17811 description almost four months ago. That way we
>>> could rename the new collection functions MIN/MAX/SUM/AVG, the same as
>>> the classic aggregate functions: MIN/MAX/SUM/AVG would be an aggregate
>>> function when applied to non-collection columns, and a scalar function
>>> when applied to collections. We can't do that with COUNT because there
>>> would be an ambiguity, so the proposal for that case is renaming
>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not
>>> presenting the proposal correctly.
>>> >>>>
>>> >>>> I however would prefer to keep aggregate functions working on
>>> collections, and keep the names of the new collection functions as
>>> "COLLECTION_*". Reasons are:
>>> >>>>
>>> >>>> 1 - Making aggregate functions not work on collections might be
>>> considered as breaking backward compatibility and would require a
>>> deprecation plan.
>>> >>>> 2 - Keeping aggregate functions working on collections might not
>>> look super useful, but it makes the set of aggregate functions
>>> consistent and applicable to every column type.
>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions
>>> establishes a clear distinction between row aggregations and collection
>>> aggregations, while at the same time exposing the analogy between each
>>> pair of functions.
>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for
>>> workarounds such as using the column type when possible, or trying to
>>> figure out synonyms like in the case of COUNT/SIZE. Even if that works
>>> for this case, future functions may find more trouble when trying to
>>> figure out workarounds to avoid clashing with existing function names.
>>> For example, we might want to add a SIZE function that gets the size in
>>> bytes of any column, or we might want to add a MAX function that gets
>>> the maximum of a set of columns, etc. An example of the synonym-based
>>> approach that comes to mind is MySQL's MAX and GREATEST functions, where
>>> MAX is for row aggregation and GREATEST is for column aggregation.
>>> >>>> 5 - If MIN/MAX function selection is based on the column type, we
>>> can't implement Yifan's proposal of making COLLECTION_MAX(7) =
>>> COLLECTION_MAX([7]) = 7, which would be very useful for combining
>>> collection functions with time functions.
>>> >>>>
>>> >>>> What do others think? What should we do with aggregate functions on
>>> collections, collection functions and MAXWRITETIME?
>>> >>
>>>
>>
