Aggregate functions on collections, collection functions and MAXWRITETIME

Andrés de la Peña Tue, 06 Dec 2022 04:44:13 -0800

This will require some long introduction for context:

The MAX/MIN functions aggregate rows to get the row with min/max column
value according to their comparator. For collections, the comparison is on
the lexicographical order of the collection elements. That's the very same
comparator that is used when collections are used as clustering keys and
for ORDER BY.


However, a bug in the MIN/MAX aggregate functions used to make that the
results were presented in their unserialized form, although the row
selection was correct. That bug was recently solved by CASSANDRA-17811.
During that ticket it was also considered the option of simply disabling
MIN/MAX on collection since applying those functions to collections, since
they don't seem super useful. However, that option was quickly discarded
and the operation was fixed so the MIN/MAX functions correctly work for
every data type.

As a byproduct of the internal improvements of that fix, CASSANDRA-8877
introduced a new set of functions that can perform aggregations of the
elements of a collection. Those where named "map_keys", "map_values",
"collection_min", "collection_max", "collection_sum", and
"collection_count". Those are the names mentioned on the mail list thread
about function naming conventions. Despite doing a kind of
within-collection aggregation, these functions are not what we usually call
aggregate functions, since they don't aggregate multiple rows together.

On a different line of work, CASSANDRA-17425 added to trunk a MAXWRITETIME
function to get the max timestamp of a multi-cell column. However, the new
collection functions can be used in combination with the WRITETIME and TTL
functions to retrieve the min/max/sum/avg timestamp or ttl of a multi-cell
column. Since the new functions give a generic way of aggreagting
timestamps ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to
remove that MAXWRITETIME function.

Yifan Cai, author of the MAXWRITETIME function, agreed to remove that
function in favour of the new generic collection functions. However, the
MAXWRITETIME function can work on both single-cell and multi-cell columns,
whereas "COLLECTION_MAX(WRITETIME(column))" would only work on multi-cell
columns, That's because MAXWRITETIME of a not-multicell column doesn't
return a collection, and one should simply use "WRITETIME(column)" instead.
So it was proposed in CASSANDRA-18037 that collections functions applied to
a not-collection value consider that value as the only element of a
singleton collection. So, for example, COLLECTION_MAX(7) =
COLLECTION_MAX([7]) = 7. That ticket has already been reviewed and it's
mostly ready to commit.

Now we can go straight to the point:

Recently Benedict brought back the idea of deprecating aggregate functions
applied to collections, the very same idea that was mentioned on
CASSANDRA-17811 description almost four months ago. That way we could
rename the new collection functions MIN/MAX/SUM/AVG, same as the classic
aggregate functions. That way MIN/MAX/SUM/AVG would be an aggregate
function when applied to not-collection columns, and a scalar function when
applied to collection. We can't do that with COUNT because there would be
an ambiguity, so the proposal for that case is renaming COLLECTION_COUNT to
SIZE. Benedict, please correct me if I'm not correctly exposing the
proposal.

I however would prefer to keep aggregate functions working on collections,
and keep the names of the new collection functions as "COLLECTION_*".
Reasons are:

1 - Making aggregate functions not work on collections might be cosidered
as breaking backward compatibility and require a deprecation plan.
2 - Keeping aggregate functions working on collections might not look
superuseful, but they make the set of aggregate functions consistent and
applicable to every column type.
3 - Using the "COLLECTION_" prefix on collection functions establishes a
clear distinction between row aggregations and collection aggregations,
while at the same time exposing the analogy between each pair of functions.
4 - Not using the "COLLECTION_" prefix forces us to search for workarounds
such as using the column type when possible, or trying to figure out
synonyms like in the case of COUNT/SIZE. Even if that works for this case,
future functions can find more trouble when trying to figure out
workarounds to avoid clashing with existing function names. For example, we
might want to add a SIZE function that gets the size in bytes of any
column, or we might want to add a MAX function that gets the maximum of a
set of columns, etc. And example of the synonym-based approach that comes
to mind is MySQL's MAX and GREATEST functions, where MAX is for row
aggregation and GREATEST is for column aggregation.
5 - If MIN/MAX function selection is based on the column type, we can't
implement Yifan's proposal of making COLLECTION_MAX(7) =
COLLECTION_MAX([7]) = 7, which would be very useful for combining
collection functions with time functions.

What do others think? What should we do with aggregate functions on
collections, collection functions and MAXWRITETIME?

Aggregate functions on collections, collection functions and MAXWRITETIME

Reply via email to