It feels like this is a recurring kind of discussion, and I wonder if there’s any value in deciding on a general approach to guide these discussions in future? Are we aiming to look like SQL as we evolve, and if so which products do we want to be informed by?
I expect we’ll rehash it every API thread otherwise. > On 8 Dec 2022, at 17:37, Benedict <bened...@apache.org> wrote: > > > >>> 1) Do they offer ARRAY_SUM or ARRAY_AVG? >> Yes, a quick search on Google shows some examples: >> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg >> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g >> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum >> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html > > We seem to be mixing and matching our databases we use as templates here. > Most of the standard examples we use (Postgres, MySQL etc) do not offer this, > and do not offer equivalent functions on any of their other collection types. > > If we are to use industry norms as inspiration or justification, subqueries > do seem to be the approach taken. So the question is really why we should > introduce non-standard variants of fairly atypical functions instead of the > more typical approach of subqueries? Because it’s more work, or some other > reason? We should be explicit. > > If we aren’t using industry norms as our justification, we should probably > jettison the lot as honestly the landscape is a mess in this area. > >> >>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH? >> Yes, again we can search for some examples: >> https://docs.snowflake.com/en/sql-reference/functions/array_size.html >> https://docs.databricks.com/sql/language-manual/functions/array_size.html > > My point here was that none I could find use ARRAY_COUNT - either ARRAY_SIZE > or ARRAY_LENGTH > >>> 3) A map is a collection in C* parlance, but I gather from below you expect >>> these methods not to operate on them? >> Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN >> and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG >> require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG >> functions above require a numeric array. > > Should collection_max only work on list and set when we can call MAX(map) > today (at least I think we can)? Whatever we decide our aggregate behaviour > is should be consistent here. > >>> Does ARRAY_MAX operate on single values? If we are to base our decisions on >>> norms elsewhere, we should be consistent about it. >> It doesn't in any the examples above. Those functions aren't a standard so I >> don't know if there are others dbs around that support it. In any case, the >> fact that we look for inspiration in other databases to minimize surprise >> etc. doesn't mean that we have to do exactly the same. After all, CQL is not >> SQL and our collections aren't SQL arrays. >> >> Note that the collection functions added by CASSANDRA-8877 don't operate on >> single values either. That idea was proposed by Yifan on CASSANDRA-18078 and >> it looked good to Francisco and me. The patch is on CASSANDRA-18085, already >> reviewed and blocked waiting on the outcome of this discussion. >> >> The new collection functions can do the same as the new MAXWRITE function, >> but not only for getting max timestamps, but also min timestamps and min/max >> ttls. The missing part is that MAXWRITE can accept both collections and >> single elements, so callers don't need to know the type of the column. >> That's the motivation behind the idea of doing the same with the collection >> functions, so they can entirely replace MAXWRITE. >> >> However I wouldn't be against leaving the collection functions working only >> on collections, as originally designed, and as they currently are on trunk. >> The question is what we do with MAXWRITETIME. That function is also only on >> trunk, and it might be repetitive given the more generic collection >> functions. It's also a bit odd that there isn't, for example, a similar >> MINTTL function. Maybe we should start a separate discussion thread about >> that new function? > > I think we should figure out our overall strategy - these are all pieces of > the puzzle IMO. But I guess the above questions seem to come first and will > shape this. I would be in favour of some general approach, however, such as > either first casting to a collection, or passing an aggregation operator to > WRITETIME. > > > >>> On 8 Dec 2022, at 17:13, Andrés de la Peña <adelap...@apache.org> wrote: >>> >> >>> 1) Do they offer ARRAY_SUM or ARRAY_AVG? >> Yes, a quick search on Google shows some examples: >> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg >> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g >> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum >> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html >> >>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH? >> Yes, again we can search for some examples: >> https://docs.snowflake.com/en/sql-reference/functions/array_size.html >> https://docs.databricks.com/sql/language-manual/functions/array_size.html >> >>> 3) A map is a collection in C* parlance, but I gather from below you expect >>> these methods not to operate on them? >> Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN >> and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG >> require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG >> functions above require a numeric array. >> >>> Does ARRAY_MAX operate on single values? If we are to base our decisions on >>> norms elsewhere, we should be consistent about it. >> It doesn't in any the examples above. Those functions aren't a standard so I >> don't know if there are others dbs around that support it. In any case, the >> fact that we look for inspiration in other databases to minimize surprise >> etc. doesn't mean that we have to do exactly the same. After all, CQL is not >> SQL and our collections aren't SQL arrays. >> >> Note that the collection functions added by CASSANDRA-8877 don't operate on >> single values either. That idea was proposed by Yifan on CASSANDRA-18078 and >> it looked good to Francisco and me. The patch is on CASSANDRA-18085, already >> reviewed and blocked waiting on the outcome of this discussion. >> >> The new collection functions can do the same as the new MAXWRITE function, >> but not only for getting max timestamps, but also min timestamps and min/max >> ttls. The missing part is that MAXWRITE can accept both collections and >> single elements, so callers don't need to know the type of the column. >> That's the motivation behind the idea of doing the same with the collection >> functions, so they can entirely replace MAXWRITE. >> >> However I wouldn't be against leaving the collection functions working only >> on collections, as originally designed, and as they currently are on trunk. >> The question is what we do with MAXWRITETIME. That function is also only on >> trunk, and it might be repetitive given the more generic collection >> functions. It's also a bit odd that there isn't, for example, a similar >> MINTTL function. Maybe we should start a separate discussion thread about >> that new function? >> >> >> >>> On Thu, 8 Dec 2022 at 14:21, Benedict <bened...@apache.org> wrote: >>> 1) Do they offer ARRAY_SUM or ARRAY_AVG? >>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH? >>> 3) A map is a collection in C* parlance, but I gather from below you expect >>> these methods not to operate on them? >>> >>> Does ARRAY_MAX operate on single values? If we are to base our decisions on >>> norms elsewhere, we should be consistent about it. >>> >>> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. Some >>> databases also have Set or Map types, such as MySQL’s Set and Postgres’ >>> hstore. These databases only support ARRAY_ functions, seemingly, plus >>> special MULTISET operators defined by the SQL standard where that data type >>> is supported. >>> >>> >>> >>>>> On 8 Dec 2022, at 12:11, Andrés de la Peña <adelap...@apache.org> wrote: >>>>> >>>> >>>> "ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in a list >>>> are not an uncommon practice. You can find them in SparkSQL, Amazon >>>> Timestream, Teradata, etc. Since we have what we call collections instead >>>> or arrays, it makes sense to call the analogous functions >>>> "COLLECTION_MAX", "COLLECTION_MIN", etc. >>>> >>>> As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and "MAP_VALUES" >>>> functions to get the keys or the values of a map, so one can feed them to >>>> "MAX", "COLLECTION_MAX", etc. That isn't anything too original either, you >>>> can find identical functions on SparkSQL for example. >>>> >>>> I find simple utility functions easier to use than subqueries. But we >>>> don't have to chose. We can also have subqueries if someone finds the time >>>> to work on them. >>>> >>>>> On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev >>>>> <dev@cassandra.apache.org> wrote: >>>>> I think the semantics of the situation is important here. >>>>> >>>>> Let’s take MAX as our example aggregate function.. >>>>> >>>>> We all expect that in a DB context MAX(column) will return the value of >>>>> the column with the maximum value. That is the expected semantics of MAX. >>>>> >>>>> The question here is that there are data types that are multi-valued and >>>>> there is a desire to apply MAX to the values within the column. I would >>>>> expect that this would return the maximum value of the column every row >>>>> in the DB. >>>>> >>>>> So if there were a keyword that operated like the Java BiFunction class >>>>> where the Function would apply a second function to the column data. For >>>>> purposes of this discussion let’s call this Function APPLY. >>>>> >>>>> So APPLY( MAX, column ) would return the maximum value from the column >>>>> for each row in the DB. >>>>> >>>>> MAX(APPLY(MAX,column)) would get the maximum value from the column across >>>>> all the rows. >>>>> >>>>> Similarly APPLY could be used with other functions MAX(APPLY(MIN,column)) >>>>> the largest minimum value from the column across all rows. >>>>> >>>>> These statements make clear semantically what is being asked for. >>>>> >>>>>> On Thu, Dec 8, 2022 at 10:57 AM Benedict <bened...@apache.org> wrote: >>>>>> I meant unnest, not unwrap. >>>>>> >>>>>>>> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote: >>>>>>>> >>>>>>> >>>>>>> >>>>>>>> I do not think we should have functions that aggregate across rows and >>>>>>>> functions that operate within a row use the same name. >>>>>>> >>>>>>> I’m sympathetic to that view for sure. I wouldn’t be too disappointed >>>>>>> by that outcome, and SQL engines seem to take a similar approach, >>>>>>> however they mostly rely on sub-queries to get around this problem, and >>>>>>> the SQL standard introduces UNWRAP for operating on arrays (by >>>>>>> translating them into a table), permitting subqueries to aggregate >>>>>>> them. It seems to me we have four options: >>>>>>> >>>>>>> 1) introduce functionality similar to UNWRAP and subqueries >>>>>>> 2) introduce new syntax to permit operating on collections with the >>>>>>> same functions >>>>>>> 3) permit the same functions to operate on both, with a precedence >>>>>>> order, and introduce syntax to permit breaking the precedence order >>>>>>> 4) introduce new functions >>>>>>> >>>>>>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS >>>>>>> max_item FROM table >>>>>>> >>>>>>> (2) and (3) might look something like: >>>>>>> >>>>>>> SELECT MAX(list AS COLLECTION) or >>>>>>> SELECT MAX(list AS ROWS) >>>>>>> >>>>>>> (4) might look something like we have already, but perhaps with >>>>>>> different names >>>>>>> >>>>>>>> The comparator for collections is the lexicographical compare on the >>>>>>>> collection items >>>>>>> >>>>>>> This is a fair point, I mistakenly thought it sorted first on size. >>>>>>> Even this definition is a little funkier for Map types, where the >>>>>>> values of a key may cause something to sort earlier than a map whose >>>>>>> next key sorts first. There are multiple potential lexicographical >>>>>>> sorts for Maps (i.e., by keys first, then values, or by (key, value) >>>>>>> pairs), so this is particularly poorly defined IMO. >>>>>>> >>>>>>> The maximum of a blob type is pretty well defined I think, as are >>>>>>> boolean, inetaddress etc. However, even for List or Set collections >>>>>>> there’s multiple reasonable functions one could define for maximum, so >>>>>>> it would make more sense to me to permit the user to define the >>>>>>> comparison as part of the MAX function if we are to offer it. However, >>>>>>> with the lexicographical definition we have I am somewhat less >>>>>>> concerned for Set and List. Map seems like a real problem though, if we >>>>>>> support these operators (which perhaps we do not). >>>>>>> >>>>>>> >>>>>>>>> On 7 Dec 2022, at 12:13, Andrés de la Peña <adelap...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>> >>>>>>>> The comparator for collections is the lexicographical compare on the >>>>>>>> collection items. That might nor be the more useful thing but it's not >>>>>>>> impossible to imagine cases where that ordering can be useful. To make >>>>>>>> a random example, you can use a list column to store the name and >>>>>>>> surnames of a person, considering that some persons can have multiple >>>>>>>> surnames. You can then sort rows based on that list column, to get the >>>>>>>> names in alphabetical order, or to get the first or last person >>>>>>>> according to that order. I'm sure we can think on more cases were the >>>>>>>> lexicographical order of a list can be useful, although I agree that's >>>>>>>> is not the greatest feature ever. >>>>>>>> >>>>>>>> It's worth mentioning that collections are not the only data types >>>>>>>> where the MIN/MAX functions are of dubious utility. For example, blob >>>>>>>> columns can also be used with MIN/MAX. Same as with collections, the >>>>>>>> min/max blobs are selected according to the comparator for the data >>>>>>>> type. That comparator is the lexicographic compare on the unsigned >>>>>>>> values of the byte contents. >>>>>>>> >>>>>>>> The utility of MIN/MAX on inet and boolean columns isn't very clear >>>>>>>> either, although one can always imagine use cases. Fox example, MAX of >>>>>>>> a boolean column can be used as a logical disjunction. >>>>>>>> >>>>>>>> If we were to special-case MIN/MAX functions to reject collections, we >>>>>>>> should also reject other data types such as, at least, blobs. That >>>>>>>> would require a deprecation plan. >>>>>>>> >>>>>>>> Also, it's not that the comparator used by MIN/MAX is an internal >>>>>>>> obscure thing. The action of that comparator is very visible when any >>>>>>>> of those data types is used in a clustering column, and it's used as >>>>>>>> the basis for "ORDER BY" clauses. Should we also reject blobs, >>>>>>>> collections, tuples and UDTs on "ORDER BY"? I don't think so. >>>>>>>> >>>>>>>> I rather think that basing MIN/MAX on the regular order of the column >>>>>>>> data type is consistent, easy to do and easy to understand. >>>>>>>> >>>>>>>> I don't see the need to add rules explicitly forbidding some data >>>>>>>> types on MIN/MAX functions just because we can't easily figure out a >>>>>>>> use case for their ordering. Especially when we are exposing that same >>>>>>>> ordering on clusterings and "ORDER BY". >>>>>>>> >>>>>>>>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> If the functionality truly has never actually worked, then throwing >>>>>>>>> an error that MAX is not supported for collections seems reasonable. >>>>>>>>> >>>>>>>>> But we should throw an error, I do not think we should have functions >>>>>>>>> that aggregate across rows and functions that operate within a row >>>>>>>>> use the same name. >>>>>>>>> >>>>>>>>> My expectation as a user would be that MAX either always aggregates >>>>>>>>> across rows, so results in a single row of output or always operates >>>>>>>>> within a row, so returns the full set of rows matching the query. >>>>>>>>> >>>>>>>>> So if we want a max that aggregates across rows that works for >>>>>>>>> collections we could change it to return the aggregated max across >>>>>>>>> all rows. Or we just leave it as an error and if someone wants the >>>>>>>>> max across all rows they would ask for MAX(COLLECTION_MAX(column)). >>>>>>>>> Yes I still agree COLLECTION_MAX may be a bad name. >>>>>>>>> >>>>>>>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote: >>>>>>>>> > >>>>>>>>> > As far as I am aware it has never worked in a release, and so >>>>>>>>> > deprecating it is probably not as challenging as you think. Only >>>>>>>>> > folk that have been able to parse the raw bytes of the collection >>>>>>>>> > in storage format would be affected - which we can probably treat >>>>>>>>> > as zero. >>>>>>>>> > >>>>>>>>> > >>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan >>>>>>>>> >> <jeremiah.jor...@gmail.com> wrote: >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >>> >>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates >>>>>>>>> >>> over rows containing collections, returning the collection with >>>>>>>>> >>> the most elements. This is just a nonsensical operation to >>>>>>>>> >>> support IMO. We should decide as a community whether we “fix” >>>>>>>>> >>> this aggregation, or remove it. >>>>>>>>> >> >>>>>>>>> >> The current MAX function does not work this way afaik? It returns >>>>>>>>> >> the row with the column that has the highest value in clustering >>>>>>>>> >> order sense, like if the collection was used as a clustering key. >>>>>>>>> >> While that also may have limited use, I don’t think it worth while >>>>>>>>> >> to deprecate such use and all the headache that comes with doing >>>>>>>>> >> so. >>>>>>>>> >> >>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for >>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, >>>>>>>>> >>> etc, same as for aggregations. >>>>>>>>> >> >>>>>>>>> >> If we actually wanted to move towards using the existing names >>>>>>>>> >> with new meanings, then I think that would take us multiple major >>>>>>>>> >> releases. First deprecate existing use in current releases. Then >>>>>>>>> >> make it an error in the next major release X. Then change the >>>>>>>>> >> behavior in major release X+1. Just switching the behavior >>>>>>>>> >> without having a major where such queries error out would make a >>>>>>>>> >> bunch of user queries start returning “wrong” data. >>>>>>>>> >> Also I don’t think those functions being cross row aggregations >>>>>>>>> >> for some column types, but within row collection operations for >>>>>>>>> >> other types, is any more intuitive, and actually would be more >>>>>>>>> >> confusing. So I am -1 on using the same names. >>>>>>>>> >> >>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to >>>>>>>>> >>> operate over non-collection types when they are explicitly >>>>>>>>> >>> collection variants. >>>>>>>>> >> >>>>>>>>> >> While I could see some point to this, I do not think it would be >>>>>>>>> >> confusing for something named collection_XXX to treat a >>>>>>>>> >> non-collection as a collection of 1. But maybe there is a better >>>>>>>>> >> name for these function. Rather than seeing them as collection >>>>>>>>> >> variants, we should see them as variants that operate on the data >>>>>>>>> >> in a single row, rather than aggregating across multiple rows. >>>>>>>>> >> But even with that perspective I don’t know what the best name >>>>>>>>> >> would be. >>>>>>>>> >> >>>>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote: >>>>>>>>> >>> >>>>>>>>> >>> Thanks Andres, I think community input on direction here will be >>>>>>>>> >>> invaluable. There’s a bunch of interrelated tickets, and my >>>>>>>>> >>> opinions are as follows: >>>>>>>>> >>> >>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates >>>>>>>>> >>> over rows containing collections, returning the collection with >>>>>>>>> >>> the most elements. This is just a nonsensical operation to >>>>>>>>> >>> support IMO. We should decide as a community whether we “fix” >>>>>>>>> >>> this aggregation, or remove it. >>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for >>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, >>>>>>>>> >>> etc, same as for aggregations. >>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to >>>>>>>>> >>> operate over non-collection types when they are explicitly >>>>>>>>> >>> collection variants. >>>>>>>>> >>> >>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains >>>>>>>>> >>> ambiguous, but this could be solved by either providing a >>>>>>>>> >>> separate method for collections (e.g. SIZE) which seems fine to >>>>>>>>> >>> me, or by offering a precedence order for matching and a keyword >>>>>>>>> >>> for overriding the precedence order (e.g. COUNT(collection AS >>>>>>>>> >>> COLLECTION)). >>>>>>>>> >>> >>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this >>>>>>>>> >>> can be solved several ways. >>>>>>>>> >>> - We could permit explicit casts to collection types, that for a >>>>>>>>> >>> collection type would be a no-op, and for a single value would >>>>>>>>> >>> create a collection >>>>>>>>> >>> - With precedence orders, by always selecting the scalar function >>>>>>>>> >>> last >>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce >>>>>>>>> >>> function to resolve multiple values >>>>>>>>> >>> >>>>>>>>> >>> These decisions all imply trade-offs on each other, and affect >>>>>>>>> >>> the evolution of CQL, so I think community input would be helpful. >>>>>>>>> >>> >>>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña >>>>>>>>> >>>>> <adelap...@apache.org> wrote: >>>>>>>>> >>>> >>>>>>>>> >>>> >>>>>>>>> >>>> This will require some long introduction for context: >>>>>>>>> >>>> >>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max >>>>>>>>> >>>> column value according to their comparator. For collections, the >>>>>>>>> >>>> comparison is on the lexicographical order of the collection >>>>>>>>> >>>> elements. That's the very same comparator that is used when >>>>>>>>> >>>> collections are used as clustering keys and for ORDER BY. >>>>>>>>> >>>> >>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make >>>>>>>>> >>>> that the results were presented in their unserialized form, >>>>>>>>> >>>> although the row selection was correct. That bug was recently >>>>>>>>> >>>> solved by CASSANDRA-17811. During that ticket it was also >>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on collection >>>>>>>>> >>>> since applying those functions to collections, since they don't >>>>>>>>> >>>> seem super useful. However, that option was quickly discarded >>>>>>>>> >>>> and the operation was fixed so the MIN/MAX functions correctly >>>>>>>>> >>>> work for every data type. >>>>>>>>> >>>> >>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, >>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can >>>>>>>>> >>>> perform aggregations of the elements of a collection. Those >>>>>>>>> >>>> where named "map_keys", "map_values", "collection_min", >>>>>>>>> >>>> "collection_max", "collection_sum", and "collection_count". >>>>>>>>> >>>> Those are the names mentioned on the mail list thread about >>>>>>>>> >>>> function naming conventions. Despite doing a kind of >>>>>>>>> >>>> within-collection aggregation, these functions are not what we >>>>>>>>> >>>> usually call aggregate functions, since they don't aggregate >>>>>>>>> >>>> multiple rows together. >>>>>>>>> >>>> >>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a >>>>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell >>>>>>>>> >>>> column. However, the new collection functions can be used in >>>>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the >>>>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since >>>>>>>>> >>>> the new functions give a generic way of aggreagting timestamps >>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to >>>>>>>>> >>>> remove that MAXWRITETIME function. >>>>>>>>> >>>> >>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove >>>>>>>>> >>>> that function in favour of the new generic collection functions. >>>>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell >>>>>>>>> >>>> and multi-cell columns, whereas >>>>>>>>> >>>> "COLLECTION_MAX(WRITETIME(column))" would only work on >>>>>>>>> >>>> multi-cell columns, That's because MAXWRITETIME of a >>>>>>>>> >>>> not-multicell column doesn't return a collection, and one should >>>>>>>>> >>>> simply use "WRITETIME(column)" instead. So it was proposed in >>>>>>>>> >>>> CASSANDRA-18037 that collections functions applied to a >>>>>>>>> >>>> not-collection value consider that value as the only element of >>>>>>>>> >>>> a singleton collection. So, for example, COLLECTION_MAX(7) = >>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed >>>>>>>>> >>>> and it's mostly ready to commit. >>>>>>>>> >>>> >>>>>>>>> >>>> Now we can go straight to the point: >>>>>>>>> >>>> >>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate >>>>>>>>> >>>> functions applied to collections, the very same idea that was >>>>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. >>>>>>>>> >>>> That way we could rename the new collection functions >>>>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That >>>>>>>>> >>>> way MIN/MAX/SUM/AVG would be an aggregate function when applied >>>>>>>>> >>>> to not-collection columns, and a scalar function when applied to >>>>>>>>> >>>> collection. We can't do that with COUNT because there would be >>>>>>>>> >>>> an ambiguity, so the proposal for that case is renaming >>>>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not >>>>>>>>> >>>> correctly exposing the proposal. >>>>>>>>> >>>> >>>>>>>>> >>>> I however would prefer to keep aggregate functions working on >>>>>>>>> >>>> collections, and keep the names of the new collection functions >>>>>>>>> >>>> as "COLLECTION_*". Reasons are: >>>>>>>>> >>>> >>>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be >>>>>>>>> >>>> cosidered as breaking backward compatibility and require a >>>>>>>>> >>>> deprecation plan. >>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not >>>>>>>>> >>>> look superuseful, but they make the set of aggregate functions >>>>>>>>> >>>> consistent and applicable to every column type. >>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions >>>>>>>>> >>>> establishes a clear distinction between row aggregations and >>>>>>>>> >>>> collection aggregations, while at the same time exposing the >>>>>>>>> >>>> analogy between each pair of functions. >>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for >>>>>>>>> >>>> workarounds such as using the column type when possible, or >>>>>>>>> >>>> trying to figure out synonyms like in the case of COUNT/SIZE. >>>>>>>>> >>>> Even if that works for this case, future functions can find more >>>>>>>>> >>>> trouble when trying to figure out workarounds to avoid clashing >>>>>>>>> >>>> with existing function names. For example, we might want to add >>>>>>>>> >>>> a SIZE function that gets the size in bytes of any column, or we >>>>>>>>> >>>> might want to add a MAX function that gets the maximum of a set >>>>>>>>> >>>> of columns, etc. And example of the synonym-based approach that >>>>>>>>> >>>> comes to mind is MySQL's MAX and GREATEST functions, where MAX >>>>>>>>> >>>> is for row aggregation and GREATEST is for column aggregation. >>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, >>>>>>>>> >>>> we can't implement Yifan's proposal of making COLLECTION_MAX(7) >>>>>>>>> >>>> = COLLECTION_MAX([7]) = 7, which would be very useful for >>>>>>>>> >>>> combining collection functions with time functions. >>>>>>>>> >>>> >>>>>>>>> >>>> What do others think? What should we do with aggregate functions >>>>>>>>> >>>> on collections, collection functions and MAXWRITETIME? >>>>>>>>> >>