Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Benedict Thu, 08 Dec 2022 10:21:32 -0800

It feels like this is a recurring kind of discussion, and I wonder if there’s 
any value in deciding on a general approach to guide these discussions in 
future? Are we aiming to look like SQL as we evolve, and if so which products 
do we want to be informed by?


I expect we’ll rehash it every API thread otherwise.

> On 8 Dec 2022, at 17:37, Benedict <bened...@apache.org> wrote:
> 
> 
> 
>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>> Yes, a quick search on Google shows some examples:
>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
>> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
>> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html
> 
> We seem to be mixing and matching our databases we use as templates here. 
> Most of the standard examples we use (Postgres, MySQL etc) do not offer this, 
> and do not offer equivalent functions on any of their other collection types.
> 
> If we are to use industry norms as inspiration or justification, subqueries 
> do seem to be the approach taken. So the question is really why we should 
> introduce non-standard variants of fairly atypical functions instead of the 
> more typical approach of subqueries? Because it’s more work, or some other 
> reason? We should be explicit.
> 
> If we aren’t using industry norms as our justification, we should probably 
> jettison the lot as honestly the landscape is a mess in this area.
> 
>> 
>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>> Yes, again we can search for some examples:
>> https://docs.snowflake.com/en/sql-reference/functions/array_size.html
>> https://docs.databricks.com/sql/language-manual/functions/array_size.html
> 
> My point here was that none I could find use ARRAY_COUNT - either ARRAY_SIZE 
> or ARRAY_LENGTH
> 
>>> 3) A map is a collection in C* parlance, but I gather from below you expect 
>>> these methods not to operate on them?
>> Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN 
>> and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG 
>> require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG 
>> functions above require a numeric array. 
> 
> Should collection_max only work on list and set when we can call MAX(map) 
> today (at least I think we can)? Whatever we decide our aggregate behaviour 
> is should be consistent here.
> 
>>> Does ARRAY_MAX operate on single values? If we are to base our decisions on 
>>> norms elsewhere, we should be consistent about it.
>> It doesn't in any the examples above. Those functions aren't a standard so I 
>> don't know if there are others dbs around that support it. In any case, the 
>> fact that we look for inspiration in other databases to minimize surprise 
>> etc. doesn't mean that we have to do exactly the same. After all, CQL is not 
>> SQL and our collections aren't SQL arrays.
>> 
>> Note that the collection functions added by CASSANDRA-8877 don't operate on 
>> single values either. That idea was proposed by Yifan on CASSANDRA-18078 and 
>> it looked good to Francisco and me. The patch is on CASSANDRA-18085, already 
>> reviewed and blocked waiting on the outcome of this discussion.
>> 
>> The new collection functions can do the same as the new MAXWRITE function, 
>> but not only for getting max timestamps, but also min timestamps and min/max 
>> ttls. The missing part is that MAXWRITE can accept both collections and 
>> single elements, so callers don't need to know the type of the column. 
>> That's the motivation behind the idea of doing the same with the collection 
>> functions, so they can entirely replace MAXWRITE.
>> 
>> However I wouldn't be against leaving the collection functions working only 
>> on collections, as originally designed, and as they currently are on trunk. 
>> The question is what we do with MAXWRITETIME. That function is also only on 
>> trunk, and it might be repetitive given the more generic collection 
>> functions. It's also a bit odd that there isn't, for example, a similar 
>> MINTTL function. Maybe we should start a separate discussion thread about 
>> that new function?
> 
> I think we should figure out our overall strategy - these are all pieces of 
> the puzzle IMO. But I guess the above questions seem to come first and will 
> shape this. I would be in favour of some general approach, however, such as 
> either first casting to a collection, or passing an aggregation operator to 
> WRITETIME.
> 
> 
> 
>>> On 8 Dec 2022, at 17:13, Andrés de la Peña <adelap...@apache.org> wrote:
>>> 
>> 
>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>> Yes, a quick search on Google shows some examples:
>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
>> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
>> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html
>> 
>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>> Yes, again we can search for some examples:
>> https://docs.snowflake.com/en/sql-reference/functions/array_size.html
>> https://docs.databricks.com/sql/language-manual/functions/array_size.html
>> 
>>> 3) A map is a collection in C* parlance, but I gather from below you expect 
>>> these methods not to operate on them?
>> Nope, only COLLECTION_COUNT works on sets, lists and maps. COLLECTION_MIN 
>> and COLLECTION_MAX require a set or list. COLLECTION_SUM and COLLECTION_AVG 
>> require a numeric collection, the same way that the ARRAY_SUM and ARRAY_AVG 
>> functions above require a numeric array. 
>> 
>>> Does ARRAY_MAX operate on single values? If we are to base our decisions on 
>>> norms elsewhere, we should be consistent about it.
>> It doesn't in any the examples above. Those functions aren't a standard so I 
>> don't know if there are others dbs around that support it. In any case, the 
>> fact that we look for inspiration in other databases to minimize surprise 
>> etc. doesn't mean that we have to do exactly the same. After all, CQL is not 
>> SQL and our collections aren't SQL arrays.
>> 
>> Note that the collection functions added by CASSANDRA-8877 don't operate on 
>> single values either. That idea was proposed by Yifan on CASSANDRA-18078 and 
>> it looked good to Francisco and me. The patch is on CASSANDRA-18085, already 
>> reviewed and blocked waiting on the outcome of this discussion.
>> 
>> The new collection functions can do the same as the new MAXWRITE function, 
>> but not only for getting max timestamps, but also min timestamps and min/max 
>> ttls. The missing part is that MAXWRITE can accept both collections and 
>> single elements, so callers don't need to know the type of the column. 
>> That's the motivation behind the idea of doing the same with the collection 
>> functions, so they can entirely replace MAXWRITE.
>> 
>> However I wouldn't be against leaving the collection functions working only 
>> on collections, as originally designed, and as they currently are on trunk. 
>> The question is what we do with MAXWRITETIME. That function is also only on 
>> trunk, and it might be repetitive given the more generic collection 
>> functions. It's also a bit odd that there isn't, for example, a similar 
>> MINTTL function. Maybe we should start a separate discussion thread about 
>> that new function?
>> 
>> 
>> 
>>> On Thu, 8 Dec 2022 at 14:21, Benedict <bened...@apache.org> wrote:
>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>>> 3) A map is a collection in C* parlance, but I gather from below you expect 
>>> these methods not to operate on them?
>>> 
>>> Does ARRAY_MAX operate on single values? If we are to base our decisions on 
>>> norms elsewhere, we should be consistent about it.
>>> 
>>> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. Some 
>>> databases also have Set or Map types, such as MySQL’s Set and Postgres’ 
>>> hstore. These databases only support ARRAY_ functions, seemingly, plus 
>>> special MULTISET operators defined by the SQL standard where that data type 
>>> is supported.
>>> 
>>> 
>>> 
>>>>> On 8 Dec 2022, at 12:11, Andrés de la Peña <adelap...@apache.org> wrote:
>>>>> 
>>>> 
>>>> "ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element in a list 
>>>> are not an uncommon practice. You can find them in SparkSQL, Amazon 
>>>> Timestream, Teradata, etc. Since we have what we call collections instead 
>>>> or arrays, it makes sense to call the analogous functions 
>>>> "COLLECTION_MAX", "COLLECTION_MIN", etc.
>>>> 
>>>> As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and "MAP_VALUES" 
>>>> functions to get the keys or the values of a map, so one can feed them to 
>>>> "MAX", "COLLECTION_MAX", etc. That isn't anything too original either, you 
>>>> can find identical functions on SparkSQL for example.
>>>> 
>>>> I find simple utility functions easier to use than subqueries. But we 
>>>> don't have to chose. We can also have subqueries if someone finds the time 
>>>> to work on them.
>>>> 
>>>>> On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev 
>>>>> <dev@cassandra.apache.org> wrote:
>>>>>  I think the semantics of the situation is important here.
>>>>> 
>>>>> Let’s take MAX as our example aggregate function..
>>>>> 
>>>>> We all expect that in a DB context MAX(column) will return the value of 
>>>>> the column with the maximum value. That is the expected semantics of MAX.
>>>>> 
>>>>> The question here is that there are data types that are multi-valued and 
>>>>> there is a desire to apply MAX to the values within the column. I would 
>>>>> expect that this would return the maximum value of the column every row 
>>>>> in the DB.
>>>>> 
>>>>> So if there were a keyword that operated like the Java BiFunction class 
>>>>> where the Function would apply a second function to the column data. For 
>>>>> purposes of this discussion let’s call this Function APPLY.
>>>>> 
>>>>> So APPLY( MAX, column ) would return the maximum value from the column 
>>>>> for each row in the DB.
>>>>> 
>>>>> MAX(APPLY(MAX,column)) would get the maximum value from the column across 
>>>>> all the rows.
>>>>> 
>>>>> Similarly APPLY could be used with other functions MAX(APPLY(MIN,column)) 
>>>>> the largest minimum value from the column across all rows.
>>>>> 
>>>>> These statements make clear semantically what is being asked for.
>>>>> 
>>>>>> On Thu, Dec 8, 2022 at 10:57 AM Benedict <bened...@apache.org> wrote:
>>>>>> I meant unnest, not unwrap.
>>>>>> 
>>>>>>>> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote:
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> I do not think we should have functions that aggregate across rows and 
>>>>>>>> functions that operate within a row use the same name.
>>>>>>> 
>>>>>>> I’m sympathetic to that view for sure. I wouldn’t be too disappointed 
>>>>>>> by that outcome, and SQL engines seem to take a similar approach, 
>>>>>>> however they mostly rely on sub-queries to get around this problem, and 
>>>>>>> the SQL standard introduces UNWRAP for operating on arrays (by 
>>>>>>> translating them into a table), permitting subqueries to aggregate 
>>>>>>> them. It seems to me we have four options:
>>>>>>> 
>>>>>>> 1) introduce functionality similar to UNWRAP and subqueries 
>>>>>>> 2) introduce new syntax to permit operating on collections with the 
>>>>>>> same functions
>>>>>>> 3) permit the same functions to operate on both, with a precedence 
>>>>>>> order, and introduce syntax to permit breaking the precedence order
>>>>>>> 4) introduce new functions
>>>>>>> 
>>>>>>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) AS 
>>>>>>> max_item FROM table
>>>>>>> 
>>>>>>> (2) and (3) might look something like:
>>>>>>> 
>>>>>>> SELECT MAX(list AS COLLECTION) or 
>>>>>>> SELECT MAX(list AS ROWS)
>>>>>>> 
>>>>>>> (4) might look something like we have already, but perhaps with 
>>>>>>> different names
>>>>>>> 
>>>>>>>> The comparator for collections is the lexicographical compare on the 
>>>>>>>> collection items
>>>>>>> 
>>>>>>> This is a fair point, I mistakenly thought it sorted first on size. 
>>>>>>> Even this definition is a little funkier for Map types, where the 
>>>>>>> values of a key may cause something to sort earlier than a map whose 
>>>>>>> next key sorts first. There are multiple potential lexicographical 
>>>>>>> sorts for Maps (i.e., by keys first, then values, or by (key, value) 
>>>>>>> pairs), so this is particularly poorly defined IMO.
>>>>>>> 
>>>>>>> The maximum of a blob type is pretty well defined I think, as are 
>>>>>>> boolean, inetaddress etc. However, even for List or Set collections 
>>>>>>> there’s multiple reasonable functions one could define for maximum, so 
>>>>>>> it would make more sense to me to permit the user to define the 
>>>>>>> comparison as part of the MAX function if we are to offer it. However, 
>>>>>>> with the lexicographical definition we have I am somewhat less 
>>>>>>> concerned for Set and List. Map seems like a real problem though, if we 
>>>>>>> support these operators (which perhaps we do not).
>>>>>>> 
>>>>>>> 
>>>>>>>>> On 7 Dec 2022, at 12:13, Andrés de la Peña <adelap...@apache.org> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> The comparator for collections is the lexicographical compare on the 
>>>>>>>> collection items. That might nor be the more useful thing but it's not 
>>>>>>>> impossible to imagine cases where that ordering can be useful. To make 
>>>>>>>> a random example, you can use a list column to store the name and 
>>>>>>>> surnames of a person, considering that some persons can have multiple 
>>>>>>>> surnames. You can then sort rows based on that list column, to get the 
>>>>>>>> names in alphabetical order, or to get the first or last person 
>>>>>>>> according to that order. I'm sure we can think on more cases were the 
>>>>>>>> lexicographical order of a list can be useful, although I agree that's 
>>>>>>>> is not the greatest feature ever.
>>>>>>>> 
>>>>>>>> It's worth mentioning that collections are not the only data types 
>>>>>>>> where the MIN/MAX functions are of dubious utility. For example, blob 
>>>>>>>> columns can also be used with MIN/MAX. Same as with collections, the 
>>>>>>>> min/max blobs are selected according to the comparator for the data 
>>>>>>>> type. That comparator is the lexicographic compare on the unsigned 
>>>>>>>> values of the byte contents. 
>>>>>>>> 
>>>>>>>> The utility of MIN/MAX on inet and boolean columns isn't very clear 
>>>>>>>> either, although one can always imagine use cases. Fox example, MAX of 
>>>>>>>> a boolean column can be used as a logical disjunction.
>>>>>>>> 
>>>>>>>> If we were to special-case MIN/MAX functions to reject collections, we 
>>>>>>>> should also reject other data types such as, at least, blobs. That 
>>>>>>>> would require a deprecation plan.
>>>>>>>> 
>>>>>>>> Also, it's not that the comparator used by MIN/MAX is an internal 
>>>>>>>> obscure thing. The action of that comparator is very visible when any 
>>>>>>>> of those data types is used in a clustering column, and it's used as 
>>>>>>>> the basis for "ORDER BY" clauses. Should we also reject blobs, 
>>>>>>>> collections, tuples and UDTs on "ORDER BY"? I don't think so.
>>>>>>>> 
>>>>>>>> I rather think that basing MIN/MAX on the regular order of the column 
>>>>>>>> data type is consistent, easy to do and easy to understand. 
>>>>>>>> 
>>>>>>>> I don't see the need to add rules explicitly forbidding some data 
>>>>>>>> types on MIN/MAX functions just because we can't easily figure out a 
>>>>>>>> use case for their ordering. Especially when we are exposing that same 
>>>>>>>> ordering on clusterings and "ORDER BY".
>>>>>>>> 
>>>>>>>>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan <jeremiah.jor...@gmail.com> 
>>>>>>>>> wrote:
>>>>>>>>> If the functionality truly has never actually worked, then throwing 
>>>>>>>>> an error that MAX is not supported for collections seems reasonable.
>>>>>>>>> 
>>>>>>>>> But we should throw an error, I do not think we should have functions 
>>>>>>>>> that aggregate across rows and functions that operate within a row 
>>>>>>>>> use the same name.
>>>>>>>>> 
>>>>>>>>> My expectation as a user would be that MAX either always aggregates 
>>>>>>>>> across rows, so results in a single row of output or always operates 
>>>>>>>>> within a row, so returns the full set of rows matching the query.
>>>>>>>>> 
>>>>>>>>> So if we want a max that aggregates across rows that works for 
>>>>>>>>> collections we could change it to return the aggregated max across 
>>>>>>>>> all rows. Or we just leave it as an error and if someone wants the 
>>>>>>>>> max across all rows they would ask for MAX(COLLECTION_MAX(column)). 
>>>>>>>>> Yes I still agree COLLECTION_MAX may be a bad name.
>>>>>>>>> 
>>>>>>>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> wrote:
>>>>>>>>> > 
>>>>>>>>> > As far as I am aware it has never worked in a release, and so 
>>>>>>>>> > deprecating it is probably not as challenging as you think. Only 
>>>>>>>>> > folk that have been able to parse the raw bytes of the collection 
>>>>>>>>> > in storage format would be affected - which we can probably treat 
>>>>>>>>> > as zero.
>>>>>>>>> > 
>>>>>>>>> > 
>>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan 
>>>>>>>>> >> <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>> >> 
>>>>>>>>> >> 
>>>>>>>>> >>> 
>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates 
>>>>>>>>> >>> over rows containing collections, returning the collection with 
>>>>>>>>> >>> the most elements. This is just a nonsensical operation to 
>>>>>>>>> >>> support IMO. We should decide as a community whether we “fix” 
>>>>>>>>> >>> this aggregation, or remove it.
>>>>>>>>> >> 
>>>>>>>>> >> The current MAX function does not work this way afaik?  It returns 
>>>>>>>>> >> the row with the column that has the highest value in clustering 
>>>>>>>>> >> order sense, like if the collection was used as a clustering key.  
>>>>>>>>> >> While that also may have limited use, I don’t think it worth while 
>>>>>>>>> >> to deprecate such use and all the headache that comes with doing 
>>>>>>>>> >> so.
>>>>>>>>> >> 
>>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, 
>>>>>>>>> >>> etc, same as for aggregations.
>>>>>>>>> >> 
>>>>>>>>> >> If we actually wanted to move towards using the existing names 
>>>>>>>>> >> with new meanings, then I think that would take us multiple major 
>>>>>>>>> >> releases.  First deprecate existing use in current releases.  Then 
>>>>>>>>> >> make it an error in the next major release X.  Then change the 
>>>>>>>>> >> behavior in major release X+1.  Just switching the behavior 
>>>>>>>>> >> without having a major where such queries error out would make a 
>>>>>>>>> >> bunch of user queries start returning “wrong” data.
>>>>>>>>> >> Also I don’t think those functions being cross row aggregations 
>>>>>>>>> >> for some column types, but within row collection operations for 
>>>>>>>>> >> other types, is any more intuitive, and actually would be more 
>>>>>>>>> >> confusing.  So I am -1 on using the same names.
>>>>>>>>> >> 
>>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>>>>>>>> >>> operate over non-collection types when they are explicitly 
>>>>>>>>> >>> collection variants.
>>>>>>>>> >> 
>>>>>>>>> >> While I could see some point to this, I do not think it would be 
>>>>>>>>> >> confusing for something named collection_XXX to treat a 
>>>>>>>>> >> non-collection as a collection of 1.  But maybe there is a better 
>>>>>>>>> >> name for these function.  Rather than seeing them as collection 
>>>>>>>>> >> variants, we should see them as variants that operate on the data 
>>>>>>>>> >> in a single row, rather than aggregating across multiple rows.  
>>>>>>>>> >> But even with that perspective I don’t know what the best name 
>>>>>>>>> >> would be.
>>>>>>>>> >> 
>>>>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict <bened...@apache.org> wrote:
>>>>>>>>> >>> 
>>>>>>>>> >>> Thanks Andres, I think community input on direction here will be 
>>>>>>>>> >>> invaluable. There’s a bunch of interrelated tickets, and my 
>>>>>>>>> >>> opinions are as follows:
>>>>>>>>> >>> 
>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that operates 
>>>>>>>>> >>> over rows containing collections, returning the collection with 
>>>>>>>>> >>> the most elements. This is just a nonsensical operation to 
>>>>>>>>> >>> support IMO. We should decide as a community whether we “fix” 
>>>>>>>>> >>> this aggregation, or remove it.
>>>>>>>>> >>> 2. I think “collection_" prefixed methods are non-intuitive for 
>>>>>>>>> >>> discovery, and all-else equal it would be better to use MAX,MIN, 
>>>>>>>>> >>> etc, same as for aggregations.
>>>>>>>>> >>> 3. I think it is peculiar to permit methods named collection_ to 
>>>>>>>>> >>> operate over non-collection types when they are explicitly 
>>>>>>>>> >>> collection variants.
>>>>>>>>> >>> 
>>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which remains 
>>>>>>>>> >>> ambiguous, but this could be solved by either providing a 
>>>>>>>>> >>> separate method for collections (e.g. SIZE) which seems fine to 
>>>>>>>>> >>> me, or by offering a precedence order for matching and a keyword 
>>>>>>>>> >>> for overriding the precedence order (e.g. COUNT(collection AS 
>>>>>>>>> >>> COLLECTION)).
>>>>>>>>> >>> 
>>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I think this 
>>>>>>>>> >>> can be solved several ways. 
>>>>>>>>> >>> - We could permit explicit casts to collection types, that for a 
>>>>>>>>> >>> collection type would be a no-op, and for a single value would 
>>>>>>>>> >>> create a collection
>>>>>>>>> >>> - With precedence orders, by always selecting the scalar function 
>>>>>>>>> >>> last
>>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator reduce 
>>>>>>>>> >>> function to resolve multiple values
>>>>>>>>> >>> 
>>>>>>>>> >>> These decisions all imply trade-offs on each other, and affect 
>>>>>>>>> >>> the evolution of CQL, so I think community input would be helpful.
>>>>>>>>> >>> 
>>>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
>>>>>>>>> >>>>> <adelap...@apache.org> wrote:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> 
>>>>>>>>> >>>> This will require some long introduction for context:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with min/max 
>>>>>>>>> >>>> column value according to their comparator. For collections, the 
>>>>>>>>> >>>> comparison is on the lexicographical order of the collection 
>>>>>>>>> >>>> elements. That's the very same comparator that is used when 
>>>>>>>>> >>>> collections are used as clustering keys and for ORDER BY.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used to make 
>>>>>>>>> >>>> that the results were presented in their unserialized form, 
>>>>>>>>> >>>> although the row selection was correct. That bug was recently 
>>>>>>>>> >>>> solved by CASSANDRA-17811. During that ticket it was also 
>>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on collection 
>>>>>>>>> >>>> since applying those functions to collections, since they don't 
>>>>>>>>> >>>> seem super useful. However, that option was quickly discarded 
>>>>>>>>> >>>> and the operation was fixed so the MIN/MAX functions correctly 
>>>>>>>>> >>>> work for every data type.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that can 
>>>>>>>>> >>>> perform aggregations of the elements of a collection. Those 
>>>>>>>>> >>>> where named "map_keys", "map_values", "collection_min", 
>>>>>>>>> >>>> "collection_max", "collection_sum", and "collection_count". 
>>>>>>>>> >>>> Those are the names mentioned on the mail list thread about 
>>>>>>>>> >>>> function naming conventions. Despite doing a kind of 
>>>>>>>>> >>>> within-collection aggregation, these functions are not what we 
>>>>>>>>> >>>> usually call aggregate functions, since they don't aggregate 
>>>>>>>>> >>>> multiple rows together.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to trunk a 
>>>>>>>>> >>>> MAXWRITETIME function to get the max timestamp of a multi-cell 
>>>>>>>>> >>>> column. However, the new collection functions can be used in 
>>>>>>>>> >>>> combination with the WRITETIME and TTL functions to retrieve the 
>>>>>>>>> >>>> min/max/sum/avg timestamp or ttl of a multi-cell column. Since 
>>>>>>>>> >>>> the new functions give a generic way of aggreagting timestamps 
>>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed to 
>>>>>>>>> >>>> remove that MAXWRITETIME function.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed to remove 
>>>>>>>>> >>>> that function in favour of the new generic collection functions. 
>>>>>>>>> >>>> However, the MAXWRITETIME function can work on both single-cell 
>>>>>>>>> >>>> and multi-cell columns, whereas 
>>>>>>>>> >>>> "COLLECTION_MAX(WRITETIME(column))" would only work on 
>>>>>>>>> >>>> multi-cell columns, That's because MAXWRITETIME of a 
>>>>>>>>> >>>> not-multicell column doesn't return a collection, and one should 
>>>>>>>>> >>>> simply use "WRITETIME(column)" instead. So it was proposed in 
>>>>>>>>> >>>> CASSANDRA-18037 that collections functions applied to a 
>>>>>>>>> >>>> not-collection value consider that value as the only element of 
>>>>>>>>> >>>> a singleton collection. So, for example, COLLECTION_MAX(7) = 
>>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been reviewed 
>>>>>>>>> >>>> and it's mostly ready to commit.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Now we can go straight to the point:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating aggregate 
>>>>>>>>> >>>> functions applied to collections, the very same idea that was 
>>>>>>>>> >>>> mentioned on CASSANDRA-17811 description almost four months ago. 
>>>>>>>>> >>>> That way we could rename the new collection functions 
>>>>>>>>> >>>> MIN/MAX/SUM/AVG, same as the classic aggregate functions. That 
>>>>>>>>> >>>> way MIN/MAX/SUM/AVG would be an aggregate function when applied 
>>>>>>>>> >>>> to not-collection columns, and a scalar function when applied to 
>>>>>>>>> >>>> collection. We can't do that with COUNT because there would be 
>>>>>>>>> >>>> an ambiguity, so the proposal for that case is renaming 
>>>>>>>>> >>>> COLLECTION_COUNT to SIZE. Benedict, please correct me if I'm not 
>>>>>>>>> >>>> correctly exposing the proposal.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> I however would prefer to keep aggregate functions working on 
>>>>>>>>> >>>> collections, and keep the names of the new collection functions 
>>>>>>>>> >>>> as "COLLECTION_*". Reasons are:
>>>>>>>>> >>>> 
>>>>>>>>> >>>> 1 - Making aggregate functions not work on collections might be 
>>>>>>>>> >>>> cosidered as breaking backward compatibility and require a 
>>>>>>>>> >>>> deprecation plan. 
>>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections might not 
>>>>>>>>> >>>> look superuseful, but they make the set of aggregate functions 
>>>>>>>>> >>>> consistent and applicable to every column type.
>>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection functions 
>>>>>>>>> >>>> establishes a clear distinction between row aggregations and 
>>>>>>>>> >>>> collection aggregations, while at the same time exposing the 
>>>>>>>>> >>>> analogy between each pair of functions.
>>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to search for 
>>>>>>>>> >>>> workarounds such as using the column type when possible, or 
>>>>>>>>> >>>> trying to figure out synonyms like in the case of COUNT/SIZE. 
>>>>>>>>> >>>> Even if that works for this case, future functions can find more 
>>>>>>>>> >>>> trouble when trying to figure out workarounds to avoid clashing 
>>>>>>>>> >>>> with existing function names. For example, we might want to add 
>>>>>>>>> >>>> a SIZE function that gets the size in bytes of any column, or we 
>>>>>>>>> >>>> might want to add a MAX function that gets the maximum of a set 
>>>>>>>>> >>>> of columns, etc. And example of the synonym-based approach that 
>>>>>>>>> >>>> comes to mind is MySQL's MAX and GREATEST functions, where MAX 
>>>>>>>>> >>>> is for row aggregation and GREATEST is for column aggregation.
>>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column type, 
>>>>>>>>> >>>> we can't implement Yifan's proposal of making COLLECTION_MAX(7) 
>>>>>>>>> >>>> = COLLECTION_MAX([7]) = 7, which would be very useful for 
>>>>>>>>> >>>> combining collection functions with time functions.
>>>>>>>>> >>>> 
>>>>>>>>> >>>> What do others think? What should we do with aggregate functions 
>>>>>>>>> >>>> on collections, collection functions and MAXWRITETIME?
>>>>>>>>> >>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Reply via email to