Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Josh McKenzie Fri, 09 Dec 2022 07:40:54 -0800

> I wonder if there’s any value in deciding on a general approach to guide 
> these discussions in future? Are we aiming to look like SQL as we evolve, and 
> if so which products do we want to be informed by?
CQL's proximity to SQL is a blessing and a curse IMO. How close do we expect 
ourselves to get to SQL functionality and support over time?


To me, the answer to that question informs whether we position ourselves 
against SQL syntax or against NoSQL / k/v bespoke big data syntax assuming 
we're thinking of this in binary terms.

Or maybe there's a more nuanced approach - are there things where we're 
functionally similar to SQL where it makes sense to have an industry syntax 
reference we model after and things where we're partitioned big-data NoSQL 
where it makes sense to have a different reference?

On Fri, Dec 9, 2022, at 9:04 AM, Benedict wrote:
> 
> Why would that require a fancy query optimiser? It doesn’t seem like it would 
> be much more difficult than modifying our function binding rules and having a 
> specialised version.
> 
> 
> 
> 
>> On 9 Dec 2022, at 13:26, Andrés de la Peña <adelap...@apache.org> wrote:
>> 
>> Note that specialized collection functions are also an opportunity for 
>> optimization. For example, COLLECTION_COUNT reads only the first bytes of a 
>> serialized collection, since those bytes contain the number of elements in 
>> that collection. The most simple implementation of COUNT(UNNEST(collection)) 
>> wouldn't do that. It would probably require to deserialize the entire 
>> collection. A fancy query optimizer could internally translate 
>> COUNT(UNNEST(collection)) to COLLECTION_COUNT(collection) to get a nice 
>> performance improvement. Unfortunately we don't have such optimizer at the 
>> moment.
>> 
>> I don't see a reason not to leave those collection functions as they are, 
>> renaming aside. They can be later complemented by the generic UNNEST, or 
>> subqueries, when someone is willing to work on those features.
>> 
>> It's not clear to me how UNNEST + UDA would operate on maps. We would still 
>> need a way to extract the keys or the values of maps, like the current 
>> MAP_KEYS and MAP_VALUES functions, wouldn't we?
>> 
>> As for the impossibility of applying COLLECTION_MAX, COLLECTION_MIN, etc. to 
>> maps, I wouldn't be against renaming those to SET_MAX, LIST_MAX, SET_MIN, 
>> SET_MAX, etc. Sets and lists have many things in common and it's a pity that 
>> we don't have a common name for them. This lack of a common name for lists 
>> and sets is something that permeates into the code at multiple places. This 
>> kind of problems are probably the reason why Java's maps aren't collections.
>> 
>> 
>> 
>> 
>> 
>> On Fri, 9 Dec 2022 at 11:26, Benedict <bened...@apache.org> wrote:
>>> 
>>> Right, this is basically my view - it can be syntactic sugar for UNNEST 
>>> subqueries as and when we offer those (if not now), but I think we should 
>>> be able to apply any UDA or aggregate to collections with some syntax 
>>> that’s ergonomic.
>>> 
>>> I don’t think APPLY is the right way to express it, my version was 
>>> MAX(column AS COLLECTION) which means bind this operator to the collection 
>>> rather than the rows (and since this is CAST-like, I’d say this is also a 
>>> reasonable way to apply aggregations to single values too)
>>> 
>>> But perhaps there’s some other third option. Or, if not, let’s simply 
>>> support UNNEST subqueries.
>>> 
>>> 
>>>> On 9 Dec 2022, at 11:19, Claude Warren, Jr via dev 
>>>> <dev@cassandra.apache.org> wrote:
>>>> 
>>>> I still think that semantically it makes sense to have a function that 
>>>> applies an aggregate to various collection types.  So rather than building 
>>>> ARRAY_MAX do APPLY(MAX, column)) or APPLY(MAX(column)) it is clear what is 
>>>> being requested and APPLY can be the source of truth for which aggregate 
>>>> functions work on which column types.
>>>> 
>>>> On Fri, Dec 9, 2022 at 10:28 AM Andrés de la Peña <adelap...@apache.org> 
>>>> wrote:
>>>>> Indeed this discussion is useful now that we know that there is 
>>>>> dissension about these changes. However, as those changes were happening 
>>>>> none of the persons involved on them felt the need of a discuss thread, 
>>>>> and I opened this thread as soon as Benedict objected the changes. I 
>>>>> think this is perfectly in line with ASF's wise policy about lazy 
>>>>> consensus: https://community.apache.org/committers/lazyConsensus.html
>>>>> 
>>>>>> My point here was that none I could find use ARRAY_COUNT - either 
>>>>>> ARRAY_SIZE or ARRAY_LENGTH
>>>>> 
>>>>> A quick search on Google shows:
>>>>> https://www.ibm.com/docs/en/psfa/7.2.1?topic=functions-array-count
>>>>> https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/arrayfun.html#fn-array-count
>>>>> https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ARRAY_COUNT.htm
>>>>> 
>>>>>> We seem to be mixing and matching our databases we use as templates 
>>>>>> here. Most of the standard examples we use (Postgres, MySQL etc) do not 
>>>>>> offer this, and do not offer equivalent functions on any of their other 
>>>>>> collection types.
>>>>> 
>>>>> I don't know what's wrong with those databases, nor what makes MySQL and 
>>>>> Postgres more standard than others. AFAIK MySQL doesn't even have an 
>>>>> array type, so it hardly is going to have array functions. Postgres 
>>>>> however does have arrays. Those arrays can manipulated with both 
>>>>> subqueries and an unnest function. However, this doesn't impede Postgres 
>>>>> to also have a set of array functions: 
>>>>> https://www.postgresql.org/docs/12/functions-array.html
>>>>> 
>>>>> It seems that it is difficult to find a DB supporting arrays that doesn't 
>>>>> also offer an assorted set of array functions. DBs can perfectly support 
>>>>> subqueries, unnesting and utility functions at the same time. For 
>>>>> example, with Postgres you can get the size of an array with a subquery, 
>>>>> or with UNNEST, or with the ARRAY_LENGTH function.
>>>>> 
>>>>> The collection functions that we discuss here are mostly analogous to 
>>>>> those sets of functions that we find anywhere. They are a quite small, 
>>>>> well encapsulated and not invasive feature that has the advantage of 
>>>>> being already done. Those functions don't seem to impede us to add 
>>>>> support for subqueries or unnesting whenever someone wants to work on 
>>>>> them.
>>>>> 
>>>>> Adding subqueries on a later stage wouldn't involve the deprecation of 
>>>>> the collection functions, since those are still useful as a shortcut, as 
>>>>> we see in other DBs out there.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, 8 Dec 2022 at 20:47, J. D. Jordan <jeremiah.jor...@gmail.com> 
>>>>> wrote:
>>>>>> 
>>>>>> I think this thread proves the point that a DISCUSS thread for API 
>>>>>> changes on dev@ will get more viewpoints than just having something in 
>>>>>> JIRA. I think this thread has been useful and should result in us having 
>>>>>> a better user facing API than without it.
>>>>>> 
>>>>>>> On Dec 8, 2022, at 1:57 PM, Andrés de la Peña <adelap...@apache.org> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I expect we’ll rehash it every API thread otherwise.
>>>>>>> 
>>>>>>> Since you bring up the topic, I understand that opposing to every 
>>>>>>> single reviewed decision that has been taken on CASSANDRA-17811, 
>>>>>>> CASSANDRA-8877, CASSANDRA-17425 and CASSANDRA-18085 could make an 
>>>>>>> argument in favour of the policy demanding a DISCUSS thread for every 
>>>>>>> new feature, big or small. One could argue that any change to the work 
>>>>>>> on this area that comes out from the current discussion could have been 
>>>>>>> prevented by a previous DISCUSS thread. 
>>>>>>> 
>>>>>>> However, those tickets have been around for long in a very public way, 
>>>>>>> and there hasn't been any controversy around them until now. So I think 
>>>>>>> that an initial thread wouldn't have prevented that anyone can 
>>>>>>> resurrect the discussion at any point in time if they haven't put the 
>>>>>>> time to look into the changes before. We have already seen those 
>>>>>>> after-commit discussions even for discussed, voted, approved and 
>>>>>>> reviewed CEPs. The only thing we would get is two discussion threads 
>>>>>>> instead of one. By the looks of it, I doubt that the suggested policy 
>>>>>>> about discuss threads is going to be accepted. 
>>>>>>> 
>>>>>>> In any case, this is a separate topic from what we're discussing here. 
>>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, 8 Dec 2022 at 18:21, Benedict <bened...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> It feels like this is a recurring kind of discussion, and I wonder if 
>>>>>>>> there’s any value in deciding on a general approach to guide these 
>>>>>>>> discussions in future? Are we aiming to look like SQL as we evolve, 
>>>>>>>> and if so which products do we want to be informed by?
>>>>>>>> 
>>>>>>>> I expect we’ll rehash it every API thread otherwise.
>>>>>>>> 
>>>>>>>>> On 8 Dec 2022, at 17:37, Benedict <bened...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>>>>>>>>>> Yes, a quick search on Google shows some examples:
>>>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbvYZ59Ye$>
>>>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd*rn8g__;fg!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbjOQZPkg$>
>>>>>>>>>> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbgmn9ZWa$>
>>>>>>>>>> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbr5lPV0O$>
>>>>>>>>> 
>>>>>>>>> We seem to be mixing and matching our databases we use as templates 
>>>>>>>>> here. Most of the standard examples we use (Postgres, MySQL etc) do 
>>>>>>>>> not offer this, and do not offer equivalent functions on any of their 
>>>>>>>>> other collection types.
>>>>>>>>> 
>>>>>>>>> If we are to use industry norms as inspiration or justification, 
>>>>>>>>> subqueries do seem to be the approach taken. So the question is 
>>>>>>>>> really why we should introduce non-standard variants of fairly 
>>>>>>>>> atypical functions instead of the more typical approach of 
>>>>>>>>> subqueries? Because it’s more work, or some other reason? We should 
>>>>>>>>> be explicit.
>>>>>>>>> 
>>>>>>>>> If we aren’t using industry norms as our justification, we should 
>>>>>>>>> probably jettison the lot as honestly the landscape is a mess in this 
>>>>>>>>> area.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>>>>>>>>>> Yes, again we can search for some examples:
>>>>>>>>>> https://docs.snowflake.com/en/sql-reference/functions/array_size.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.snowflake.com/en/sql-reference/functions/array_size.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbpinUdPC$>
>>>>>>>>>> https://docs.databricks.com/sql/language-manual/functions/array_size.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.databricks.com/sql/language-manual/functions/array_size.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbvVWWVdN$>
>>>>>>>>> 
>>>>>>>>> My point here was that none I could find use ARRAY_COUNT - either 
>>>>>>>>> ARRAY_SIZE or ARRAY_LENGTH
>>>>>>>>> 
>>>>>>>>>>> 3) A map is a collection in C* parlance, but I gather from below 
>>>>>>>>>>> you expect these methods not to operate on them?
>>>>>>>>>> Nope, only COLLECTION_COUNT works on sets, lists and maps. 
>>>>>>>>>> COLLECTION_MIN and COLLECTION_MAX require a set or list. 
>>>>>>>>>> COLLECTION_SUM and COLLECTION_AVG require a numeric collection, the 
>>>>>>>>>> same way that the ARRAY_SUM and ARRAY_AVG functions above require a 
>>>>>>>>>> numeric array. 
>>>>>>>>> 
>>>>>>>>> Should collection_max only work on list and set when we can call 
>>>>>>>>> MAX(map) today (at least I think we can)? Whatever we decide our 
>>>>>>>>> aggregate behaviour is should be consistent here.
>>>>>>>>> 
>>>>>>>>>>> Does ARRAY_MAX operate on single values? If we are to base our 
>>>>>>>>>>> decisions on norms elsewhere, we should be consistent about it.
>>>>>>>>>> It doesn't in any the examples above. Those functions aren't a 
>>>>>>>>>> standard so I don't know if there are others dbs around that support 
>>>>>>>>>> it. In any case, the fact that we look for inspiration in other 
>>>>>>>>>> databases to minimize surprise etc. doesn't mean that we have to do 
>>>>>>>>>> exactly the same. After all, CQL is not SQL and our collections 
>>>>>>>>>> aren't SQL arrays.
>>>>>>>>>> 
>>>>>>>>>> Note that the collection functions added by CASSANDRA-8877 don't 
>>>>>>>>>> operate on single values either. That idea was proposed by Yifan on 
>>>>>>>>>> CASSANDRA-18078 and it looked good to Francisco and me. The patch is 
>>>>>>>>>> on CASSANDRA-18085, already reviewed and blocked waiting on the 
>>>>>>>>>> outcome of this discussion.
>>>>>>>>>> 
>>>>>>>>>> The new collection functions can do the same as the new MAXWRITE 
>>>>>>>>>> function, but not only for getting max timestamps, but also min 
>>>>>>>>>> timestamps and min/max ttls. The missing part is that MAXWRITE can 
>>>>>>>>>> accept both collections and single elements, so callers don't need 
>>>>>>>>>> to know the type of the column. That's the motivation behind the 
>>>>>>>>>> idea of doing the same with the collection functions, so they can 
>>>>>>>>>> entirely replace MAXWRITE.
>>>>>>>>>> 
>>>>>>>>>> However I wouldn't be against leaving the collection functions 
>>>>>>>>>> working only on collections, as originally designed, and as they 
>>>>>>>>>> currently are on trunk. The question is what we do with 
>>>>>>>>>> MAXWRITETIME. That function is also only on trunk, and it might be 
>>>>>>>>>> repetitive given the more generic collection functions. It's also a 
>>>>>>>>>> bit odd that there isn't, for example, a similar MINTTL function. 
>>>>>>>>>> Maybe we should start a separate discussion thread about that new 
>>>>>>>>>> function?
>>>>>>>>> 
>>>>>>>>> I think we should figure out our overall strategy - these are all 
>>>>>>>>> pieces of the puzzle IMO. But I guess the above questions seem to 
>>>>>>>>> come first and will shape this. I would be in favour of some general 
>>>>>>>>> approach, however, such as either first casting to a collection, or 
>>>>>>>>> passing an aggregation operator to WRITETIME.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 8 Dec 2022, at 17:13, Andrés de la Peña <adelap...@apache.org> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>>>>>>>>>> Yes, a quick search on Google shows some examples:
>>>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbvYZ59Ye$>
>>>>>>>>>> https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd~rn8g
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.teradata.com/r/kmuOwjp1zEYg98JsB8fu_A/gxz1nB7GclxNO5mBd*rn8g__;fg!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbjOQZPkg$>
>>>>>>>>>> https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.upsolver.com/sqlake/functions-and-operators-reference/array-functions/array_sum__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbgmn9ZWa$>
>>>>>>>>>> https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.firebolt.io/sql-reference/functions-reference/array-sum.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbr5lPV0O$>
>>>>>>>>>> 
>>>>>>>>>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>>>>>>>>>> Yes, again we can search for some examples:
>>>>>>>>>> https://docs.snowflake.com/en/sql-reference/functions/array_size.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.snowflake.com/en/sql-reference/functions/array_size.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbpinUdPC$>
>>>>>>>>>> https://docs.databricks.com/sql/language-manual/functions/array_size.html
>>>>>>>>>>  
>>>>>>>>>> <https://urldefense.com/v3/__https://docs.databricks.com/sql/language-manual/functions/array_size.html__;!!PbtH5S7Ebw!ZMa3Xj1FsU-kdDY-UdxDvJkrH48eMgWNcW6wi3nlJmuchoModeBwKK5smBIo0KLNCKoaTNQNO31kbvVWWVdN$>
>>>>>>>>>> 
>>>>>>>>>>> 3) A map is a collection in C* parlance, but I gather from below 
>>>>>>>>>>> you expect these methods not to operate on them?
>>>>>>>>>> Nope, only COLLECTION_COUNT works on sets, lists and maps. 
>>>>>>>>>> COLLECTION_MIN and COLLECTION_MAX require a set or list. 
>>>>>>>>>> COLLECTION_SUM and COLLECTION_AVG require a numeric collection, the 
>>>>>>>>>> same way that the ARRAY_SUM and ARRAY_AVG functions above require a 
>>>>>>>>>> numeric array. 
>>>>>>>>>> 
>>>>>>>>>>> Does ARRAY_MAX operate on single values? If we are to base our 
>>>>>>>>>>> decisions on norms elsewhere, we should be consistent about it.
>>>>>>>>>> It doesn't in any the examples above. Those functions aren't a 
>>>>>>>>>> standard so I don't know if there are others dbs around that support 
>>>>>>>>>> it. In any case, the fact that we look for inspiration in other 
>>>>>>>>>> databases to minimize surprise etc. doesn't mean that we have to do 
>>>>>>>>>> exactly the same. After all, CQL is not SQL and our collections 
>>>>>>>>>> aren't SQL arrays.
>>>>>>>>>> 
>>>>>>>>>> Note that the collection functions added by CASSANDRA-8877 don't 
>>>>>>>>>> operate on single values either. That idea was proposed by Yifan on 
>>>>>>>>>> CASSANDRA-18078 and it looked good to Francisco and me. The patch is 
>>>>>>>>>> on CASSANDRA-18085, already reviewed and blocked waiting on the 
>>>>>>>>>> outcome of this discussion.
>>>>>>>>>> 
>>>>>>>>>> The new collection functions can do the same as the new MAXWRITE 
>>>>>>>>>> function, but not only for getting max timestamps, but also min 
>>>>>>>>>> timestamps and min/max ttls. The missing part is that MAXWRITE can 
>>>>>>>>>> accept both collections and single elements, so callers don't need 
>>>>>>>>>> to know the type of the column. That's the motivation behind the 
>>>>>>>>>> idea of doing the same with the collection functions, so they can 
>>>>>>>>>> entirely replace MAXWRITE.
>>>>>>>>>> 
>>>>>>>>>> However I wouldn't be against leaving the collection functions 
>>>>>>>>>> working only on collections, as originally designed, and as they 
>>>>>>>>>> currently are on trunk. The question is what we do with 
>>>>>>>>>> MAXWRITETIME. That function is also only on trunk, and it might be 
>>>>>>>>>> repetitive given the more generic collection functions. It's also a 
>>>>>>>>>> bit odd that there isn't, for example, a similar MINTTL function. 
>>>>>>>>>> Maybe we should start a separate discussion thread about that new 
>>>>>>>>>> function?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, 8 Dec 2022 at 14:21, Benedict <bened...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 1) Do they offer ARRAY_SUM or ARRAY_AVG?
>>>>>>>>>>> 2) Do they define ARRAY_COUNT or ARRAY_LENGTH?
>>>>>>>>>>> 3) A map is a collection in C* parlance, but I gather from below 
>>>>>>>>>>> you expect these methods not to operate on them?
>>>>>>>>>>> 
>>>>>>>>>>> Does ARRAY_MAX operate on single values? If we are to base our 
>>>>>>>>>>> decisions on norms elsewhere, we should be consistent about it.
>>>>>>>>>>> 
>>>>>>>>>>> It’s worth noting that ARRAY is an ISO SQL concept, as is MULTISET. 
>>>>>>>>>>> Some databases also have Set or Map types, such as MySQL’s Set and 
>>>>>>>>>>> Postgres’ hstore. These databases only support ARRAY_ functions, 
>>>>>>>>>>> seemingly, plus special MULTISET operators defined by the SQL 
>>>>>>>>>>> standard where that data type is supported.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 8 Dec 2022, at 12:11, Andrés de la Peña <adelap...@apache.org> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> "ARRAY_MAX" and "ARRAY_MIN" functions to get the max/min element 
>>>>>>>>>>>> in a list are not an uncommon practice. You can find them in 
>>>>>>>>>>>> SparkSQL, Amazon Timestream, Teradata, etc. Since we have what we 
>>>>>>>>>>>> call collections instead or arrays, it makes sense to call the 
>>>>>>>>>>>> analogous functions "COLLECTION_MAX", "COLLECTION_MIN", etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> As for maps, CASSANDRA-8877 also introduced "MAP_KEYS" and 
>>>>>>>>>>>> "MAP_VALUES" functions to get the keys or the values of a map, so 
>>>>>>>>>>>> one can feed them to "MAX", "COLLECTION_MAX", etc. That isn't 
>>>>>>>>>>>> anything too original either, you can find identical functions on 
>>>>>>>>>>>> SparkSQL for example.
>>>>>>>>>>>> 
>>>>>>>>>>>> I find simple utility functions easier to use than subqueries. But 
>>>>>>>>>>>> we don't have to chose. We can also have subqueries if someone 
>>>>>>>>>>>> finds the time to work on them.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, 8 Dec 2022 at 12:04, Claude Warren, Jr via dev 
>>>>>>>>>>>> <dev@cassandra.apache.org> wrote:
>>>>>>>>>>>>>  I think the semantics of the situation is important here.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Let’s take MAX as our example aggregate function..
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We all expect that in a DB context MAX(column) will return the 
>>>>>>>>>>>>> value of the column with the maximum value. That is the expected 
>>>>>>>>>>>>> semantics of MAX.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The question here is that there are data types that are 
>>>>>>>>>>>>> multi-valued and there is a desire to apply MAX to the values 
>>>>>>>>>>>>> within the column. I would expect that this would return the 
>>>>>>>>>>>>> maximum value of the column every row in the DB.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So if there were a keyword that operated like the Java BiFunction 
>>>>>>>>>>>>> class where the Function would apply a second function to the 
>>>>>>>>>>>>> column data. For purposes of this discussion let’s call this 
>>>>>>>>>>>>> Function APPLY.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So APPLY( MAX, column ) would return the maximum value from the 
>>>>>>>>>>>>> column for each row in the DB.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> MAX(APPLY(MAX,column)) would get the maximum value from the 
>>>>>>>>>>>>> column across all the rows.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Similarly APPLY could be used with other functions 
>>>>>>>>>>>>> MAX(APPLY(MIN,column)) the largest minimum value from the column 
>>>>>>>>>>>>> across all rows.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> These statements make clear semantically what is being asked for.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Dec 8, 2022 at 10:57 AM Benedict <bened...@apache.org> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I meant unnest, not unwrap.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 8 Dec 2022, at 10:34, Benedict <bened...@apache.org> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I do not think we should have functions that aggregate across 
>>>>>>>>>>>>>>>> rows and functions that operate within a row use the same name.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I’m sympathetic to that view for sure. I wouldn’t be too 
>>>>>>>>>>>>>>> disappointed by that outcome, and SQL engines seem to take a 
>>>>>>>>>>>>>>> similar approach, however they mostly rely on sub-queries to 
>>>>>>>>>>>>>>> get around this problem, and the SQL standard introduces UNWRAP 
>>>>>>>>>>>>>>> for operating on arrays (by translating them into a table), 
>>>>>>>>>>>>>>> permitting subqueries to aggregate them. It seems to me we have 
>>>>>>>>>>>>>>> four options:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1) introduce functionality similar to UNWRAP and subqueries 
>>>>>>>>>>>>>>> 2) introduce new syntax to permit operating on collections with 
>>>>>>>>>>>>>>> the same functions
>>>>>>>>>>>>>>> 3) permit the same functions to operate on both, with a 
>>>>>>>>>>>>>>> precedence order, and introduce syntax to permit breaking the 
>>>>>>>>>>>>>>> precedence order
>>>>>>>>>>>>>>> 4) introduce new functions
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (1) might look like SELECT (SELECT MAX(item) FROM UNWRAP(list)) 
>>>>>>>>>>>>>>> AS max_item FROM table
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (2) and (3) might look something like:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> SELECT MAX(list AS COLLECTION) or 
>>>>>>>>>>>>>>> SELECT MAX(list AS ROWS)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (4) might look something like we have already, but perhaps with 
>>>>>>>>>>>>>>> different names
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The comparator for collections is the lexicographical compare 
>>>>>>>>>>>>>>>> on the collection items
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> This is a fair point, I mistakenly thought it sorted first on 
>>>>>>>>>>>>>>> size. Even this definition is a little funkier for Map types, 
>>>>>>>>>>>>>>> where the values of a key may cause something to sort earlier 
>>>>>>>>>>>>>>> than a map whose next key sorts first. There are multiple 
>>>>>>>>>>>>>>> potential lexicographical sorts for Maps (i.e., by keys first, 
>>>>>>>>>>>>>>> then values, or by (key, value) pairs), so this is particularly 
>>>>>>>>>>>>>>> poorly defined IMO.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The maximum of a blob type is pretty well defined I think, as 
>>>>>>>>>>>>>>> are boolean, inetaddress etc. However, even for List or Set 
>>>>>>>>>>>>>>> collections there’s multiple reasonable functions one could 
>>>>>>>>>>>>>>> define for maximum, so it would make more sense to me to permit 
>>>>>>>>>>>>>>> the user to define the comparison as part of the MAX function 
>>>>>>>>>>>>>>> if we are to offer it. However, with the lexicographical 
>>>>>>>>>>>>>>> definition we have I am somewhat less concerned for Set and 
>>>>>>>>>>>>>>> List. Map seems like a real problem though, if we support these 
>>>>>>>>>>>>>>> operators (which perhaps we do not).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 7 Dec 2022, at 12:13, Andrés de la Peña 
>>>>>>>>>>>>>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The comparator for collections is the lexicographical compare 
>>>>>>>>>>>>>>>> on the collection items. That might nor be the more useful 
>>>>>>>>>>>>>>>> thing but it's not impossible to imagine cases where that 
>>>>>>>>>>>>>>>> ordering can be useful. To make a random example, you can use 
>>>>>>>>>>>>>>>> a list column to store the name and surnames of a person, 
>>>>>>>>>>>>>>>> considering that some persons can have multiple surnames. You 
>>>>>>>>>>>>>>>> can then sort rows based on that list column, to get the names 
>>>>>>>>>>>>>>>> in alphabetical order, or to get the first or last person 
>>>>>>>>>>>>>>>> according to that order. I'm sure we can think on more cases 
>>>>>>>>>>>>>>>> were the lexicographical order of a list can be useful, 
>>>>>>>>>>>>>>>> although I agree that's is not the greatest feature ever.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> It's worth mentioning that collections are not the only data 
>>>>>>>>>>>>>>>> types where the MIN/MAX functions are of dubious utility. For 
>>>>>>>>>>>>>>>> example, blob columns can also be used with MIN/MAX. Same as 
>>>>>>>>>>>>>>>> with collections, the min/max blobs are selected according to 
>>>>>>>>>>>>>>>> the comparator for the data type. That comparator is the 
>>>>>>>>>>>>>>>> lexicographic compare on the unsigned values of the byte 
>>>>>>>>>>>>>>>> contents. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The utility of MIN/MAX on inet and boolean columns isn't very 
>>>>>>>>>>>>>>>> clear either, although one can always imagine use cases. Fox 
>>>>>>>>>>>>>>>> example, MAX of a boolean column can be used as a logical 
>>>>>>>>>>>>>>>> disjunction.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> If we were to special-case MIN/MAX functions to reject 
>>>>>>>>>>>>>>>> collections, we should also reject other data types such as, 
>>>>>>>>>>>>>>>> at least, blobs. That would require a deprecation plan.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Also, it's not that the comparator used by MIN/MAX is an 
>>>>>>>>>>>>>>>> internal obscure thing. The action of that comparator is very 
>>>>>>>>>>>>>>>> visible when any of those data types is used in a clustering 
>>>>>>>>>>>>>>>> column, and it's used as the basis for "ORDER BY" clauses. 
>>>>>>>>>>>>>>>> Should we also reject blobs, collections, tuples and UDTs on 
>>>>>>>>>>>>>>>> "ORDER BY"? I don't think so.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I rather think that basing MIN/MAX on the regular order of the 
>>>>>>>>>>>>>>>> column data type is consistent, easy to do and easy to 
>>>>>>>>>>>>>>>> understand. 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I don't see the need to add rules explicitly forbidding some 
>>>>>>>>>>>>>>>> data types on MIN/MAX functions just because we can't easily 
>>>>>>>>>>>>>>>> figure out a use case for their ordering. Especially when we 
>>>>>>>>>>>>>>>> are exposing that same ordering on clusterings and "ORDER BY".
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, 6 Dec 2022 at 18:56, J. D. Jordan 
>>>>>>>>>>>>>>>> <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> If the functionality truly has never actually worked, then 
>>>>>>>>>>>>>>>>> throwing an error that MAX is not supported for collections 
>>>>>>>>>>>>>>>>> seems reasonable.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> But we should throw an error, I do not think we should have 
>>>>>>>>>>>>>>>>> functions that aggregate across rows and functions that 
>>>>>>>>>>>>>>>>> operate within a row use the same name.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> My expectation as a user would be that MAX either always 
>>>>>>>>>>>>>>>>> aggregates across rows, so results in a single row of output 
>>>>>>>>>>>>>>>>> or always operates within a row, so returns the full set of 
>>>>>>>>>>>>>>>>> rows matching the query.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> So if we want a max that aggregates across rows that works 
>>>>>>>>>>>>>>>>> for collections we could change it to return the aggregated 
>>>>>>>>>>>>>>>>> max across all rows. Or we just leave it as an error and if 
>>>>>>>>>>>>>>>>> someone wants the max across all rows they would ask for 
>>>>>>>>>>>>>>>>> MAX(COLLECTION_MAX(column)). Yes I still agree COLLECTION_MAX 
>>>>>>>>>>>>>>>>> may be a bad name.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> > On Dec 6, 2022, at 11:55 AM, Benedict <bened...@apache.org> 
>>>>>>>>>>>>>>>>> > wrote:
>>>>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>>>>> > As far as I am aware it has never worked in a release, and 
>>>>>>>>>>>>>>>>> > so deprecating it is probably not as challenging as you 
>>>>>>>>>>>>>>>>> > think. Only folk that have been able to parse the raw bytes 
>>>>>>>>>>>>>>>>> > of the collection in storage format would be affected - 
>>>>>>>>>>>>>>>>> > which we can probably treat as zero.
>>>>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>>>>> > 
>>>>>>>>>>>>>>>>> >> On 6 Dec 2022, at 17:31, Jeremiah D Jordan 
>>>>>>>>>>>>>>>>> >> <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that 
>>>>>>>>>>>>>>>>> >>> operates over rows containing collections, returning the 
>>>>>>>>>>>>>>>>> >>> collection with the most elements. This is just a 
>>>>>>>>>>>>>>>>> >>> nonsensical operation to support IMO. We should decide as 
>>>>>>>>>>>>>>>>> >>> a community whether we “fix” this aggregation, or remove 
>>>>>>>>>>>>>>>>> >>> it.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >> The current MAX function does not work this way afaik?  It 
>>>>>>>>>>>>>>>>> >> returns the row with the column that has the highest value 
>>>>>>>>>>>>>>>>> >> in clustering order sense, like if the collection was used 
>>>>>>>>>>>>>>>>> >> as a clustering key.  While that also may have limited 
>>>>>>>>>>>>>>>>> >> use, I don’t think it worth while to deprecate such use 
>>>>>>>>>>>>>>>>> >> and all the headache that comes with doing so.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >>> 2. I think “collection_" prefixed methods are 
>>>>>>>>>>>>>>>>> >>> non-intuitive for discovery, and all-else equal it would 
>>>>>>>>>>>>>>>>> >>> be better to use MAX,MIN, etc, same as for aggregations.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >> If we actually wanted to move towards using the existing 
>>>>>>>>>>>>>>>>> >> names with new meanings, then I think that would take us 
>>>>>>>>>>>>>>>>> >> multiple major releases.  First deprecate existing use in 
>>>>>>>>>>>>>>>>> >> current releases.  Then make it an error in the next major 
>>>>>>>>>>>>>>>>> >> release X.  Then change the behavior in major release X+1. 
>>>>>>>>>>>>>>>>> >>  Just switching the behavior without having a major where 
>>>>>>>>>>>>>>>>> >> such queries error out would make a bunch of user queries 
>>>>>>>>>>>>>>>>> >> start returning “wrong” data.
>>>>>>>>>>>>>>>>> >> Also I don’t think those functions being cross row 
>>>>>>>>>>>>>>>>> >> aggregations for some column types, but within row 
>>>>>>>>>>>>>>>>> >> collection operations for other types, is any more 
>>>>>>>>>>>>>>>>> >> intuitive, and actually would be more confusing.  So I am 
>>>>>>>>>>>>>>>>> >> -1 on using the same names.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >>> 3. I think it is peculiar to permit methods named 
>>>>>>>>>>>>>>>>> >>> collection_ to operate over non-collection types when 
>>>>>>>>>>>>>>>>> >>> they are explicitly collection variants.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >> While I could see some point to this, I do not think it 
>>>>>>>>>>>>>>>>> >> would be confusing for something named collection_XXX to 
>>>>>>>>>>>>>>>>> >> treat a non-collection as a collection of 1.  But maybe 
>>>>>>>>>>>>>>>>> >> there is a better name for these function.  Rather than 
>>>>>>>>>>>>>>>>> >> seeing them as collection variants, we should see them as 
>>>>>>>>>>>>>>>>> >> variants that operate on the data in a single row, rather 
>>>>>>>>>>>>>>>>> >> than aggregating across multiple rows.  But even with that 
>>>>>>>>>>>>>>>>> >> perspective I don’t know what the best name would be.
>>>>>>>>>>>>>>>>> >> 
>>>>>>>>>>>>>>>>> >>>> On Dec 6, 2022, at 7:30 AM, Benedict 
>>>>>>>>>>>>>>>>> >>>> <bened...@apache.org> wrote:
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> Thanks Andres, I think community input on direction here 
>>>>>>>>>>>>>>>>> >>> will be invaluable. There’s a bunch of interrelated 
>>>>>>>>>>>>>>>>> >>> tickets, and my opinions are as follows:
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> 1. I think it is a mistake to offer a function MAX that 
>>>>>>>>>>>>>>>>> >>> operates over rows containing collections, returning the 
>>>>>>>>>>>>>>>>> >>> collection with the most elements. This is just a 
>>>>>>>>>>>>>>>>> >>> nonsensical operation to support IMO. We should decide as 
>>>>>>>>>>>>>>>>> >>> a community whether we “fix” this aggregation, or remove 
>>>>>>>>>>>>>>>>> >>> it.
>>>>>>>>>>>>>>>>> >>> 2. I think “collection_" prefixed methods are 
>>>>>>>>>>>>>>>>> >>> non-intuitive for discovery, and all-else equal it would 
>>>>>>>>>>>>>>>>> >>> be better to use MAX,MIN, etc, same as for aggregations.
>>>>>>>>>>>>>>>>> >>> 3. I think it is peculiar to permit methods named 
>>>>>>>>>>>>>>>>> >>> collection_ to operate over non-collection types when 
>>>>>>>>>>>>>>>>> >>> they are explicitly collection variants.
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> Given (1), (2) becomes simple except for COUNT which 
>>>>>>>>>>>>>>>>> >>> remains ambiguous, but this could be solved by either 
>>>>>>>>>>>>>>>>> >>> providing a separate method for collections (e.g. SIZE) 
>>>>>>>>>>>>>>>>> >>> which seems fine to me, or by offering a precedence order 
>>>>>>>>>>>>>>>>> >>> for matching and a keyword for overriding the precedence 
>>>>>>>>>>>>>>>>> >>> order (e.g. COUNT(collection AS COLLECTION)).
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> Given (2), (3) is a little more difficult. However, I 
>>>>>>>>>>>>>>>>> >>> think this can be solved several ways. 
>>>>>>>>>>>>>>>>> >>> - We could permit explicit casts to collection types, 
>>>>>>>>>>>>>>>>> >>> that for a collection type would be a no-op, and for a 
>>>>>>>>>>>>>>>>> >>> single value would create a collection
>>>>>>>>>>>>>>>>> >>> - With precedence orders, by always selecting the scalar 
>>>>>>>>>>>>>>>>> >>> function last
>>>>>>>>>>>>>>>>> >>> - By permitting WRITETIME to accept a binary operator 
>>>>>>>>>>>>>>>>> >>> reduce function to resolve multiple values
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>> These decisions all imply trade-offs on each other, and 
>>>>>>>>>>>>>>>>> >>> affect the evolution of CQL, so I think community input 
>>>>>>>>>>>>>>>>> >>> would be helpful.
>>>>>>>>>>>>>>>>> >>> 
>>>>>>>>>>>>>>>>> >>>>> On 6 Dec 2022, at 12:44, Andrés de la Peña 
>>>>>>>>>>>>>>>>> >>>>> <adelap...@apache.org> wrote:
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> This will require some long introduction for context:
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> The MAX/MIN functions aggregate rows to get the row with 
>>>>>>>>>>>>>>>>> >>>> min/max column value according to their comparator. For 
>>>>>>>>>>>>>>>>> >>>> collections, the comparison is on the lexicographical 
>>>>>>>>>>>>>>>>> >>>> order of the collection elements. That's the very same 
>>>>>>>>>>>>>>>>> >>>> comparator that is used when collections are used as 
>>>>>>>>>>>>>>>>> >>>> clustering keys and for ORDER BY.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> However, a bug in the MIN/MAX aggregate functions used 
>>>>>>>>>>>>>>>>> >>>> to make that the results were presented in their 
>>>>>>>>>>>>>>>>> >>>> unserialized form, although the row selection was 
>>>>>>>>>>>>>>>>> >>>> correct. That bug was recently solved by 
>>>>>>>>>>>>>>>>> >>>> CASSANDRA-17811. During that ticket it was also 
>>>>>>>>>>>>>>>>> >>>> considered the option of simply disabling MIN/MAX on 
>>>>>>>>>>>>>>>>> >>>> collection since applying those functions to 
>>>>>>>>>>>>>>>>> >>>> collections, since they don't seem super useful. 
>>>>>>>>>>>>>>>>> >>>> However, that option was quickly discarded and the 
>>>>>>>>>>>>>>>>> >>>> operation was fixed so the MIN/MAX functions correctly 
>>>>>>>>>>>>>>>>> >>>> work for every data type.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> As a byproduct of the internal improvements of that fix, 
>>>>>>>>>>>>>>>>> >>>> CASSANDRA-8877 introduced a new set of functions that 
>>>>>>>>>>>>>>>>> >>>> can perform aggregations of the elements of a 
>>>>>>>>>>>>>>>>> >>>> collection. Those where named "map_keys", "map_values", 
>>>>>>>>>>>>>>>>> >>>> "collection_min", "collection_max", "collection_sum", 
>>>>>>>>>>>>>>>>> >>>> and "collection_count". Those are the names mentioned on 
>>>>>>>>>>>>>>>>> >>>> the mail list thread about function naming conventions. 
>>>>>>>>>>>>>>>>> >>>> Despite doing a kind of within-collection aggregation, 
>>>>>>>>>>>>>>>>> >>>> these functions are not what we usually call aggregate 
>>>>>>>>>>>>>>>>> >>>> functions, since they don't aggregate multiple rows 
>>>>>>>>>>>>>>>>> >>>> together.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> On a different line of work, CASSANDRA-17425 added to 
>>>>>>>>>>>>>>>>> >>>> trunk a MAXWRITETIME function to get the max timestamp 
>>>>>>>>>>>>>>>>> >>>> of a multi-cell column. However, the new collection 
>>>>>>>>>>>>>>>>> >>>> functions can be used in combination with the WRITETIME 
>>>>>>>>>>>>>>>>> >>>> and TTL functions to retrieve the min/max/sum/avg 
>>>>>>>>>>>>>>>>> >>>> timestamp or ttl of a multi-cell column. Since the new 
>>>>>>>>>>>>>>>>> >>>> functions give a generic way of aggreagting timestamps 
>>>>>>>>>>>>>>>>> >>>> ant TTLs of multi-cell columns, CASSANDRA-18078 proposed 
>>>>>>>>>>>>>>>>> >>>> to remove that MAXWRITETIME function.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> Yifan Cai, author of the MAXWRITETIME function, agreed 
>>>>>>>>>>>>>>>>> >>>> to remove that function in favour of the new generic 
>>>>>>>>>>>>>>>>> >>>> collection functions. However, the MAXWRITETIME function 
>>>>>>>>>>>>>>>>> >>>> can work on both single-cell and multi-cell columns, 
>>>>>>>>>>>>>>>>> >>>> whereas "COLLECTION_MAX(WRITETIME(column))" would only 
>>>>>>>>>>>>>>>>> >>>> work on multi-cell columns, That's because MAXWRITETIME 
>>>>>>>>>>>>>>>>> >>>> of a not-multicell column doesn't return a collection, 
>>>>>>>>>>>>>>>>> >>>> and one should simply use "WRITETIME(column)" instead. 
>>>>>>>>>>>>>>>>> >>>> So it was proposed in CASSANDRA-18037 that collections 
>>>>>>>>>>>>>>>>> >>>> functions applied to a not-collection value consider 
>>>>>>>>>>>>>>>>> >>>> that value as the only element of a singleton 
>>>>>>>>>>>>>>>>> >>>> collection. So, for example, COLLECTION_MAX(7) = 
>>>>>>>>>>>>>>>>> >>>> COLLECTION_MAX([7]) = 7. That ticket has already been 
>>>>>>>>>>>>>>>>> >>>> reviewed and it's mostly ready to commit.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> Now we can go straight to the point:
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> Recently Benedict brought back the idea of deprecating 
>>>>>>>>>>>>>>>>> >>>> aggregate functions applied to collections, the very 
>>>>>>>>>>>>>>>>> >>>> same idea that was mentioned on CASSANDRA-17811 
>>>>>>>>>>>>>>>>> >>>> description almost four months ago. That way we could 
>>>>>>>>>>>>>>>>> >>>> rename the new collection functions MIN/MAX/SUM/AVG, 
>>>>>>>>>>>>>>>>> >>>> same as the classic aggregate functions. That way 
>>>>>>>>>>>>>>>>> >>>> MIN/MAX/SUM/AVG would be an aggregate function when 
>>>>>>>>>>>>>>>>> >>>> applied to not-collection columns, and a scalar function 
>>>>>>>>>>>>>>>>> >>>> when applied to collection. We can't do that with COUNT 
>>>>>>>>>>>>>>>>> >>>> because there would be an ambiguity, so the proposal for 
>>>>>>>>>>>>>>>>> >>>> that case is renaming COLLECTION_COUNT to SIZE. 
>>>>>>>>>>>>>>>>> >>>> Benedict, please correct me if I'm not correctly 
>>>>>>>>>>>>>>>>> >>>> exposing the proposal.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> I however would prefer to keep aggregate functions 
>>>>>>>>>>>>>>>>> >>>> working on collections, and keep the names of the new 
>>>>>>>>>>>>>>>>> >>>> collection functions as "COLLECTION_*". Reasons are:
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> 1 - Making aggregate functions not work on collections 
>>>>>>>>>>>>>>>>> >>>> might be cosidered as breaking backward compatibility 
>>>>>>>>>>>>>>>>> >>>> and require a deprecation plan. 
>>>>>>>>>>>>>>>>> >>>> 2 - Keeping aggregate functions working on collections 
>>>>>>>>>>>>>>>>> >>>> might not look superuseful, but they make the set of 
>>>>>>>>>>>>>>>>> >>>> aggregate functions consistent and applicable to every 
>>>>>>>>>>>>>>>>> >>>> column type.
>>>>>>>>>>>>>>>>> >>>> 3 - Using the "COLLECTION_" prefix on collection 
>>>>>>>>>>>>>>>>> >>>> functions establishes a clear distinction between row 
>>>>>>>>>>>>>>>>> >>>> aggregations and collection aggregations, while at the 
>>>>>>>>>>>>>>>>> >>>> same time exposing the analogy between each pair of 
>>>>>>>>>>>>>>>>> >>>> functions.
>>>>>>>>>>>>>>>>> >>>> 4 - Not using the "COLLECTION_" prefix forces us to 
>>>>>>>>>>>>>>>>> >>>> search for workarounds such as using the column type 
>>>>>>>>>>>>>>>>> >>>> when possible, or trying to figure out synonyms like in 
>>>>>>>>>>>>>>>>> >>>> the case of COUNT/SIZE. Even if that works for this 
>>>>>>>>>>>>>>>>> >>>> case, future functions can find more trouble when trying 
>>>>>>>>>>>>>>>>> >>>> to figure out workarounds to avoid clashing with 
>>>>>>>>>>>>>>>>> >>>> existing function names. For example, we might want to 
>>>>>>>>>>>>>>>>> >>>> add a SIZE function that gets the size in bytes of any 
>>>>>>>>>>>>>>>>> >>>> column, or we might want to add a MAX function that gets 
>>>>>>>>>>>>>>>>> >>>> the maximum of a set of columns, etc. And example of the 
>>>>>>>>>>>>>>>>> >>>> synonym-based approach that comes to mind is MySQL's MAX 
>>>>>>>>>>>>>>>>> >>>> and GREATEST functions, where MAX is for row aggregation 
>>>>>>>>>>>>>>>>> >>>> and GREATEST is for column aggregation.
>>>>>>>>>>>>>>>>> >>>> 5 - If MIN/MAX function selection is based on the column 
>>>>>>>>>>>>>>>>> >>>> type, we can't implement Yifan's proposal of making 
>>>>>>>>>>>>>>>>> >>>> COLLECTION_MAX(7) = COLLECTION_MAX([7]) = 7, which would 
>>>>>>>>>>>>>>>>> >>>> be very useful for combining collection functions with 
>>>>>>>>>>>>>>>>> >>>> time functions.
>>>>>>>>>>>>>>>>> >>>> 
>>>>>>>>>>>>>>>>> >>>> What do others think? What should we do with aggregate 
>>>>>>>>>>>>>>>>> >>>> functions on collections, collection functions and 
>>>>>>>>>>>>>>>>> >>>> MAXWRITETIME?
>>>>>>>>>>>>>>>>> >>

Re: Aggregate functions on collections, collection functions and MAXWRITETIME

Reply via email to