[jira] [Commented] (SPARK-47353) Mode (all collations)

Gideon P (Jira) Sun, 05 May 2024 09:13:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843574#comment-17843574
 ]


Gideon P commented on SPARK-47353:
----------------------------------

[~uros-db] Currently, Mode uses an accumulating OpenHashMap to count the 
occurrences of each unique key. 

With collations, however, multiple distinct keys may compare as equal, so a 
plain hash map will not work: keys that are equal under the collation need to 
be counted as the same key. 
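
To illustrate the problem, here is a minimal sketch that counts by exact key 
equality, using a plain Scala HashMap as a stand-in for OpenHashMap. The 
object and method names are hypothetical and not part of Spark.

```scala
import scala.collection.mutable

// Hypothetical illustration, not Spark code: a plain hash map keyed on the
// raw string keeps separate counts for keys that a collation treats as equal.
object HashMapCountSketch {
  // Counts occurrences using exact (binary) key equality.
  def binaryCounts(values: Seq[String]): Map[String, Long] = {
    val counts = mutable.HashMap.empty[String, Long]
    values.foreach(v => counts(v) = counts.getOrElse(v, 0L) + 1L)
    counts.toMap
  }
}
```

For the values from the ticket ('a' x3, 'B' x2, 'b' x2), 'B' and 'b' end up 
in separate entries with 2 each, so 'a' wins with 3 even though a 
case-insensitive collation would give 'B'/'b' a combined count of 4.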

A few approaches to handle collations come to mind:
1. Modify `Mode.eval` to combine the map's entries further, perhaps by turning 
the map into a list of key-value tuples and folding over it: if the last 
element of the accumulated list and the current element being folded are equal 
according to the collation, combine their counts. 
2. Alternatively, modify `Mode.eval` to add all the elements of the buffer to 
a TreeMap with a collation-sensitive Comparator. A TreeMap can efficiently 
keep track of keys and their counts in sorted order. 
3. Use a TreeMap instead of OpenHashMap during the accumulation stage. Create a 
trait similar to TypedAggregateWithHashMapAsBuffer, which would implement 
TypedImperativeAggregate, and switch to it whenever the column's data type is 
StringType and a session collation is in use. 
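
A rough sketch of the folding idea in approach 1, under some assumptions: the 
entries are sorted by the comparator first (comparing only with the last 
accumulated element works only if equal keys are adjacent), and 
`String.CASE_INSENSITIVE_ORDER` stands in for a real collation comparator. 
All names here are hypothetical.

```scala
import java.util.Comparator

// Hypothetical sketch of approach 1, not Spark code.
object FoldMergeSketch {
  // Sorts the entries by the comparator, then folds, merging an entry into
  // the last accumulated one when the comparator says the keys are equal.
  def mergeByCollation(
      counts: Map[String, Long],
      cmp: Comparator[String]): List[(String, Long)] = {
    val sorted = counts.toList.sortWith((x, y) => cmp.compare(x._1, y._1) < 0)
    sorted.foldLeft(List.empty[(String, Long)]) {
      // The accumulator is built by prepending, so its head is the last entry.
      case ((lastKey, lastCount) :: rest, (key, count))
          if cmp.compare(lastKey, key) == 0 =>
        (lastKey, lastCount + count) :: rest
      case (acc, entry) => entry :: acc
    }
  }

  // Picks the key with the highest merged count, if there is one.
  def mode(merged: List[(String, Long)]): Option[String] =
    if (merged.isEmpty) None else Some(merged.maxBy(_._2)._1)
}
```

Which representative key survives a merge ('B' or 'b') depends on the input 
order, which matches the ticket's "either 'B' or 'b'" expectation.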

To start, I will try approach number 2.
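
A minimal sketch of what approach 2 could look like, assuming the buffer has 
already been reduced to a map of key to count. The object and method names are 
hypothetical, and `String.CASE_INSENSITIVE_ORDER` in the usage note is only a 
stand-in for a collation-aware comparator.

```scala
import java.util.{Comparator, TreeMap}

// Hypothetical sketch of approach 2, not Spark code.
object TreeMapMergeSketch {
  // Re-inserts every (key, count) pair into a TreeMap ordered by the given
  // comparator, so keys that compare as equal share one entry, then returns
  // the entry with the highest combined count.
  def modeWithComparator(
      counts: Map[String, Long],
      cmp: Comparator[String]): Option[(String, Long)] = {
    val merged = new TreeMap[String, Long](cmp)
    counts.foreach { case (key, count) =>
      // put() keeps the key that is already stored and replaces only the value.
      merged.put(key, merged.getOrDefault(key, 0L) + count)
    }
    var best: Option[(String, Long)] = None
    val it = merged.entrySet().iterator()
    while (it.hasNext) {
      val e = it.next()
      if (best.forall(_._2 < e.getValue)) best = Some((e.getKey, e.getValue))
    }
    best
  }
}
```

Called with the counts 'a' -> 3, 'B' -> 2, 'b' -> 2 and 
`String.CASE_INSENSITIVE_ORDER`, this yields a combined count of 4 for 
whichever of 'B' or 'b' was inserted first.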

Please let me know if I am on the right track and if you have any ideas! 

> Mode (all collations)
> ---------------------
>
>                 Key: SPARK-47353
>                 URL: https://issues.apache.org/jira/browse/SPARK-47353
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
>
> Enable collation support for the *Mode* expression in Spark. First confirm 
> what is the expected behaviour for this expression when given collated 
> strings, then move on to the implementation that would enable handling 
> strings of all collation types. Implement the corresponding unit tests and 
> E2E SQL tests to reflect how this function should be used with collation in 
> SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *Mode* expression so it 
> supports all collation types currently supported in Spark. To understand what 
> changes were introduced in order to enable full collation support for other 
> existing functions in Spark, take a look at the Spark PRs and Jira tickets 
> for completed tasks in this parent (for example: Contains, StartsWith, 
> EndsWith).
> Examples:
> With UTF8_BINARY collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') 
> AS tab(col);
> should return 'a'.
> With UTF8_BINARY_LCASE collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b') 
> AS tab(col);
> should return either 'B' or 'b'.
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

