[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uroš Bojanić updated SPARK-48937:
---------------------------------
Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm the expected behaviour of this function when given collated strings, then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, the expression is currently implemented as a pass-through function, which is wrong because it does not provide appropriate collation awareness for non-default delimiters.

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query gives the correct result, regardless of the collation.
{code:java}
{"a":"1","b":"2","c":"3"}{code}

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query gives an *incorrect* result under UTF8_LCASE collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
(A minimal sketch of collation-aware splitting is included right after this description.)

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions and learn more about how they work (a hedged test sketch is included at the end of this message). In addition, look into the possible use cases and implementations of similar functions in other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

The goal of this Jira ticket is to implement the *StringToMap* expression so that it supports the UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks under this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
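To stop being a pass-through, the splitting performed by *StringToMap* has to respect the collation of its delimiter arguments. The following is a minimal, self-contained sketch, not the actual Spark implementation: the object and method names are made up, and UTF8_LCASE is approximated by case-insensitive delimiter matching; a real fix would go through Spark's collation-aware string comparison utilities rather than a regex.
{code:scala}
import java.util.regex.Pattern

// Minimal sketch only: approximates UTF8_LCASE str_to_map splitting with
// case-insensitive delimiter matching. Names are illustrative, not Spark
// internals; the real implementation lives in complexTypeCreator.scala and
// should use Spark's collation-aware string utilities.
object StrToMapLcaseSketch {

  // Split `input` on `delim`, ignoring case (stand-in for UTF8_LCASE matching).
  private def splitLcase(input: String, delim: String): Seq[String] =
    input.split("(?i)" + Pattern.quote(delim), -1).toSeq

  def strToMap(text: String, pairDelim: String, keyValueDelim: String): Map[String, String] =
    splitLcase(text, pairDelim).map { pair =>
      val parts = splitLcase(pair, keyValueDelim)
      // Mirror str_to_map: a pair without a value maps its key to null.
      parts.head -> parts.drop(1).headOption.orNull
    }.toMap

  def main(args: Array[String]): Unit = {
    // Example 2 from the description: 'X'/'Y' should match 'x'/'y' under UTF8_LCASE.
    println(strToMap("ay1xby2xcy3", "X", "Y")) // Map(a -> 1, b -> 2, c -> 3)
  }
}
{code}
This only covers the case-insensitive aspect of UTF8_LCASE; ICU-backed collations would require collation-aware string search rather than a regex, which is why the actual fix should reuse Spark's existing collation support utilities.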
> Fix collation support for the StringToMap expression
> ----------------------------------------------------
>
>                 Key: SPARK-48937
>                 URL: https://issues.apache.org/jira/browse/SPARK-48937
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
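As mentioned in the description, E2E coverage for this behaviour belongs in CollationSQLExpressionsSuite. Below is a hedged sketch of one possible check, not the suite's actual contents: the class name is made up, and it assumes the standard Spark SQL test helpers (QueryTest, SharedSparkSession, checkAnswer) and that an explicit COLLATE clause on the input string is enough to drive the delimiter behaviour.
{code:scala}
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hedged sketch of one possible E2E check; the real coverage should be added
// to CollationSQLExpressionsSuite rather than a standalone suite like this.
class StringToMapCollationSketchSuite extends QueryTest with SharedSparkSession {

  test("str_to_map with UTF8_LCASE-collated delimiters (sketch)") {
    // Example 2 from the description: under UTF8_LCASE the 'X'/'Y' delimiters
    // are expected to match 'x'/'y' in the input string.
    val df = sql("SELECT str_to_map('ay1xby2xcy3' COLLATE UTF8_LCASE, 'X', 'Y')")
    checkAnswer(df, Row(Map("a" -> "1", "b" -> "2", "c" -> "3")))
  }
}
{code}
Whether collating only the first argument should affect how the delimiters are matched is exactly the kind of question the first step of this ticket (confirming the expected behaviour) needs to answer.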