[ https://issues.apache.org/jira/browse/SPARK-48937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uroš Bojanić updated SPARK-48937:
---------------------------------
Description: Enable collation support for the *StringToMap* built-in string function in Spark ({*}str_to_map{*}). First confirm the expected behaviour of this function when given collated strings, then move on to implementation and testing. You will find this expression in the *complexTypeCreator.scala* file. However, the expression is currently implemented as a pass-through function, which is wrong because it does not provide appropriate collation awareness for non-default delimiters.

Example 1.
{code:java}
SELECT str_to_map('a:1,b:2,c:3', ',', ':');{code}
This query gives the correct result, regardless of the collation.
{code:java}
{"a":"1","b":"2","c":"3"}{code}

Example 2.
{code:java}
SELECT str_to_map('ay1xby2xcy3', 'X', 'Y');{code}
This query gives an *incorrect* result under UTF8_LCASE collation. The correct result should be:
{code:java}
{"a":"1","b":"2","c":"3"}{code}
(A minimal sketch of collation-aware splitting is included right after this description.)

Update the corresponding E2E SQL tests (CollationSQLExpressionsSuite) to reflect how this function should be used with collation in Spark SQL, and feel free to use your chosen Spark SQL editor to experiment with the existing functions and learn more about how they work (a hedged test sketch is included at the end of this message). In addition, look into the possible use cases and implementations of similar functions in other open-source DBMSs, such as [PostgreSQL|https://www.postgresql.org/docs/].

The goal of this Jira ticket is to implement the *StringToMap* expression so that it supports the UTF8_BINARY and UTF8_LCASE collations (i.e. StringTypeBinaryLcase). To understand what changes were introduced to enable full collation support for other existing functions in Spark, take a look at the related Spark PRs and Jira tickets for completed tasks under this parent (for example: https://issues.apache.org/jira/browse/SPARK-47414).

Read more about ICU [Collation Concepts|http://example.com/] and the [Collator|http://example.com/] class. Also, refer to the Unicode Technical Standard for string [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
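To stop being a pass-through, the splitting performed by *StringToMap* has to respect the collation of its delimiter arguments. The following is a minimal, self-contained sketch, not the actual Spark implementation: the object and method names are made up, and UTF8_LCASE is approximated by case-insensitive delimiter matching; a real fix would go through Spark's collation-aware string comparison utilities rather than a regex.
{code:scala}
import java.util.regex.Pattern

// Minimal sketch only: approximates UTF8_LCASE str_to_map splitting with
// case-insensitive delimiter matching. Names are illustrative, not Spark
// internals; the real implementation lives in complexTypeCreator.scala and
// should use Spark's collation-aware string utilities.
object StrToMapLcaseSketch {

  // Split `input` on `delim`, ignoring case (stand-in for UTF8_LCASE matching).
  private def splitLcase(input: String, delim: String): Seq[String] =
    input.split("(?i)" + Pattern.quote(delim), -1).toSeq

  def strToMap(text: String, pairDelim: String, keyValueDelim: String): Map[String, String] =
    splitLcase(text, pairDelim).map { pair =>
      val parts = splitLcase(pair, keyValueDelim)
      // Mirror str_to_map: a pair without a value maps its key to null.
      parts.head -> parts.drop(1).headOption.orNull
    }.toMap

  def main(args: Array[String]): Unit = {
    // Example 2 from the description: 'X'/'Y' should match 'x'/'y' under UTF8_LCASE.
    println(strToMap("ay1xby2xcy3", "X", "Y")) // Map(a -> 1, b -> 2, c -> 3)
  }
}
{code}
This only covers the case-insensitive aspect of UTF8_LCASE; ICU-backed collations would require collation-aware string search rather than a regex, which is why the actual fix should reuse Spark's existing collation support utilities.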
> Fix collation support for the StringToMap expression
> ----------------------------------------------------
>
>                 Key: SPARK-48937
>                 URL: https://issues.apache.org/jira/browse/SPARK-48937
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
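As mentioned in the description, E2E coverage for this behaviour belongs in CollationSQLExpressionsSuite. Below is a hedged sketch of one possible check, not the suite's actual contents: the class name is made up, and it assumes the standard Spark SQL test helpers (QueryTest, SharedSparkSession, checkAnswer) and that an explicit COLLATE clause on the input string is enough to drive the delimiter behaviour.
{code:scala}
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hedged sketch of one possible E2E check; the real coverage should be added
// to CollationSQLExpressionsSuite rather than a standalone suite like this.
class StringToMapCollationSketchSuite extends QueryTest with SharedSparkSession {

  test("str_to_map with UTF8_LCASE-collated delimiters (sketch)") {
    // Example 2 from the description: under UTF8_LCASE the 'X'/'Y' delimiters
    // are expected to match 'x'/'y' in the input string.
    val df = sql("SELECT str_to_map('ay1xby2xcy3' COLLATE UTF8_LCASE, 'X', 'Y')")
    checkAnswer(df, Row(Map("a" -> "1", "b" -> "2", "c" -> "3")))
  }
}
{code}
Whether collating only the first argument should affect how the delimiters are matched is exactly the kind of question the first step of this ticket (confirming the expected behaviour) needs to answer.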