[ 
https://issues.apache.org/jira/browse/SPARK-47477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47477:
---------------------------------
    Description: 
Enable collation support for the *StringInstr* and *FindInSet* built-in string 
functions in Spark. First confirm what the expected behaviour is for these 
functions when given collated strings, and then move on to implementation and 
testing. One way to go about this is to consider using _StringSearch_, an 
efficient ICU service for string matching. Implement the corresponding unit 
tests (CollationStringExpressionsSuite) and E2E tests (CollationSuite) to 
reflect how these functions should be used with collation in SparkSQL, and feel 
free to use your chosen Spark SQL Editor to experiment with the existing 
functions to learn more about how they work. In addition, look into the 
possible use-cases and implementation of similar functions within other 
open-source DBMS, such as [PostgreSQL|https://www.postgresql.org/docs/].
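
As a starting point for confirming the expected behaviour, here is a minimal sketch of a collation-aware instr. It is not Spark's implementation and it does not use ICU _StringSearch_: to stay dependency-free it swaps in the JDK's java.text.Collator. The class name CollatedInstr and the helper instr are hypothetical. It assumes the usual instr contract (1-based position of the first match, 0 when there is none) and only compares windows of the same length as the search string, so matches whose length changes under the collation are missed.

```java
import java.text.Collator;
import java.util.Locale;

class CollatedInstr {
    // Hypothetical helper: 1-based position of the first window of `str`
    // that compares equal to `sub` under `collator`, or 0 if none does.
    static int instr(String str, String sub, Collator collator) {
        int n = sub.length();
        for (int i = 0; i + n <= str.length(); i++) {
            // Simplification: only windows of exactly sub.length() chars are tried.
            if (collator.compare(str.substring(i, i + n), sub) == 0) {
                return i + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences.
        Collator caseInsensitive = Collator.getInstance(Locale.ENGLISH);
        caseInsensitive.setStrength(Collator.PRIMARY);

        // TERTIARY strength distinguishes upper and lower case.
        Collator caseSensitive = Collator.getInstance(Locale.ENGLISH);
        caseSensitive.setStrength(Collator.TERTIARY);

        System.out.println(instr("SparkSQL", "sql", caseInsensitive)); // prints 6
        System.out.println(instr("SparkSQL", "sql", caseSensitive));   // prints 0
    }
}
```

A real implementation would more likely rely on _StringSearch_, which reports both the start and the length of each match and so avoids the fixed-window limitation.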

 

The goal for this Jira ticket is to implement the *StringInstr* and *FindInSet* 
functions so that they support all collation types currently supported in 
Spark. To understand what changes were introduced in order to enable full 
collation support for other existing functions in Spark, take a look at the 
Spark PRs and Jira tickets for completed tasks in this parent (for example: 
Contains, StartsWith, EndsWith).
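
For *FindInSet*, a similarly hedged sketch: it assumes the documented find_in_set contract (1-based index of the string within a comma-delimited list, 0 if it is not found or if it contains a comma) and, again, uses the JDK's java.text.Collator instead of ICU. The names CollatedFindInSet and findInSet are hypothetical, and whether the delimiter itself should be matched under the collation is left open.

```java
import java.text.Collator;
import java.util.Locale;

class CollatedFindInSet {
    // Hypothetical helper: 1-based index of `word` in the comma-delimited
    // `set`, comparing items under `collator`; 0 if absent or if `word`
    // itself contains a comma.
    static int findInSet(String word, String set, Collator collator) {
        if (word.indexOf(',') >= 0) {
            return 0;
        }
        // Limit -1 keeps trailing empty items, e.g. "a,b," has three items.
        String[] items = set.split(",", -1);
        for (int i = 0; i < items.length; i++) {
            if (collator.compare(items[i], word) == 0) {
                return i + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences.
        Collator caseInsensitive = Collator.getInstance(Locale.ENGLISH);
        caseInsensitive.setStrength(Collator.PRIMARY);

        System.out.println(findInSet("B", "a,b,c", caseInsensitive));   // prints 2
        System.out.println(findInSet("a,b", "a,b,c", caseInsensitive)); // prints 0
    }
}
```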

 

Read more about ICU [Collation Concepts|http://example.com/] and the 
[Collator|http://example.com/] class, as well as _StringSearch_, using the 
[ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and the 
[ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. 
Also, refer to the Unicode Technical Standards for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

  was:
Enable collation support for the *StringInstr* and *FindInSet* built-in string 
functions in Spark. One way to go about this is to consider using 
{_}StringSearch{_}, an efficient ICU service for string matching. Implement the 
corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
(CollationSuite) to reflect how this function should be used with collation in 
SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment with 
the existing functions to learn more about how they work. In addition, look 
into the possible use-cases and implementation of similar functions within 
other open-source DBMS, such as 
[PostgreSQL|https://www.postgresql.org/docs/].

 

The goal for this Jira ticket is to implement the *StringInstr* and *FindInSet* 
functions so that they support all collation types currently supported in 
Spark. To understand what changes were introduced in order to enable full 
collation support for other existing functions in Spark, take a look at the 
Spark PRs and Jira tickets for completed tasks in this parent (for example: 
Contains, StartsWith, EndsWith).

 

Read more about ICU [Collation Concepts|http://example.com/] and 
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU 
user 
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and [ICU 
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> SubstringIndex, StringLocate (all collations)
> ---------------------------------------------
>
>                 Key: SPARK-47477
>                 URL: https://issues.apache.org/jira/browse/SPARK-47477
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
>
> Enable collation support for the *StringInstr* and *FindInSet* built-in 
> string functions in Spark. First confirm what the expected behaviour is for 
> these functions when given collated strings, and then move on to 
> implementation and testing. One way to go about this is to consider using 
> _StringSearch_, an efficient ICU service for string matching. Implement 
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how these functions should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMS, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringInstr* and 
> *FindInSet* functions so that they support all collation types currently 
> supported in Spark. To understand what changes were introduced in order to 
> enable full collation support for other existing functions in Spark, take a 
> look at the Spark PRs and Jira tickets for completed tasks in this parent 
> (for example: Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and the 
> [Collator|http://example.com/] class, as well as _StringSearch_, using the 
> [ICU user guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
> and the 
> [ICU docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html]. 
> Also, refer to the Unicode Technical Standards for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


