[jira] [Updated] (SPARK-47295) startswith, endswith (non-binary collations)

Jira Wed, 06 Mar 2024 00:49:18 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-47295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uroš Bojanić updated SPARK-47295:
---------------------------------
    Description: 
Implement the built-in Spark string functions *startsWith* and *endsWith* using 
{_}StringSearch{_}, an efficient ICU service for string matching. Refer to the 
latest unit tests in CollationSuite to see how these functions are used in 
Spark SQL, and feel free to experiment with the existing functions in a Spark 
SQL editor of your choice to learn more about how they work.

 

Currently, these 2 functions support all collation types:
 # binary collations (UCS_BASIC, UNICODE) - special case: these collation 
types work using the existing string comparison functions, i.e. contains(), 
startsWith(), endsWith()
 # the special lowercase non-binary collation (UCS_BASIC_LCASE) - special 
case: this collation works by using lower() to convert both strings to 
lowercase and then applying the functions above
 # other non-binary collations (UNICODE_CI; special collations for various 
languages with case and accent sensitivity) - these collation types usually 
require special handling, which can sometimes be complex
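
The three groups listed above can be pictured as a small dispatch. The following is a minimal sketch only, not Spark's actual code: the class name, the {{Kind}} enum and both helper methods are hypothetical, and {{java.text.Collator}} at SECONDARY strength is assumed as a dependency-free stand-in for a case-insensitive ICU collation.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationDispatchSketch {
    // Hypothetical labels for the three collation groups listed above.
    enum Kind { BINARY, LOWERCASE, OTHER_NON_BINARY }

    // Assumption: java.text.Collator at SECONDARY strength (case differences
    // ignored) stands in for a case-insensitive ICU collation.
    private static Collator standInCollator() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.SECONDARY);
        return collator;
    }

    static boolean startsWith(String target, String prefix, Kind kind) {
        switch (kind) {
            case BINARY:
                // Group 1: exact comparison with the existing string function.
                return target.startsWith(prefix);
            case LOWERCASE:
                // Group 2: lowercase both sides, then compare as in group 1.
                return target.toLowerCase(Locale.ROOT)
                        .startsWith(prefix.toLowerCase(Locale.ROOT));
            default:
                // Group 3: true if some leading slice of target is
                // collation-equal to prefix (naive; slices by UTF-16 index).
                Collator collator = standInCollator();
                for (int end = 0; end <= target.length(); end++) {
                    if (collator.equals(target.substring(0, end), prefix)) {
                        return true;
                    }
                }
                return false;
        }
    }

    static boolean endsWith(String target, String suffix, Kind kind) {
        switch (kind) {
            case BINARY:
                return target.endsWith(suffix);
            case LOWERCASE:
                return target.toLowerCase(Locale.ROOT)
                        .endsWith(suffix.toLowerCase(Locale.ROOT));
            default:
                // Group 3: true if some trailing slice of target is
                // collation-equal to suffix.
                Collator collator = standInCollator();
                for (int start = target.length(); start >= 0; start--) {
                    if (collator.equals(target.substring(start), suffix)) {
                        return true;
                    }
                }
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(startsWith("DataFrame", "data", Kind.BINARY));
        System.out.println(startsWith("DataFrame", "data", Kind.LOWERCASE));
        System.out.println(endsWith("DataFrame", "FRAME", Kind.OTHER_NON_BINARY));
    }
}
```

The third branch is deliberately naive (one comparison per candidate slice). It is meant to show why this group needs separate handling, not how an efficient implementation would look.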

 

To understand what changes were introduced in order to enable collation support 
for these functions, take a look at the Spark PRs and Jira tickets below:
 * [https://github.com/apache/spark/pull/45216] this PR enables:
 ** partial collation support for *contains* (skipping the 3rd type of 
collations shown above)
 ** complete collation support for {*}startsWith{*}, *endsWith* (using a 
special _matchAt_ implementation directly in {_}UTF8String{_})

 * [https://github.com/apache/spark/pull/45382] this PR enables:
 ** complete collation support for *contains* (using {_}StringSearch{_}) _-> 
now we should also use this approach for startsWith & endsWith_

 

Focusing on the 3rd type of collations as shown above, the goal for this Jira 
ticket is to re-implement the *startsWith* and *endsWith* functions so that 
they use _StringSearch_ instead (following the general logic in the second PR). 
As for the current test cases in CollationSuite, they should already mostly 
cover the expected behaviour of *startsWith* and *endsWith* for the 3rd type of 
collations.
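
One way to picture the search-based approach: a prefix test succeeds when the first match starts at index 0, and a suffix test succeeds when the last match ends exactly at the end of the target. ICU4J's _StringSearch_ exposes {{first()}}, {{last()}} and {{getMatchLength()}} for this kind of check. The sketch below does not use ICU4J. It assumes a hypothetical {{NaiveSearch}} class that mimics those three calls on top of {{java.text.Collator}}, so that it runs without extra dependencies. It illustrates the idea only and is not the implementation from the PRs above.

```java
import java.text.Collator;
import java.util.Locale;

final class SearchBasedAffixSketch {
    // Hypothetical sentinel for "no match found".
    static final int DONE = -1;

    // Naive stand-in for a collation-aware searcher; not ICU's algorithm.
    static final class NaiveSearch {
        private final String pattern;
        private final String target;
        private final Collator collator;
        private int matchLength = 0;

        NaiveSearch(String pattern, String target, Collator collator) {
            this.pattern = pattern;
            this.target = target;
            this.collator = collator;
        }

        // Start index of the first collation-equal slice, or DONE.
        int first() {
            for (int start = 0; start <= target.length(); start++) {
                if (matchesAt(start)) return start;
            }
            return DONE;
        }

        // Start index of the last collation-equal slice, or DONE.
        int last() {
            for (int start = target.length(); start >= 0; start--) {
                if (matchesAt(start)) return start;
            }
            return DONE;
        }

        // Length of the slice found by the most recent first() or last().
        int getMatchLength() {
            return matchLength;
        }

        // Tries the longest slice first, so a match that reaches the end
        // of the target is preferred.
        private boolean matchesAt(int start) {
            for (int end = target.length(); end >= start; end--) {
                if (collator.equals(target.substring(start, end), pattern)) {
                    matchLength = end - start;
                    return true;
                }
            }
            return false;
        }
    }

    // Prefix test: the first match must start at index 0.
    static boolean startsWith(String target, String prefix, Collator collator) {
        return new NaiveSearch(prefix, target, collator).first() == 0;
    }

    // Suffix test: the last match must end at the end of the target.
    static boolean endsWith(String target, String suffix, Collator collator) {
        NaiveSearch search = new NaiveSearch(suffix, target, collator);
        int start = search.last();
        return start != DONE && start + search.getMatchLength() == target.length();
    }

    // Assumption: SECONDARY strength approximates a case-insensitive collation.
    static Collator caseInsensitive() {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.SECONDARY);
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = caseInsensitive();
        System.out.println(startsWith("DataFrame", "data", collator));
        System.out.println(endsWith("DataFrame", "FRAME", collator));
    }
}
```

Framing both functions as conditions on search results keeps the matching logic in one place, which is the appeal of reusing the approach already taken for *contains*.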

 

Read more about StringSearch 
[here|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and 
[here|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standards for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] (UTS #10) and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback] (UTS #35).

  was:
Implement *startsWith* and *endsWith* built-in string Spark functions using 
{_}StringSearch{_}, an efficient ICU service for string matching. Currently, 
these 2 functions support all collation types:
 # binary collations (UCS_BASIC, UNICODE) *special cases - these collation 
types work using the existing string comparison functions - i.e. contains(), 
startsWith(), endsWith()
 # special lowercase non-binary collations (UCS_BASIC) *special case - these 
collation types work by using lower() to convert both strings to lowercase, and 
then use above functions
 # other non-binary collations (UNICODE_CI; special collations for various 
languages with case and accent sensitivity) - these collation types usually 
require special handling, which can sometimes be complex

 

Refer to the latest unit tests in CollationSuite to understand how these 
functions are used in SparkSQL, and feel free to use your chosen Spark SQL 
Editor to play around with the existing functions to learn more about how they 
work. To understand what changes were introduced in order to enable collation 
support for these functions, take a look at the Spark PRs and Jira tickets 
below:
 * [https://github.com/apache/spark/pull/45216] this PR enables:
 ** partial collation support for *contains* (skipping the 3rd type of 
collations shown above)
 ** complete collation support for {*}startsWith{*}, *endsWith* (using a 
special _matchAt_ implementation directly in {_}UTF8String{_})

 * [https://github.com/apache/spark/pull/45382] this PR enables:
 ** complete collation support for *contains* (using {_}StringSearch{_}) _-> 
now we should also use this approach for startsWith & endsWith_

 

Focusing on the 3rd type of collations as shown above, the goal for this Jira 
ticket is to re-implement the *startsWith* and *endsWith* functions so that 
they use _StringSearch_ instead (following the general logic in the second PR). 
As for the current test cases in CollationSuite, they should already mostly 
cover the expected behaviour of *startsWith* and *endsWith* for the 3rd type of 
collations.

 

Read more about StringSearch 
[here|https://unicode-org.github.io/icu/userguide/collation/string-search.html] 
and 
[here|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
 Also, refer to the Unicode Technical Standard for string 
[searching|https://www.unicode.org/reports/tr10/#Searching] and 
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


> startswith, endswith (non-binary collations)
> --------------------------------------------
>
>                 Key: SPARK-47295
>                 URL: https://issues.apache.org/jira/browse/SPARK-47295
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major



