[ https://issues.apache.org/jira/browse/SPARK-47295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-47295:
-----------------------------------
    Labels: pull-request-available  (was: )

> startswith, endswith (non-binary collations)
> --------------------------------------------
>
>                 Key: SPARK-47295
>                 URL: https://issues.apache.org/jira/browse/SPARK-47295
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
>              Labels: pull-request-available
>
> Implement *startsWith* and *endsWith* built-in string Spark functions using
> {_}StringSearch{_}, an efficient ICU service for string matching. Refer to
> the latest unit tests in CollationSuite to understand how these functions are
> used in SparkSQL, and feel free to use your chosen Spark SQL Editor to play
> around with the existing functions to learn more about how they work.
>
> Currently, these 2 functions support all collation types:
> # binary collations (UCS_BASIC, UNICODE) *special cases - these collation
> types work using the existing string comparison functions - i.e.
> contains(), startsWith(), endsWith()
> # special lowercase non-binary collations (UCS_BASIC_LCASE) *special case - these
> collation types work by using lower() to convert both strings to lowercase,
> and then using the above functions
> # other non-binary collations (UNICODE_CI; special collations for various
> languages with case and accent sensitivity) - these collation types usually
> require special handling, which can sometimes be complex
>
> To understand what changes were introduced in order to enable collation
> support for these functions, take a look at the Spark PRs and Jira tickets
> below:
> * [https://github.com/apache/spark/pull/45216] this PR enables:
> ** partial collation support for *contains* (skipping the 3rd type of
> collations shown above)
> ** complete collation support for {*}startsWith{*}, *endsWith* (using a
> special _matchAt_ implementation directly in {_}UTF8String{_})
> * [https://github.com/apache/spark/pull/45382] this PR enables:
> ** complete collation support for *contains* (using {_}StringSearch{_}) _->
> now we should also use this approach for startsWith & endsWith_
>
> Focusing on the 3rd type of collations as shown above, the goal for this Jira
> ticket is to re-implement the *startsWith* and *endsWith* functions so that
> they use _StringSearch_ instead (following the general logic in the second
> PR). As for the current test cases in CollationSuite, they should already
> mostly cover the expected behaviour of *startsWith* and *endsWith* for the
> 3rd type of collations.
>
> Read more about _StringSearch_ using the [ICU user
> guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
> and [ICU
> docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
> Also, refer to the Unicode Technical Standard for string
> [searching|https://www.unicode.org/reports/tr10/#Searching] and
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
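
For illustration only, the three collation paths described in the ticket can be sketched in plain Python. This is not Spark's implementation and it does not use ICU: the function names (fold_key, starts_with, ends_with) are hypothetical, and fold_key is an assumed stand-in for a real collator, built from Unicode case folding plus NFD normalization. Only the collation names come from the ticket. The third path is meant to show why a collation-aware prefix match cannot simply compare the first len(prefix) characters.

```python
import unicodedata

BINARY = {"UCS_BASIC", "UNICODE"}   # path 1: plain string comparison
LOWERCASE = {"UCS_BASIC_LCASE"}     # path 2: lower() both sides first


def fold_key(s: str) -> str:
    """Hypothetical stand-in for a collation key (NOT an ICU collator)."""
    return unicodedata.normalize("NFD", s.casefold())


def starts_with(target: str, prefix: str, collation: str = "UCS_BASIC") -> bool:
    if collation in BINARY:
        return target.startswith(prefix)
    if collation in LOWERCASE:
        return target.lower().startswith(prefix.lower())
    # Path 3 (any other collation name): the matched span in `target` may be
    # shorter or longer than `prefix`, so try every candidate end position.
    want = fold_key(prefix)
    return any(fold_key(target[:end]) == want for end in range(len(target) + 1))


def ends_with(target: str, suffix: str, collation: str = "UCS_BASIC") -> bool:
    if collation in BINARY:
        return target.endswith(suffix)
    if collation in LOWERCASE:
        return target.lower().endswith(suffix.lower())
    # Mirror image of starts_with: try every candidate start position.
    want = fold_key(suffix)
    return any(fold_key(target[start:]) == want for start in range(len(target) + 1))
```

In this sketch, starts_with("Straße", "STRASS", "UNICODE_CI") matches because case folding expands "ß" to "ss", while the same call with "UCS_BASIC_LCASE" does not, since lower() leaves "ß" unchanged. A real implementation would presumably delegate the third path to ICU's StringSearch, as the ticket proposes, rather than scanning candidate positions like this.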