[
https://issues.apache.org/jira/browse/SPARK-47409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uroš Bojanić updated SPARK-47409:
-
Description:
Enable collation support for the *StringTrim* built-in string function in Spark
(including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, {*}StringTrimRight{*}).
First confirm the expected behaviour of these functions when given
collated strings, and then move on to implementation and testing. One way to go
about this is to consider using {_}StringSearch{_}, an efficient ICU service
for string matching. Implement the corresponding unit tests
(CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how
this function should be used with collation in SparkSQL, and feel free to use
your chosen Spark SQL Editor to experiment with the existing functions to learn
more about how they work. In addition, look into the possible use-cases and
implementation of similar functions within other open-source DBMS, such
as [PostgreSQL|https://www.postgresql.org/docs/].
The goal for this Jira ticket is to implement the *StringTrim* function so it
supports binary & lowercase collation types currently supported in Spark. To
understand what changes were introduced in order to enable full collation
support for other existing functions in Spark, take a look at the Spark PRs and
Jira tickets for completed tasks in this parent (for example: Contains,
StartsWith, EndsWith).
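
As a rough illustration of the behaviour to confirm, the sketch below trims a set of characters from both ends of a string under two comparison modes. It is not Spark's implementation: the class {{CollationTrimSketch}}, the {{Collation}} enum and the {{trimBoth}} helper are hypothetical names, and it assumes that lowercase collation can be approximated by comparing lowercased code points one by one.

```java
import java.util.HashSet;
import java.util.Set;

public class CollationTrimSketch {
    // Hypothetical collation selector for this sketch; not a Spark type.
    enum Collation { BINARY, LOWERCASE }

    // Assumption: lowercase collation is approximated by lowercasing each code point.
    static int normalize(int codePoint, Collation collation) {
        return collation == Collation.LOWERCASE ? Character.toLowerCase(codePoint) : codePoint;
    }

    // Removes leading and trailing code points that match any code point in trimChars.
    static String trimBoth(String src, String trimChars, Collation collation) {
        Set<Integer> trimSet = new HashSet<>();
        trimChars.codePoints().forEach(cp -> trimSet.add(normalize(cp, collation)));

        // Advance past matching code points at the start.
        int start = 0;
        while (start < src.length()) {
            int cp = src.codePointAt(start);
            if (!trimSet.contains(normalize(cp, collation))) break;
            start += Character.charCount(cp);
        }

        // Step back over matching code points at the end.
        int end = src.length();
        while (end > start) {
            int cp = src.codePointBefore(end);
            if (!trimSet.contains(normalize(cp, collation))) break;
            end -= Character.charCount(cp);
        }
        return src.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(trimBoth("xXabcXx", "x", Collation.BINARY));    // XabcX
        System.out.println(trimBoth("xXabcXx", "x", Collation.LOWERCASE)); // abc
    }
}
```

Under binary comparison only the exact character {{x}} is removed, whereas the lowercase mode also removes {{X}}. Left-only and right-only variants would keep just one of the two loops.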
Read more about ICU [Collation Concepts|http://example.com/] and the
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU
user
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
and [ICU
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
Also, refer to the Unicode Technical Standards for string
[searching|https://www.unicode.org/reports/tr10/#Searching] and
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
was:
Enable collation support for the *StringTrim* built-in string function in Spark
(including {*}StringTrimBoth{*}, {*}StringTrimLeft{*}, {*}StringTrimRight{*}).
First confirm the expected behaviour of these functions when given
collated strings, and then move on to implementation and testing. One way to go
about this is to consider using {_}StringSearch{_}, an efficient ICU service
for string matching. Implement the corresponding unit tests
(CollationStringExpressionsSuite) and E2E tests (CollationSuite) to reflect how
this function should be used with collation in SparkSQL, and feel free to use
your chosen Spark SQL Editor to experiment with the existing functions to learn
more about how they work. In addition, look into the possible use-cases and
implementation of similar functions within other open-source DBMS, such
as [PostgreSQL|https://www.postgresql.org/docs/].
The goal for this Jira ticket is to implement the *StringTrim* function so it
supports all collation types currently supported in Spark. To understand what
changes were introduced in order to enable full collation support for other
existing functions in Spark, take a look at the Spark PRs and Jira tickets for
completed tasks in this parent (for example: Contains, StartsWith, EndsWith).
Read more about ICU [Collation Concepts|http://example.com/] and the
[Collator|http://example.com/] class, as well as _StringSearch_ using the [ICU
user
guide|https://unicode-org.github.io/icu/userguide/collation/string-search.html]
and [ICU
docs|https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/StringSearch.html].
Also, refer to the Unicode Technical Standard for string
[searching|https://www.unicode.org/reports/tr10/#Searching] and
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
> StringTrim & StringTrimLeft/Right/Both (binary & lowercase collation only)
> --
>
> Key: SPARK-47409
> URL: https://issues.apache.org/jira/browse/SPARK-47409
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Enable collation support for the *StringTrim* built-in string function in
> Spark (including {*}StringTrimBoth{*}, {*}StringTrimLeft{*},
> {*}StringTrimRight{*}). First confirm the expected behaviour of
> these functions when given collated strings, and then move on to
> implementation and testing. One way to go about this is to consider using
> {_}StringSearch{_}, an efficient ICU service for string matching. Implement
> the corresponding unit tests (CollationStringExpressionsSuite) and E2E tests