[
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uroš Bojanić updated SPARK-47418:
-
Description:
Implement the {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in Spark
string functions using the optimized lowercase comparison approach introduced
by [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to
the latest design and code structure introduced by [~uros-db] in
https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation
support is added to Spark SQL expressions. In addition, review the previous
Jira tickets under the current parent to understand how *StringPredicate*
expressions are currently used and tested in Spark:
* [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
* [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
* [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
These tickets should help you understand what changes were introduced in order
to enable collation support for these functions. Lastly, feel free to use your
chosen Spark SQL Editor to play around with the existing functions and learn
more about how they work.
The goal of this Jira ticket is to improve the UTF8_BINARY_LCASE
implementation of the {*}contains{*}, {*}startsWith{*}, and *endsWith*
functions so that they use the optimized lowercase comparison approach
(following the general logic in Nikola's PR), and to benchmark the results
accordingly. As for testing, the existing unit test cases and end-to-end tests
should already fully cover the expected behaviour of *StringPredicate*
expressions for all collation types. In other words, the objective of this
ticket is only to enhance the internal implementation, without introducing any
user-facing changes to the Spark SQL API.
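To make the idea concrete, here is a minimal, self-contained sketch of what a lowercase-aware predicate with a fast path could look like. It is not the code from the PR above and it does not use Spark's UTF8String. The class name LcasePredicates and all of its helpers are hypothetical, and it works on plain java.lang.String under the assumption that "lowercase comparison" means comparing the Locale.ROOT lowercase forms of both strings. The fast path compares ASCII characters in place without allocating lowercase copies, and anything non-ASCII falls back to lowercasing both strings in full.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Spark.
final class LcasePredicates {
    private LcasePredicates() {}

    // True if every char in s[from, to) is ASCII.
    private static boolean isAscii(String s, int from, int to) {
        for (int i = from; i < to; i++) {
            if (s.charAt(i) >= 0x80) return false;
        }
        return true;
    }

    // Lowercases a single ASCII char; other chars are returned unchanged.
    private static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Caller guarantees offset + pattern.length() <= text.length()
    // and that both compared regions are ASCII.
    private static boolean asciiMatchesAt(String text, int offset, String pattern) {
        for (int j = 0; j < pattern.length(); j++) {
            if (asciiLower(text.charAt(offset + j)) != asciiLower(pattern.charAt(j))) {
                return false;
            }
        }
        return true;
    }

    // Slow path: allocate full lowercase copies (assumed reference semantics).
    private static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    static boolean startsWith(String text, String pattern) {
        int n = Math.min(text.length(), pattern.length());
        if (isAscii(pattern, 0, pattern.length()) && isAscii(text, 0, n)) {
            return pattern.length() <= text.length() && asciiMatchesAt(text, 0, pattern);
        }
        return lower(text).startsWith(lower(pattern));
    }

    static boolean endsWith(String text, String pattern) {
        int start = Math.max(0, text.length() - pattern.length());
        if (isAscii(pattern, 0, pattern.length()) && isAscii(text, start, text.length())) {
            return pattern.length() <= text.length() && asciiMatchesAt(text, start, pattern);
        }
        return lower(text).endsWith(lower(pattern));
    }

    static boolean contains(String text, String pattern) {
        if (isAscii(text, 0, text.length()) && isAscii(pattern, 0, pattern.length())) {
            // Naive scan; each candidate offset is compared without allocation.
            for (int i = 0; i + pattern.length() <= text.length(); i++) {
                if (asciiMatchesAt(text, i, pattern)) return true;
            }
            return false;
        }
        return lower(text).contains(lower(pattern));
    }

    public static void main(String[] args) {
        System.out.println(startsWith("Spark SQL", "SPARK"));
        System.out.println(endsWith("Spark SQL", "sql"));
        System.out.println(contains("Spark SQL", "scala"));
    }
}
```

The fallback keeps the sketch simple at the cost of allocating two lowercase copies whenever a non-ASCII character appears in the compared region. A real implementation would operate on UTF-8 bytes and would need to be checked against the existing collation test suites and benchmarks mentioned above.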
Finally, feel free to refer to the Unicode Technical Standards for string
[searching|https://www.unicode.org/reports/tr10/#Searching] and
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major