[jira] [Updated] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47418:
---
Labels: pull-request-available  (was: )

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
> Labels: pull-request-available
>
> Implement the {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in Spark
> string functions using the optimized lowercase comparison approach introduced
> by [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to
> the latest design and code structure introduced by [~uros-db] in
> https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation
> support is added to Spark SQL expressions. In addition, review the previous
> Jira tickets under the current parent to understand how
> *StringPredicate* expressions are currently used and tested in Spark:
>  * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
>  * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
>  * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
> These tickets show which changes were introduced to enable collation support
> for these functions. Lastly, feel free to use a Spark SQL editor of your
> choice to experiment with the existing functions and learn more about how
> they work.
>  
> The goal of this Jira ticket is to improve the UTF8_BINARY_LCASE
> implementation of the {*}contains{*}, {*}startsWith{*}, and *endsWith*
> functions so that they use the optimized lowercase comparison approach
> (following the general logic in Nikola's PR), and to benchmark the results.
> As for testing, the existing unit tests and end-to-end tests should already
> fully cover the expected behaviour of *StringPredicate* expressions for all
> collation types. In other words, the objective of this ticket is only to
> improve the internal implementation, without introducing any user-facing
> changes to the Spark SQL API.
>  
> Finally, feel free to refer to the Unicode Technical Standards on string
> [searching|https://www.unicode.org/reports/tr10/#Searching] and
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47418:
-
Description: 
Implement the {*}contains{*}, {*}startsWith{*}, and *endsWith* built-in Spark
string functions using the optimized lowercase comparison approach introduced
by [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to
the latest design and code structure introduced by [~uros-db] in
https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation
support is added to Spark SQL expressions. In addition, review the previous
Jira tickets under the current parent to understand how
*StringPredicate* expressions are currently used and tested in Spark:
 * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
 * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
 * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]

These tickets show which changes were introduced to enable collation support
for these functions. Lastly, feel free to use a Spark SQL editor of your choice
to experiment with the existing functions and learn more about how they work.

 

The goal of this Jira ticket is to improve the UTF8_BINARY_LCASE
implementation of the {*}contains{*}, {*}startsWith{*}, and *endsWith*
functions so that they use the optimized lowercase comparison approach
(following the general logic in Nikola's PR), and to benchmark the results. As
for testing, the existing unit tests and end-to-end tests should already fully
cover the expected behaviour of *StringPredicate* expressions for all collation
types. In other words, the objective of this ticket is only to improve the
internal implementation, without introducing any user-facing changes to the
Spark SQL API.
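
To make the idea concrete, here is a rough sketch of what "lowercase comparison without lowercasing everything up front" could look like. It is not the code from Nikola's PR and it is not a Spark API: the class and method names (LcasePredicates, matchesAt, containsBaseline) are made up for illustration, and it works on java.lang.String rather than Spark's UTF8String to stay self-contained. The assumption is that a case-insensitive match on all-ASCII input can be decided character by character with no allocation, while any non-ASCII input falls back to the baseline of lowercasing both operands.

```java
import java.util.Locale;

// Hypothetical helper class: illustrative names only, not part of Spark.
class LcasePredicates {

    // Baseline: allocate lowercased copies of both operands, then compare.
    static boolean containsBaseline(String target, String pattern) {
        return target.toLowerCase(Locale.ROOT).contains(pattern.toLowerCase(Locale.ROOT));
    }

    // True if every char of s is in the ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) return false;
        }
        return true;
    }

    // Lowercases a single ASCII letter; any other char is returned unchanged.
    static char asciiLower(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // Checks whether pattern occurs in target at the given offset, ignoring
    // ASCII case, without allocating an intermediate string.
    static boolean matchesAt(String target, int offset, String pattern) {
        if (offset < 0 || pattern.length() > target.length() - offset) return false;
        for (int i = 0; i < pattern.length(); i++) {
            if (asciiLower(target.charAt(offset + i)) != asciiLower(pattern.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    static boolean startsWith(String target, String pattern) {
        if (isAscii(target) && isAscii(pattern)) {
            return matchesAt(target, 0, pattern);
        }
        // Non-ASCII input: fall back to lowercasing both operands.
        return target.toLowerCase(Locale.ROOT).startsWith(pattern.toLowerCase(Locale.ROOT));
    }

    static boolean endsWith(String target, String pattern) {
        if (isAscii(target) && isAscii(pattern)) {
            return matchesAt(target, target.length() - pattern.length(), pattern);
        }
        return target.toLowerCase(Locale.ROOT).endsWith(pattern.toLowerCase(Locale.ROOT));
    }

    static boolean contains(String target, String pattern) {
        if (isAscii(target) && isAscii(pattern)) {
            // Try every offset at which the pattern could still fit.
            for (int offset = 0; offset <= target.length() - pattern.length(); offset++) {
                if (matchesAt(target, offset, pattern)) return true;
            }
            return false;
        }
        return containsBaseline(target, pattern);
    }
}
```

On all-ASCII input the fast path gives the same answer as the baseline, because lowercasing with Locale.ROOT only changes A-Z there. That is the property the existing tests would be expected to confirm for the real implementation, and the baseline method is the natural thing to benchmark the fast path against.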

 

Finally, feel free to refer to the Unicode Technical Standards on string
[searching|https://www.unicode.org/reports/tr10/#Searching] and
[collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
>






[jira] [Updated] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47418:
-
Summary: Optimize string predicate expressions for UTF8_BINARY_LCASE 
collation  (was: TBD)

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Uroš Bojanić
> Priority: Major
>



