[jira] [Created] (SPARK-48432) Unnecessary Integer unboxing in UnivocityParser

2024-05-27 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48432:


 Summary: Unnecessary Integer unboxing in UnivocityParser
 Key: SPARK-48432
 URL: https://issues.apache.org/jira/browse/SPARK-48432
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


`tokenIndexArr` is created as an array of `java.lang.Integer`. However, it is
used not only by the wrapped Java parser but also during parsing to look up
the correct token index, where every access unboxes the value unnecessarily.
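
One way to avoid the repeated unboxing, sketched here in Java under assumptions: keep a primitive `int[]` for the per-record lookup and box the indexes only once for the caller that needs `Integer` values. The names `TokenIndexDemo`, `boxedIndexes`, and `project` are hypothetical and are not taken from `UnivocityParser`.

```java
class TokenIndexDemo {
    // Box the indexes once, for a caller that requires Integer[] (hypothetical).
    static Integer[] boxedIndexes(int[] indexes) {
        Integer[] boxed = new Integer[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            boxed[i] = indexes[i]; // autoboxing happens here, once per index
        }
        return boxed;
    }

    // Per-record path: reads the primitive array, so no Integer.intValue() calls.
    static String[] project(String[] tokens, int[] tokenIndexes) {
        String[] selected = new String[tokenIndexes.length];
        for (int i = 0; i < tokenIndexes.length; i++) {
            selected[i] = tokens[tokenIndexes[i]];
        }
        return selected;
    }
}
```

The boxed copy is built once at setup time, while the loop that runs for every record touches only primitives.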



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48245) Typo in `BadRecordException` class doc

2024-05-12 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-48245:
-
Summary: Typo in `BadRecordException` class doc  (was: Typo in 
`BadRecordException`)

> Typo in `BadRecordException` class doc
> --
>
> Key: SPARK-48245
> URL: https://issues.apache.org/jira/browse/SPARK-48245
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Trivial
>







[jira] [Created] (SPARK-48245) Typo in `BadRecordException`

2024-05-12 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48245:


 Summary: Typo in `BadRecordException`
 Key: SPARK-48245
 URL: https://issues.apache.org/jira/browse/SPARK-48245
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev









[jira] [Created] (SPARK-48191) Support UTF-32 for string encode and decode

2024-05-08 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48191:


 Summary: Support UTF-32 for string encode and decode
 Key: SPARK-48191
 URL: https://issues.apache.org/jira/browse/SPARK-48191
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


It already works; it only needs to be enabled.






[jira] [Created] (SPARK-48169) Use lazy BadRecordException cause for StaxXmlParser and JacksonParser

2024-05-07 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48169:


 Summary: Use lazy BadRecordException cause for StaxXmlParser and 
JacksonParser
 Key: SPARK-48169
 URL: https://issues.apache.org/jira/browse/SPARK-48169
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


Since https://issues.apache.org/jira/browse/SPARK-48143, StaxXmlParser and
JacksonParser still use the old BadRecordException constructor.






[jira] [Updated] (SPARK-48166) Unwanted use of internal BadRecordException in VariantExpressionEvalUtils

2024-05-07 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-48166:
-
Summary: Unwanted use of internal BadRecordException in 
VariantExpressionEvalUtils  (was: Excessive use of internal BadRecordException 
in VariantExpressionEvalUtils)

> Unwanted use of internal BadRecordException in VariantExpressionEvalUtils
> -
>
> Key: SPARK-48166
> URL: https://issues.apache.org/jira/browse/SPARK-48166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Minor
>
> BadRecordException should not be used as a user-facing exception






[jira] [Created] (SPARK-48166) Excessive use of internal BadRecordException in VariantExpressionEvalUtils

2024-05-07 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48166:


 Summary: Excessive use of internal BadRecordException in 
VariantExpressionEvalUtils
 Key: SPARK-48166
 URL: https://issues.apache.org/jira/browse/SPARK-48166
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


BadRecordException should not be used as a user-facing exception






[jira] [Created] (SPARK-48143) UnivocityParser is slow when parsing partially-malformed CSV in PERMISSIVE mode

2024-05-06 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48143:


 Summary: UnivocityParser is slow when parsing partially-malformed 
CSV in PERMISSIVE mode
 Key: SPARK-48143
 URL: https://issues.apache.org/jira/browse/SPARK-48143
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


Parsing partially-malformed CSV in permissive mode is slow due to heavy 
exception construction
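
A minimal Java sketch of one way to make such an exception cheap to construct, under stated assumptions: the cause is supplied lazily and the stack trace is not captured. `LazyCauseException` is a hypothetical class for illustration and is not Spark's `BadRecordException`. Skipping the stack trace is an assumption of this sketch rather than something stated in the ticket.

```java
import java.util.function.Supplier;

class LazyCauseException extends RuntimeException {
    private final Supplier<Throwable> causeSupplier;
    private Throwable resolvedCause;

    LazyCauseException(String message, Supplier<Throwable> causeSupplier) {
        // enableSuppression=false, writableStackTrace=false: no stack walk on construction.
        super(message, null, false, false);
        this.causeSupplier = causeSupplier;
    }

    // The expensive cause is built only if a caller actually asks for it.
    @Override
    public synchronized Throwable getCause() {
        if (resolvedCause == null) {
            resolvedCause = causeSupplier.get();
        }
        return resolvedCause;
    }
}
```

With this shape, a caller that only counts or skips malformed records never pays for building the cause, while a caller that rethrows can still obtain it through `getCause()`.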






[jira] [Created] (SPARK-48114) ErrorClassesJsonReader compiles template regex on every template resolution

2024-05-03 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48114:


 Summary: ErrorClassesJsonReader compiles template regex on every 
template resolution
 Key: SPARK-48114
 URL: https://issues.apache.org/jira/browse/SPARK-48114
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


`SparkRuntimeException` uses `SparkThrowableHelper`, which uses
`ErrorClassesJsonReader` to build the error message string from templates in
`error-conditions.json`, but the template regex is recompiled on every
`SparkRuntimeException` constructor invocation.
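
The usual remedy is to compile the pattern once and reuse it. The Java sketch below illustrates that idea only. The class `TemplateFormatter`, the `<name>` placeholder syntax, and the regex itself are assumptions made for the example, not the actual `ErrorClassesJsonReader` code. It uses the `StringBuilder` overloads of `Matcher.appendReplacement` and `Matcher.appendTail`, which require Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateFormatter {
    // Compiled once when the class is loaded, not once per format() call.
    private static final Pattern PLACEHOLDER = Pattern.compile("<([a-zA-Z0-9_-]+)>");

    static String format(String template, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown placeholders are left in the output unchanged.
            String value = params.getOrDefault(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A compiled `Pattern` is immutable and safe to share between threads, so a static field is enough. Only the `Matcher` is created per call.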






[jira] [Updated] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite

2024-05-02 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-48072:
-
Summary: Test output is not descriptive for some Array comparisons in 
SQLQuerySuite  (was: Test output is not descriptive for parameterized DESCRIBE 
and EXPLAIN in SQLQuerySuite)

> Test output is not descriptive for some Array comparisons in SQLQuerySuite
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
>
> Actual and expected queries are not printed in the output






[jira] [Updated] (SPARK-48072) Test output is not descriptive for parameterized DESCRIBE and EXPLAIN in SQLQuerySuite

2024-05-02 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-48072:
-
Description: Actual and expected queries are not printed in the output  
(was: Actual query is not printed in the output)

> Test output is not descriptive for parameterized DESCRIBE and EXPLAIN in 
> SQLQuerySuite
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
>
> Actual and expected queries are not printed in the output






[jira] [Updated] (SPARK-48072) Test output is not descriptive for some Array comparisons in SQLQuerySuite

2024-05-02 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-48072:
-
Description: Actual and expected queries are not printed in the output when 
using `.sameElements`  (was: Actual and expected queries are not printed in the 
output)
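
A boolean-only comparison such as `.sameElements` leaves a failed assertion with nothing to print. As a rough Java analogue (the ticket itself concerns Scala test code), the sketch below uses `Arrays.equals` in the role of `.sameElements` and builds a failure message that includes both arrays. The helper `ArrayAssertions.assertSameElements` is hypothetical and is not part of `SQLQuerySuite`.

```java
import java.util.Arrays;

class ArrayAssertions {
    // Compares element-wise and reports both sides when the arrays differ.
    static void assertSameElements(Object[] actual, Object[] expected) {
        if (!Arrays.equals(actual, expected)) {
            throw new AssertionError(
                "Arrays differ.\nExpected: " + Arrays.toString(expected)
                    + "\nActual:   " + Arrays.toString(actual));
        }
    }
}
```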

> Test output is not descriptive for some Array comparisons in SQLQuerySuite
> --
>
> Key: SPARK-48072
> URL: https://issues.apache.org/jira/browse/SPARK-48072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Minor
>  Labels: pull-request-available
>
> Actual and expected queries are not printed in the output when using 
> `.sameElements`






[jira] [Created] (SPARK-48072) Test output is not descriptive for parameterized DESCRIBE and EXPLAIN in SQLQuerySuite

2024-05-01 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-48072:


 Summary: Test output is not descriptive for parameterized DESCRIBE 
and EXPLAIN in SQLQuerySuite
 Key: SPARK-48072
 URL: https://issues.apache.org/jira/browse/SPARK-48072
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


Actual query is not printed in the output






[jira] [Created] (SPARK-47939) Parameterized queries fail for DESCRIBE & EXPLAIN w/ UNBOUND_SQL_PARAMETER error

2024-04-22 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-47939:


 Summary: Parameterized queries fail for DESCRIBE & EXPLAIN w/ 
UNBOUND_SQL_PARAMETER error
 Key: SPARK-47939
 URL: https://issues.apache.org/jira/browse/SPARK-47939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


*Succeeds:* scala> spark.sql("select ?", Array(1)).show();

*Fails:* spark.sql("describe select ?", Array(1)).show();

*Fails:* spark.sql("explain select ?", Array(1)).show();

Failures are of the form:

org.apache.spark.sql.catalyst.ExtendedAnalysisException: 
[UNBOUND_SQL_PARAMETER] Found the unbound parameter: _16. Please, fix `args` 
and provide a mapping of the parameter to either a SQL literal or collection 
constructor functions such as `map()`, `array()`, `struct()`. SQLSTATE: 42P02; 
line 1 pos 16; 'Project [unresolvedalias(posparameter(16))] +- OneRowRelation






[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Parent: SPARK-46837
Issue Type: Sub-task  (was: Bug)

> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different lengths can be equal under case-insensitive 
> and lower-case collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testContains*.
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testEndsWith*.
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Created] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)
Vladimir Golubev created SPARK-47863:


 Summary: endsWith and startsWith don't work correctly for some 
collations
 Key: SPARK-47863
 URL: https://issues.apache.org/jira/browse/SPARK-47863
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Vladimir Golubev


*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.
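
The length mismatch can be reproduced with the standard library alone. In the Java sketch below, plain `toLowerCase(Locale.ROOT)` stands in for a case-insensitive collation. That is an assumption made for illustration and is not how `UNICODE_CI` or `CollationAwareUTF8String.matchAt` is implemented. `İ` (U+0130) occupies 2 bytes in UTF-8, while its lowercase form `i̇` (U+0069 followed by U+0307) occupies 3, so a suffix check that slices the target by the pattern's byte length looks at the wrong window. `ByteOffsetSuffixDemo` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ByteOffsetSuffixDemo {
    // Naive check: take the last N bytes of the target, where N is the
    // pattern's UTF-8 length, then compare the lowercased strings.
    // (In general such a slice can also cut a character in half.)
    static boolean endsWithByByteLength(String target, String pattern) {
        byte[] t = target.getBytes(StandardCharsets.UTF_8);
        byte[] p = pattern.getBytes(StandardCharsets.UTF_8);
        if (p.length > t.length) {
            return false;
        }
        String tail = new String(t, t.length - p.length, p.length, StandardCharsets.UTF_8);
        return tail.toLowerCase(Locale.ROOT).equals(pattern.toLowerCase(Locale.ROOT));
    }

    // Length-agnostic check: lowercase both sides first, then compare suffixes.
    static boolean endsWithAfterLowercasing(String target, String pattern) {
        return target.toLowerCase(Locale.ROOT).endsWith(pattern.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String target = "The \u0130o";  // "The İo"
        String pattern = "i\u0307o";    // "i̇o": 'i' + combining dot above + 'o'
        System.out.println(endsWithByByteLength(target, pattern));
        System.out.println(endsWithAfterLowercasing(target, pattern));
    }
}
```

For these inputs the byte-length variant compares the 4-byte tail " İo" (including the leading space) against the pattern and reports no match, while the lowercase-first variant finds the suffix.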






[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Description: 
*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.

  was:
*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.


> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different lengths can be equal under case-insensitive 
> and lower-case collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testContains*.
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testEndsWith*.
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Updated] (SPARK-47863) endsWith and startsWith don't work correctly for some collations

2024-04-15 Thread Vladimir Golubev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Golubev updated SPARK-47863:
-
Description: 
*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.

  was:
*CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
*CollationAwareUTF8String.matchAt*, which operates on byte offsets to compare 
prefixes/suffixes. This is not correct, since string parts (suffix/prefix) of 
different lengths can be equal under case-insensitive and lower-case 
collations.

Example test cases that highlight the problem:

- *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testContains*.
- *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
  *CollationSupportSuite.testEndsWith*.

The first passes, since it uses *StringSearch* directly; the second one does 
not.


> endsWith and startsWith don't work correctly for some collations
> 
>
> Key: SPARK-47863
> URL: https://issues.apache.org/jira/browse/SPARK-47863
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Vladimir Golubev
>Priority: Major
>
> *CollationSupport.EndsWith* and *CollationSupport.StartsWith* use 
> *CollationAwareUTF8String.matchAt*, which operates on byte offsets to 
> compare prefixes/suffixes. This is not correct, since string parts 
> (suffix/prefix) of different lengths can be equal under case-insensitive 
> and lower-case collations.
> Example test cases that highlight the problem:
> - *assertContains("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testContains*.
> - *assertEndsWith("The İo", "i̇o", "UNICODE_CI", true);* for 
>   *CollationSupportSuite.testEndsWith*.
> The first passes, since it uses *StringSearch* directly; the second one 
> does not.






[jira] [Commented] (SPARK-47418) Optimize string predicate expressions for UTF8_BINARY_LCASE collation

2024-04-12 Thread Vladimir Golubev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836505#comment-17836505
 ] 

Vladimir Golubev commented on SPARK-47418:
--

I'll work on that.

> Optimize string predicate expressions for UTF8_BINARY_LCASE collation
> -
>
> Key: SPARK-47418
> URL: https://issues.apache.org/jira/browse/SPARK-47418
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>
> Implement the *contains*, *startsWith*, and *endsWith* built-in Spark string 
> functions using the optimized lowercase comparison approach introduced by 
> [~nikolamand-db] in [https://github.com/apache/spark/pull/45816]. Refer to 
> the latest design and code structure imposed by [~uros-db] in 
> https://issues.apache.org/jira/browse/SPARK-47410 to understand how collation 
> support is introduced for Spark SQL expressions. In addition, review previous 
> Jira tickets under the current parent in order to understand how 
> *StringPredicate* expressions are currently used and tested in Spark:
>  * [SPARK-47131|https://issues.apache.org/jira/browse/SPARK-47131]
>  * [SPARK-47248|https://issues.apache.org/jira/browse/SPARK-47248]
>  * [SPARK-47295|https://issues.apache.org/jira/browse/SPARK-47295]
> These tickets should help you understand what changes were introduced in 
> order to enable collation support for these functions. Lastly, feel free to 
> use your chosen Spark SQL Editor to play around with the existing functions 
> and learn more about how they work.
>  
> The goal for this Jira ticket is to improve the UTF8_BINARY_LCASE 
> implementation for the *contains*, *startsWith*, and *endsWith* functions 
> so that they use the optimized lowercase comparison approach (following 
> the general logic in Nikola's PR), and benchmark the results accordingly. As 
> for testing, the currently existing unit test cases and end-to-end tests 
> should already fully cover the expected behaviour of *StringPredicate* 
> expressions for all collation types. In other words, the objective of this 
> ticket is only to enhance the internal implementation, without introducing 
> any user-facing changes to Spark SQL API.
>  
> Finally, feel free to refer to the Unicode Technical Standard for string 
> [searching|https://www.unicode.org/reports/tr10/#Searching] and 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].


