[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-12 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Description: 
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 
operations and ensure consistent testing across the codebase.

 

To further support the addition of collation support for new Spark expressions, 
here are a couple of guidelines to follow:

 

// 1. Collation-aware expression implementation

CollationSupport.java
 * should serve as a static entry point for collation-aware expressions, 
providing custom support
 * for example: one by one Spark expression with corresponding collation support
 * also note that: CollationAwareUTF8String should be used for collation-aware 
UTF8String operations & other utility methods

CollationFactory.java
 * should continue to serve as a static provider for high-level collation 
interface
 * for example: interacting with external ICU components such as Collator, 
StringSearch, etc.
 * also note that: no low-level / expression-specific code should generally be 
found here

UTF8String.java
 * should be largely collation-unaware, and generally be used only as storage, 
nothing else
 * for example: don’t change this class at all (with the only one-time 
exception of: semanticEquals/Compare)
 * also note that: no collation-aware operation implementations (using 
collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other 
“sql.catalyst.expressions” (for example: Between.scala)
 * should only contain minimal changes in order to re-route collation-aware 
implementations to CollationSupport
 * for example: most changes should be in relation to: adding collationId, 
using correct data types, replacements, etc.
 * also note that: nullSafeEval & doGenCode should likely note contain 
introduce extra branching based on collationId

 

// 2. Collation-aware expression testing

CollationSuite.scala
 * should be used for testing more general collation concepts
 * for example: collate/collation expressions, collation names, DDL, casting, 
aggregate, shuffle, join, etc.
 * also note that: no extra tests should generally be added for newly supported 
string expressions

CollationSupportSuite.java
 * should be used for expression unit tests, these tests should be as rigorous 
as possible in order to cover various cases
 * for example: unit tests that test collation-aware expression implementation 
for various collations (binary, lowercase, ICU)
 * also note that: these tests should generally be written after adding 
appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / 
CollationExpressionSuite.scala
 * should be used for expression end-to-end tests, these tests should only 
cover crucial expression behaviour
 * for example: SQL tests that verify query execution results, expected return 
data types, casting, unsupported collation handling, etc.
 * also note that: these tests should generally be written after enabling 
appropriate expression support in stringExpressions.scala

 

// 3. Closing notes
 * Carefully think about performance implications of newly added custom 
collation-aware expression implementation
 * for example: be very careful with extra string allocations (UTF8Strings -> 
(Java) String -> UTF8Strings, etc.)
 * also note that: some operations introduce very heavy performance penalties 
(we should avoid the ones we can)

 
 * Make sure to test all newly added expressions and completely (unit tests, 
end-to-end tests, etc.)
 * for example: consider edge, such as: empty strings, uppercase and lowercase 
mix, different byte-length chars, etc.
 * also note that: all similar tests should be uniform & readable and be kept 
in one place for various expressions

 
 * Consider how new expressions interact with the rest of the system (casting; 
collation support level - use correct AbstractStringType, etc.)
 * for example: we should watch out for casting, test it thoroughly, and use 
CollationTypeCasts if needed for new expressions
 * also note that: some expressions won’t support all collation types, so that 
should be clearly reflected in tests and dataTypes

  was:
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 

[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Description: 
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 
operations and ensure consistent testing across the codebase.

 

To further support the addition of collation support for new Spark expressions, 
here are a couple of guidelines to follow:

 

// 1. Collation-aware expression implementation

CollationSupport.java
 * should serve as a static entry point for collation-aware expressions, 
providing custom support
 * for example: one by one Spark expression with corresponding collation support
 * also note that: CollationAwareUTF8String should be used for collation-aware 
UTF8String operations & other utility methods

CollationFactory.java
 * should continue to serve as a static provider for high-level collation 
interface
 * for example: interacting with external ICU components such as Collator, 
StringSearch, etc.
 * also note that: no low-level / expression-specific code should generally be 
found here

UTF8String.java
 * should be largely collation-unaware, and generally be used only as storage, 
nothing else
 * for example: don’t change this class at all (with the only one-time 
exception of: semanticEquals/Compare)
 * also note that: no collation-aware operation implementations (using 
collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other 
“sql.catalyst.expressions” (for example: Between.scala)
 * should only contain minimal changes in order to re-route collation-aware 
implementations to CollationSupport
 * for example: most changes should be in relation to: adding collationId, 
using correct data types, replacements, etc.
 * also note that: nullSafeEval & doGenCode should likely note contain 
introduce extra branching based on collationId

 

// 2. Collation-aware expression testing

CollationSuite.scala
 * should be used for testing more general collation concepts
 * for example: collate/collation expressions, collation names, DDL, casting, 
aggregate, shuffle, join, etc.
 * also note that: no extra tests should generally be added

CollationSupportSuite.java
 * should be used for expression unit tests, these tests should be as rigorous 
as possible in order to cover various cases
 * for example: unit tests that test collation-aware expression implementation 
for various collations (binary, lowercase, ICU)
 * also note that: these tests should generally be written after adding 
appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / 
CollationExpressionSuite.scala
 * should be used for expression end-to-end tests, these tests should only 
cover crucial expression behaviour
 * for example: SQL tests that verify query execution results, expected return 
data types, casting, unsupported collation handling, etc.
 * also note that: these tests should generally be written after enabling 
appropriate expression support in stringExpressions.scala

 

// 3. Closing notes
 * Carefully think about performance implications of newly added custom 
collation-aware expression implementation
 * for example: be very careful with extra string allocations (UTF8Strings -> 
(Java) String -> UTF8Strings, etc.)
 * also note that: some operations introduce very heavy performance penalties 
(we should avoid the ones we can)

 
 * Make sure to test all newly added expressions and completely (unit tests, 
end-to-end tests, etc.)
 * for example: consider edge, such as: empty strings, uppercase and lowercase 
mix, different byte-length chars, etc.
 * also note that: all similar tests should be uniform & readable and be kept 
in one place for various expressions

 
 * Consider how new expressions interact with the rest of the system (casting; 
collation support level - use correct AbstractStringType, etc.)
 * for example: we should watch out for casting, test it thoroughly, and use 
CollationTypeCasts if needed for new expressions
 * also note that: some expressions won’t support all collation types, so that 
should be clearly reflected in tests and dataTypes

  was:
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 
operations and ensure consistent testing across 

[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-11 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Description: 
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changed introduced herein should simplify addition of new collation-aware 
operations and ensure consistent testing across the codebase.

 

To further support the addition of collation support for new Spark expressions, 
here are a couple of guidelines to follow:

 

// 1. Collation-aware expression implementation

CollationSupport.java
 * should serve as a static entry point for collation-aware expressions, 
providing custom support
 * for example: one by one Spark expression with corresponding collation support
 * also note that: CollationAwareUTF8String should be used for collation-aware 
UTF8String operations & other utility methods

CollationFactory.java
 * should continue to serve as a static provider for high-level collation 
interface
 * for example: interacting with external ICU components such as Collator, 
StringSearch, etc.
 * also note that: no low-level / expression-specific code should generally be 
found here

UTF8String.java
 * should be largely collation-unaware, and generally be used only as storage, 
nothing else
 * for example: don’t change this class at all (with the only one-time 
exception of: semanticEquals/Compare)
 * also note that: no collation-aware operation implementations (using 
collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other 
“sql.catalyst.expressions” (for example: Between.scala)
 * should only contain minimal changes in order to re-route collation-aware 
implementations to CollationSupport
 * for example: most changes should be in relation to: adding collationId, 
using correct data types, replacements, etc.
 * also note that: nullSafeEval & doGenCode should likely note contain 
introduce extra branching based on collationId

 

// 2. Collation-aware expression testing

CollationSuite.scala
 * should be used for testing more general collation concepts
 * for example: collate/collation expressions, collation names, DDL, casting, 
aggregate, shuffle, join, etc.
 * also note that: no extra tests should generally be added

CollationSupportSuite.java
 * should be used for expression unit tests, these tests should be as rigorous 
as possible in order to cover various cases
 * for example: unit tests that test collation-aware expression implementation 
for various collations (binary, lowercase, ICU)
 * also note that: these tests should generally be written after adding 
appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / 
CollationExpressionSuite.scala
 * should be used for expression end-to-end tests, these tests should only 
cover crucial expression behaviour
 * for example: SQL tests that verify query execution results, expected return 
data types, casting, unsupported collation handling, etc.
 * also note that: these tests should generally be written after enabling 
appropriate expression support in stringExpressions.scala

 

// 3. Closing notes
 * Carefully think about performance implications of newly added custom 
collation-aware expression implementation
 * for example: be very careful with extra string allocations (UTF8Strings -> 
(Java) String -> UTF8Strings, etc.)
 * also note that: some operations introduce very heavy performance penalties 
(we should avoid the ones we can)
 * Make sure to test all newly added expressions and completely (unit tests, 
end-to-end tests, etc.)
 * for example: consider edge, such as: empty strings, uppercase and lowercase 
mix, different byte-length chars, etc.
 * also note that: all similar tests should be uniform & readable and be kept 
in one place for various expressions
 * Consider how new expressions interact with the rest of the system (casting; 
collation support level - use correct AbstractStringType, etc.)
 * for example: we should watch out for casting, test it thoroughly, and use 
CollationTypeCasts if needed for new expressions
 * also note that: some expressions won’t support all collation types, so that 
should be clearly reflected in tests and dataTypes

> refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>
> This 

[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47410:
---
Labels: pull-request-available  (was: )

> refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

2024-04-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Summary: refactor UTF8String and CollationFactory  (was: Refactor 
UTF8String and CollationFactory)

> refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47410) Refactor UTF8String and CollationFactory

2024-04-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uroš Bojanić updated SPARK-47410:
-
Summary: Refactor UTF8String and CollationFactory  (was: StringTrimLeft, 
StringTrimRight (all collations))

> Refactor UTF8String and CollationFactory
> 
>
> Key: SPARK-47410
> URL: https://issues.apache.org/jira/browse/SPARK-47410
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org