[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory

Jira Wed, 10 Apr 2024 23:30:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Uroš Bojanić updated SPARK-47410:
---------------------------------
    Description: 
This ticket addresses the need to refactor the {{UTF8String}} and 
{{CollationFactory}} classes within Spark to enhance support for 
collation-aware expressions. The goal is to improve code structure, 
maintainability, readability, and testing coverage for collation-aware Spark 
SQL expressions.

The changes introduced herein should simplify the addition of new 
collation-aware operations and ensure consistent testing across the codebase.

 

To further support the addition of collation support for new Spark expressions, 
here are a couple of guidelines to follow:

 

// 1. Collation-aware expression implementation

CollationSupport.java
 * should serve as a static entry point for collation-aware expressions, 
providing custom support
 * for example: one entry per Spark expression, each with its corresponding 
collation support
 * also note that: CollationAwareUTF8String should be used for collation-aware 
UTF8String operations & other utility methods

CollationFactory.java
 * should continue to serve as a static provider for high-level collation 
interface
 * for example: interacting with external ICU components such as Collator, 
StringSearch, etc.
 * also note that: no low-level / expression-specific code should generally be 
found here
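
To illustrate the "static provider" role described above, here is a minimal sketch. It is not the actual CollationFactory code: the class name CollationProviderSketch, the Collation holder, the fetch method and the collation names are all hypothetical, and the JDK's java.text.Collator stands in for the ICU Collator so that the example runs without extra dependencies.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a static provider for high-level collation objects. */
final class CollationProviderSketch {

    /** High-level description only: a name and a comparator, no expression logic. */
    static final class Collation {
        final String name;
        final Comparator<String> comparator;

        Collation(String name, Comparator<String> comparator) {
            this.name = name;
            this.comparator = comparator;
        }
    }

    // Collations are built once and then reused by every caller.
    private static final Map<String, Collation> CACHE = new ConcurrentHashMap<>();

    private CollationProviderSketch() {}

    static Collation fetch(String name) {
        return CACHE.computeIfAbsent(name, CollationProviderSketch::build);
    }

    private static Collation build(String name) {
        switch (name) {
            case "BINARY":
                // String.compareTo uses UTF-16 code-unit order; it stands in for byte order here.
                return new Collation(name, Comparator.naturalOrder());
            case "CASE_INSENSITIVE": {
                // java.text.Collator is an assumption of this sketch, used in place of ICU.
                Collator collator = Collator.getInstance(Locale.ENGLISH);
                collator.setStrength(Collator.PRIMARY); // ignores case (and accent) differences
                return new Collation(name, (a, b) -> collator.compare(a, b));
            }
            default:
                throw new IllegalArgumentException("Unsupported collation: " + name);
        }
    }
}
```

Callers would only ask the provider for a collation and use its comparator; anything specific to a single expression stays out of this class.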

UTF8String.java
 * should be largely collation-unaware, and generally be used only as storage, 
nothing else
 * for example: don’t change this class at all (with the only one-time 
exception of: semanticEquals/Compare)
 * also note that: no collation-aware operation implementations (using 
collationId) should be put here

stringExpressions.scala / regexpExpressions.scala / other 
“sql.catalyst.expressions” (for example: Between.scala)
 * should only contain minimal changes in order to re-route collation-aware 
implementations to CollationSupport
 * for example: most changes should be in relation to: adding collationId, 
using correct data types, replacements, etc.
 * also note that: nullSafeEval & doGenCode should likely not introduce extra 
branching based on collationId
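
The re-routing described in section 1 could look roughly like the sketch below. Everything in it is hypothetical: the class names CollationSupportSketch and ContainsExpressionSketch, the collation id values, and the use of plain java.lang.String instead of UTF8String are assumptions made so the example is self-contained. The point is only the shape: the expression stores a collationId and delegates, while all per-collation branching lives behind one static entry point.

```java
import java.util.Locale;

/** Hypothetical stand-in for a static entry point for collation-aware expressions. */
final class CollationSupportSketch {
    // Collation ids invented for this sketch; they are not Spark's real values.
    static final int BINARY = 0;
    static final int LOWERCASE = 1;

    private CollationSupportSketch() {}

    /** One nested class per expression keeps each implementation in one place. */
    static final class Contains {
        static boolean exec(String l, String r, int collationId) {
            switch (collationId) {
                case BINARY:
                    return execBinary(l, r);
                case LOWERCASE:
                    return execLowercase(l, r);
                default:
                    throw new IllegalArgumentException("Unsupported collationId: " + collationId);
            }
        }

        static boolean execBinary(String l, String r) {
            return l.contains(r);
        }

        static boolean execLowercase(String l, String r) {
            // Allocates two lowercased copies; acceptable in a sketch, costly in a hot path.
            return l.toLowerCase(Locale.ROOT).contains(r.toLowerCase(Locale.ROOT));
        }
    }
}

/** Hypothetical expression: it only carries collationId and re-routes the call. */
final class ContainsExpressionSketch {
    private final int collationId;

    ContainsExpressionSketch(int collationId) {
        this.collationId = collationId;
    }

    boolean nullSafeEval(String left, String right) {
        // No branching on collationId here; the entry point decides.
        return CollationSupportSketch.Contains.exec(left, right, collationId);
    }
}
```

With this shape, supporting a new collation touches only the entry point, and the expression class stays unchanged.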

 

// 2. Collation-aware expression testing

CollationSuite.scala
 * should be used for testing more general collation concepts
 * for example: collate/collation expressions, collation names, DDL, casting, 
aggregate, shuffle, join, etc.
 * also note that: no extra tests should generally be added

CollationSupportSuite.java
 * should be used for expression unit tests; these tests should be as rigorous 
as possible in order to cover various cases
 * for example: unit tests that test collation-aware expression implementation 
for various collations (binary, lowercase, ICU)
 * also note that: these tests should generally be written after adding 
appropriate expression support in CollationSupport.java

CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / 
CollationExpressionSuite.scala
 * should be used for expression end-to-end tests; these tests should only 
cover crucial expression behaviour
 * for example: SQL tests that verify query execution results, expected return 
data types, casting, unsupported collation handling, etc.
 * also note that: these tests should generally be written after enabling 
appropriate expression support in stringExpressions.scala
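
As an illustration of the unit-test style described for CollationSupportSuite.java, the sketch below runs the same kind of assertion over several collations with one uniform helper. It does not use Spark or a test framework: the contains function is a plain-Java stand-in for the implementation under test, and the names CollationSupportSuiteSketch, assertContains, "BINARY" and "LOWERCASE" are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch of uniform, per-collation unit checks for one expression. */
final class CollationSupportSuiteSketch {

    private CollationSupportSuiteSketch() {}

    // Stand-in for the implementation under test (plain Java strings, invented names).
    static boolean contains(String l, String r, String collation) {
        switch (collation) {
            case "BINARY":
                return l.contains(r);
            case "LOWERCASE":
                return l.toLowerCase(Locale.ROOT).contains(r.toLowerCase(Locale.ROOT));
            default:
                throw new IllegalArgumentException("Unsupported collation: " + collation);
        }
    }

    // One helper per expression keeps all similar tests uniform and readable.
    static void assertContains(String l, String r, String collation, boolean expected) {
        boolean actual = contains(l, r, collation);
        if (actual != expected) {
            throw new AssertionError(
                "contains(\"" + l + "\", \"" + r + "\", " + collation + ") = " + actual);
        }
    }

    static void runAll() {
        // Empty strings
        assertContains("", "", "BINARY", true);
        assertContains("abc", "", "LOWERCASE", true);
        assertContains("", "abc", "BINARY", false);
        // Uppercase and lowercase mix
        assertContains("abcde", "BCD", "BINARY", false);
        assertContains("abcde", "BCD", "LOWERCASE", true);
        assertContains("AbCdE", "bcd", "LOWERCASE", true);
        // Characters that take more than one byte in UTF-8 (a-umlaut, c-acute, Greek delta)
        assertContains("\u00e4b\u0107\u03b4e", "\u0106\u0394", "BINARY", false);
        assertContains("\u00e4b\u0107\u03b4e", "\u0106\u0394", "LOWERCASE", true);
        // Unsupported collation handling
        try {
            contains("a", "a", "NO_SUCH_COLLATION");
            throw new AssertionError("expected IllegalArgumentException");
        } catch (IllegalArgumentException expected) {
            // expected: unsupported collations must be rejected explicitly
        }
    }

    public static void main(String[] args) {
        runAll();
        System.out.println("all checks passed");
    }
}
```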

 

// 3. Closing notes
 * Carefully think about performance implications of newly added custom 
collation-aware expression implementation
 * for example: be very careful with extra string allocations (UTF8Strings -> 
(Java) String -> UTF8Strings, etc.)
 * also note that: some operations introduce very heavy performance penalties 
(we should avoid the ones we can)
 * Make sure to test all newly added expressions completely (unit tests, 
end-to-end tests, etc.)
 * for example: consider edge cases, such as: empty strings, uppercase and 
lowercase mix, different byte-length chars, etc.
 * also note that: all similar tests should be uniform & readable and be kept 
in one place for various expressions
 * Consider how new expressions interact with the rest of the system (casting; 
collation support level - use correct AbstractStringType, etc.)
 * for example: we should watch out for casting, test it thoroughly, and use 
CollationTypeCasts if needed for new expressions
 * also note that: some expressions won’t support all collation types, so that 
should be clearly reflected in tests and dataTypes
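
The allocation concern in the closing notes can be made concrete with a small sketch. It works on raw UTF-8 byte arrays rather than on UTF8String, and the class and method names (LowercaseEqualsSketch, equalsIgnoreCase, utf8) are hypothetical. The idea shown is one possible approach, not the approach taken in Spark: handle pure-ASCII input directly on the bytes, and fall back to decoding into java.lang.String only when non-ASCII bytes are present.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch: case-insensitive equality on UTF-8 bytes with an ASCII fast path. */
final class LowercaseEqualsSketch {

    private LowercaseEqualsSketch() {}

    static boolean equalsIgnoreCase(byte[] l, byte[] r) {
        if (isAscii(l) && isAscii(r)) {
            // Fast path: compares byte by byte and allocates nothing.
            if (l.length != r.length) {
                return false;
            }
            for (int i = 0; i < l.length; i++) {
                if (toLowerAscii(l[i]) != toLowerAscii(r[i])) {
                    return false;
                }
            }
            return true;
        }
        // Slow path: decoding and lowercasing allocate new strings on every call.
        String ls = new String(l, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        String rs = new String(r, StandardCharsets.UTF_8).toLowerCase(Locale.ROOT);
        return ls.equals(rs);
    }

    // In UTF-8, every byte of a non-ASCII character has its high bit set (negative as a Java byte).
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    static byte toLowerAscii(byte b) {
        return (b >= 'A' && b <= 'Z') ? (byte) (b + 32) : b;
    }

    // Test helper: encode a Java string as UTF-8 bytes.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Edge cases worth covering for such a helper follow the list above: empty input, mixed case, and characters that encode to 1, 2, 3 and 4 bytes, so that both the fast path and the fallback are exercised.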

> refactor UTF8String and CollationFactory
> ----------------------------------------
>
>                 Key: SPARK-47410
>                 URL: https://issues.apache.org/jira/browse/SPARK-47410
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Uroš Bojanić
>            Priority: Major
>              Labels: pull-request-available


