[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uroš Bojanić updated SPARK-47410: --------------------------------- Description: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase. To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow: // 1. Collation-aware expression implementation CollationSupport.java * should serve as a static entry point for collation-aware expressions, providing custom support * for example: one by one Spark expression with corresponding collation support * also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods CollationFactory.java * should continue to serve as a static provider for high-level collation interface * for example: interacting with external ICU components such as Collator, StringSearch, etc. * also note that: no low-level / expression-specific code should generally be found here UTF8String.java * should be largely collation-unaware, and generally be used only as storage, nothing else * for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare) * also note that: no collation-aware operation implementations (using collationId) should be put here stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala) * should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport * for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc. * also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId // 2. Collation-aware expression testing CollationSuite.scala * should be used for testing more general collation concepts * for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc. * also note that: no extra tests should generally be added CollationSupportSuite.java * should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases * for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU) * also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala * should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour * for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc. * also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala // 3. Closing notes * Carefully think about performance implications of newly added custom collation-aware expression implementation * for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.) * also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can) * Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.) * for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc. * also note that: all similar tests should be uniform & readable and be kept in one place for various expressions * Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.) * for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions * also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes > refactor UTF8String and CollationFactory > ---------------------------------------- > > Key: SPARK-47410 > URL: https://issues.apache.org/jira/browse/SPARK-47410 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 4.0.0 > Reporter: Uroš Bojanić > Priority: Major > Labels: pull-request-available > > This ticket addresses the need to refactor the {{UTF8String}} and > {{CollationFactory}} classes within Spark to enhance support for > collation-aware expressions. The goal is to improve code structure, > maintainability, readability, and testing coverage for collation-aware Spark > SQL expressions. > The changed introduced herein should simplify addition of new collation-aware > operations and ensure consistent testing across the codebase. > > To further support the addition of collation support for new Spark > expressions, here are a couple of guidelines to follow: > > // 1. Collation-aware expression implementation > CollationSupport.java > * should serve as a static entry point for collation-aware expressions, > providing custom support > * for example: one by one Spark expression with corresponding collation > support > * also note that: CollationAwareUTF8String should be used for > collation-aware UTF8String operations & other utility methods > CollationFactory.java > * should continue to serve as a static provider for high-level collation > interface > * for example: interacting with external ICU components such as Collator, > StringSearch, etc. > * also note that: no low-level / expression-specific code should generally > be found here > UTF8String.java > * should be largely collation-unaware, and generally be used only as > storage, nothing else > * for example: don’t change this class at all (with the only one-time > exception of: semanticEquals/Compare) > * also note that: no collation-aware operation implementations (using > collationId) should be put here > stringExpressions.scala / regexpExpressions.scala / other > “sql.catalyst.expressions” (for example: Between.scala) > * should only contain minimal changes in order to re-route collation-aware > implementations to CollationSupport > * for example: most changes should be in relation to: adding collationId, > using correct data types, replacements, etc. > * also note that: nullSafeEval & doGenCode should likely note contain > introduce extra branching based on collationId > > // 2. Collation-aware expression testing > CollationSuite.scala > * should be used for testing more general collation concepts > * for example: collate/collation expressions, collation names, DDL, casting, > aggregate, shuffle, join, etc. > * also note that: no extra tests should generally be added > CollationSupportSuite.java > * should be used for expression unit tests, these tests should be as > rigorous as possible in order to cover various cases > * for example: unit tests that test collation-aware expression > implementation for various collations (binary, lowercase, ICU) > * also note that: these tests should generally be written after adding > appropriate expression support in CollationSupport.java > CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala > / CollationExpressionSuite.scala > * should be used for expression end-to-end tests, these tests should only > cover crucial expression behaviour > * for example: SQL tests that verify query execution results, expected > return data types, casting, unsupported collation handling, etc. > * also note that: these tests should generally be written after enabling > appropriate expression support in stringExpressions.scala > > // 3. Closing notes > * Carefully think about performance implications of newly added custom > collation-aware expression implementation > * for example: be very careful with extra string allocations (UTF8Strings -> > (Java) String -> UTF8Strings, etc.) > * also note that: some operations introduce very heavy performance penalties > (we should avoid the ones we can) > * Make sure to test all newly added expressions and completely (unit tests, > end-to-end tests, etc.) > * for example: consider edge, such as: empty strings, uppercase and > lowercase mix, different byte-length chars, etc. > * also note that: all similar tests should be uniform & readable and be kept > in one place for various expressions > * Consider how new expressions interact with the rest of the system > (casting; collation support level - use correct AbstractStringType, etc.) > * for example: we should watch out for casting, test it thoroughly, and use > CollationTypeCasts if needed for new expressions > * also note that: some expressions won’t support all collation types, so > that should be clearly reflected in tests and dataTypes -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org