[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-47410: - Description: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase. To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow: // 1. Collation-aware expression implementation CollationSupport.java * should serve as a static entry point for collation-aware expressions, providing custom support * for example: one by one Spark expression with corresponding collation support * also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods CollationFactory.java * should continue to serve as a static provider for high-level collation interface * for example: interacting with external ICU components such as Collator, StringSearch, etc. * also note that: no low-level / expression-specific code should generally be found here UTF8String.java * should be largely collation-unaware, and generally be used only as storage, nothing else * for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare) * also note that: no collation-aware operation implementations (using collationId) should be put here stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala) * should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport * for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc. * also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId // 2. Collation-aware expression testing CollationSuite.scala * should be used for testing more general collation concepts * for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc. * also note that: no extra tests should generally be added for newly supported string expressions CollationSupportSuite.java * should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases * for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU) * also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala * should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour * for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc. * also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala // 3. Closing notes * Carefully think about performance implications of newly added custom collation-aware expression implementation * for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.) * also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can) * Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.) * for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc. * also note that: all similar tests should be uniform & readable and be kept in one place for various expressions * Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.) * for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions * also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes was: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operatio
[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-47410: - Description: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase. To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow: // 1. Collation-aware expression implementation CollationSupport.java * should serve as a static entry point for collation-aware expressions, providing custom support * for example: one by one Spark expression with corresponding collation support * also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods CollationFactory.java * should continue to serve as a static provider for high-level collation interface * for example: interacting with external ICU components such as Collator, StringSearch, etc. * also note that: no low-level / expression-specific code should generally be found here UTF8String.java * should be largely collation-unaware, and generally be used only as storage, nothing else * for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare) * also note that: no collation-aware operation implementations (using collationId) should be put here stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala) * should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport * for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc. * also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId // 2. Collation-aware expression testing CollationSuite.scala * should be used for testing more general collation concepts * for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc. * also note that: no extra tests should generally be added CollationSupportSuite.java * should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases * for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU) * also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala * should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour * for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc. * also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala // 3. Closing notes * Carefully think about performance implications of newly added custom collation-aware expression implementation * for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.) * also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can) * Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.) * for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc. * also note that: all similar tests should be uniform & readable and be kept in one place for various expressions * Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.) * for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions * also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes was: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across
[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-47410: - Description: This ticket addresses the need to refactor the {{UTF8String}} and {{CollationFactory}} classes within Spark to enhance support for collation-aware expressions. The goal is to improve code structure, maintainability, readability, and testing coverage for collation-aware Spark SQL expressions. The changed introduced herein should simplify addition of new collation-aware operations and ensure consistent testing across the codebase. To further support the addition of collation support for new Spark expressions, here are a couple of guidelines to follow: // 1. Collation-aware expression implementation CollationSupport.java * should serve as a static entry point for collation-aware expressions, providing custom support * for example: one by one Spark expression with corresponding collation support * also note that: CollationAwareUTF8String should be used for collation-aware UTF8String operations & other utility methods CollationFactory.java * should continue to serve as a static provider for high-level collation interface * for example: interacting with external ICU components such as Collator, StringSearch, etc. * also note that: no low-level / expression-specific code should generally be found here UTF8String.java * should be largely collation-unaware, and generally be used only as storage, nothing else * for example: don’t change this class at all (with the only one-time exception of: semanticEquals/Compare) * also note that: no collation-aware operation implementations (using collationId) should be put here stringExpressions.scala / regexpExpressions.scala / other “sql.catalyst.expressions” (for example: Between.scala) * should only contain minimal changes in order to re-route collation-aware implementations to CollationSupport * for example: most changes should be in relation to: adding collationId, using correct data types, replacements, etc. * also note that: nullSafeEval & doGenCode should likely note contain introduce extra branching based on collationId // 2. Collation-aware expression testing CollationSuite.scala * should be used for testing more general collation concepts * for example: collate/collation expressions, collation names, DDL, casting, aggregate, shuffle, join, etc. * also note that: no extra tests should generally be added CollationSupportSuite.java * should be used for expression unit tests, these tests should be as rigorous as possible in order to cover various cases * for example: unit tests that test collation-aware expression implementation for various collations (binary, lowercase, ICU) * also note that: these tests should generally be written after adding appropriate expression support in CollationSupport.java CollationStringExpressionsSuite.scala / CollationRegexpExpressionsSuite.scala / CollationExpressionSuite.scala * should be used for expression end-to-end tests, these tests should only cover crucial expression behaviour * for example: SQL tests that verify query execution results, expected return data types, casting, unsupported collation handling, etc. * also note that: these tests should generally be written after enabling appropriate expression support in stringExpressions.scala // 3. Closing notes * Carefully think about performance implications of newly added custom collation-aware expression implementation * for example: be very careful with extra string allocations (UTF8Strings -> (Java) String -> UTF8Strings, etc.) * also note that: some operations introduce very heavy performance penalties (we should avoid the ones we can) * Make sure to test all newly added expressions and completely (unit tests, end-to-end tests, etc.) * for example: consider edge, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc. * also note that: all similar tests should be uniform & readable and be kept in one place for various expressions * Consider how new expressions interact with the rest of the system (casting; collation support level - use correct AbstractStringType, etc.) * for example: we should watch out for casting, test it thoroughly, and use CollationTypeCasts if needed for new expressions * also note that: some expressions won’t support all collation types, so that should be clearly reflected in tests and dataTypes > refactor UTF8String and CollationFactory > > > Key: SPARK-47410 > URL: https://issues.apache.org/jira/browse/SPARK-47410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > > This ticke
[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47410: --- Labels: pull-request-available (was: ) > refactor UTF8String and CollationFactory > > > Key: SPARK-47410 > URL: https://issues.apache.org/jira/browse/SPARK-47410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47410) refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-47410: - Summary: refactor UTF8String and CollationFactory (was: Refactor UTF8String and CollationFactory) > refactor UTF8String and CollationFactory > > > Key: SPARK-47410 > URL: https://issues.apache.org/jira/browse/SPARK-47410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47410) Refactor UTF8String and CollationFactory
[ https://issues.apache.org/jira/browse/SPARK-47410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uroš Bojanić updated SPARK-47410: - Summary: Refactor UTF8String and CollationFactory (was: StringTrimLeft, StringTrimRight (all collations)) > Refactor UTF8String and CollationFactory > > > Key: SPARK-47410 > URL: https://issues.apache.org/jira/browse/SPARK-47410 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org