andygrove opened a new pull request, #4461: URL: https://github.com/apache/datafusion-comet/pull/4461
## Which issue does this PR close? <!-- No dedicated tracking issue; this is one of the rolling per-category audits. --> Closes #. ## Rationale for this change Following the same pattern as #4436 (`any`), #4437 (`bit_and`), and the earlier date/time audit, this PR audits every supported string expression in Comet against Spark 3.4.3, 3.5.8, and 4.0.1, records the findings inline in the support doc, and applies the support-level consistency fixes the audit surfaced. ## What changes are included in this PR? Add per-version audit sub-bullets to all 37 supported string SQL function names in `docs/source/contributor-guide/spark_expressions_support.md`. The notes record: - whether the Spark class is identical across 3.4.3 / 3.5.8 / 4.0.1 - Spark 4.0 changes (collation widening via `StringTypeWithCollation`; `CollationSupport` routing for `Upper`/`Lower`/`InitCap` and the `startswith`/`endswith`/`contains`/`instr`/`replace`/`translate`/`trim` family; `NullIntolerant` -> `nullIntolerant: Boolean` refactor; `BinaryPad` rewrite for `lpad`/`rpad`) - known divergences between Comet and Spark (`initcap` hyphen handling; `replace` with empty search; `repeat` with negative count; `decode` charset and legacy-flag gaps; `BinaryType` wiring gaps in `bit_length` / `octet_length`; `regexp_replace` pos > 1 restriction) Apply the mechanical support-level consistency fixes surfaced by the audit (all in `spark/src/main/scala/org/apache/comet/serde/strings.scala`): - `CometInitCap`: `Incompatible(None)` -> `Incompatible(Some(reason))` via a shared `private val`; drop the dead `convert` override. - `CometLength`: extract the `BinaryType` reason into a shared `private val` so the `getSupportLevel` branch and `getUnsupportedReasons` stay in sync. - `CometRegExpReplace`: pass the existing reasons into `Incompatible(Some(...))` / `Unsupported(Some(...))` via shared `private val`s so EXPLAIN matches the compatibility-guide text. - `CometStringLPad` / `CometStringRPad`: share the two reason strings via a small `PadReasons` companion so `getSupportLevel` and `getUnsupportedReasons` cannot drift; align wording with backticks. - `CometSubstring` / `CometLeft` / `CometRight`: lift the literal-argument fallback out of `convert` and into `getSupportLevel` so the dispatcher handles it uniformly; replace the convert-time `withInfo` with a declared `Unsupported` reason via a shared `private val`. The audit was driven by 7 parallel agents using the `audit-comet-expression` skill, each handling a related group (case conversion + initcap; length family; trim; left/right/substring/substring_index; lpad/rpad; simple scalar funcs; complex serdes including concat_ws/regexp_replace/repeat/split/decode). Higher-risk findings (e.g. lifting `spark.comet.caseConversion.enabled` gating out of `CometCaseConversionBase.convert`, promoting `CometStringRepeat` from `Compatible` to `Incompatible`, adding non-default-collation guards for the simple scalar string funcs) are intentionally left as prose notes in the support doc rather than acted on here, since each carries a semantics decision that warrants its own PR. ## How are these changes tested? - `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite expressions/string/" -Dtest=none` (42 tests pass) - `./mvnw test -Dsuites="org.apache.comet.CometStringExpressionSuite" -Dtest=none` (33 tests pass) - Updated the affected fallback-reason assertions in `left.sql`, `string_lpad.sql`, `string_rpad.sql`, and `CometStringExpressionSuite.scala` to match the new shared reason strings. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
