andygrove opened a new issue, #4465: URL: https://github.com/apache/datafusion-comet/issues/4465
## Describe the bug Spark 4.0 refactored `StringDecode` from a `BinaryExpression` to a `RuntimeReplaceable` whose `replacement` is `StaticInvoke(StringDecode.decode, bin, charset, legacyCharsets, legacyErrorAction)`. The two new boolean arguments control malformed-character handling: with `legacyErrorAction = true`, Spark substitutes replacement characters for invalid UTF-8 sequences (matching the Spark 3.x behavior); with `legacyErrorAction = false` (the default), Spark raises `QueryExecutionErrors.malformedCharacterCoding(...)`. Comet's Spark 4.0 shim (`spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala`) destructures the `StaticInvoke` arguments and discards both flags, then routes through `CommonStringExprs.stringDecode` which always lowers to `Cast(bin, StringType, TRY)`. The Cast TRY path produces NULL on invalid UTF-8 in all cases. That means: - Under Spark 4.0 default mode (`legacyErrorAction = false`): Spark raises, Comet returns NULL. - Under Spark 4.0 legacy mode (`legacyErrorAction = true`): Spark substitutes replacement characters, Comet returns NULL. - Under Spark 3.x: Spark substitutes replacement characters, Comet returns NULL. Surfaced by the string-expressions audit in apache/datafusion-comet#4461. ## Steps to reproduce ```sql SET spark.sql.legacy.javaCharsets = true; SELECT decode(X'FF', 'UTF-8'); ``` Spark 3.x: returns `?` (Unicode replacement). Spark 4.0 (legacy mode): same as 3.x. Spark 4.0 (default mode): raises `MALFORMED_CHARACTER_CODING`. Comet: returns NULL in all three cases. ## Expected behavior Honor `legacyCharsets` / `legacyErrorAction` when running under Spark 4.0+. At minimum, the flags should be propagated through the proto so the native impl can choose between the substitute/throw/null modes. 
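
For reference, the three modes named above (substitute, throw, null) can be reproduced on the JVM with the standard `java.nio.charset` API. The following is only a rough sketch to make the difference concrete. The names `DecodeModes`, `Mode`, and `decode` are hypothetical and are not part of Spark or Comet, and the thrown exception type is a stand-in for Spark's `MALFORMED_CHARACTER_CODING` error.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not Spark's or Comet's implementation.
final class DecodeModes {

    // The three behaviors discussed in this issue.
    enum Mode { SUBSTITUTE, THROW, NULL_ON_ERROR }

    static String decode(byte[] bin, Charset charset, Mode mode) {
        if (mode == Mode.SUBSTITUTE) {
            // The String constructor replaces malformed input with the
            // charset's default replacement string (U+FFFD for UTF-8).
            return new String(bin, charset);
        }
        try {
            // A decoder configured with REPORT fails on malformed input
            // instead of substituting.
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bin))
                    .toString();
        } catch (CharacterCodingException e) {
            if (mode == Mode.NULL_ON_ERROR) {
                // Analogous to what the Cast TRY path yields today.
                return null;
            }
            // Stand-in for the error Spark 4.0 raises in default mode.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xFF}; // same input as X'FF' in the repro
        for (Mode mode : Mode.values()) {
            try {
                System.out.println(mode + " -> " + decode(invalid, StandardCharsets.UTF_8, mode));
            } catch (UncheckedIOException e) {
                System.out.println(mode + " -> raised " + e.getCause().getClass().getSimpleName());
            }
        }
    }
}
```

In terms of this sketch, Comet currently behaves like `NULL_ON_ERROR` regardless of the flags, whereas Spark 3.x and Spark 4.0 legacy mode correspond to `SUBSTITUTE` and Spark 4.0 default mode corresponds to `THROW`.
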
## Additional context

- Shim location: `spark/src/main/spark-4.0/org/apache/comet/shims/CometExprShim.scala` (and `spark-4.1`, `spark-4.2`)
- Helper: `CommonStringExprs.stringDecode` in `spark/src/main/scala/org/apache/comet/serde/strings.scala`
- Related: #4465 (decode not surfaced in compatibility docs)
