andygrove opened a new issue, #4463:
URL: https://github.com/apache/datafusion-comet/issues/4463
## Describe the bug
`translate` is wired as `CometScalarFunction("translate")` and currently
reports `Compatible`, but DataFusion's `translate` diverges from Spark in two
ways:
1. **Grapheme vs code-point semantics.** DataFusion iterates over Unicode
graphemes; Spark uses code points (via `Character.charCount`). For
graphemes that consist of a single code point (BMP or supplementary) these
match, but for multi-code-point graphemes (combining marks, ZWJ sequences,
regional-indicator flag emoji) the two implementations disagree.
2. **NUL character in the `to` argument.** Spark's `StringTranslate.buildDict`
treats any `from` character whose counterpart in `to` is U+0000 as a
deletion. DataFusion substitutes U+0000 instead.
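
For reference, the Spark-side semantics described in the two points above can be modelled in a few lines of Python. This is a minimal sketch, not Spark's or Comet's code: the names `build_dict` and `translate_code_points` are hypothetical, and two details (the first occurrence of a duplicated `from` character wins, and `from` characters beyond the end of `to` are deleted) are assumptions about Spark's behavior that this issue does not itself state.

```python
def build_dict(matching: str, replace: str) -> dict:
    """Hypothetical model: map each code point in `matching` to its replacement."""
    mapping = {}
    for i, ch in enumerate(matching):  # Python iterates str by code point
        # Assumption: past the end of `replace`, U+0000 marks a deletion.
        rep = replace[i] if i < len(replace) else "\x00"
        # Assumption: the first occurrence of a duplicated key wins.
        mapping.setdefault(ch, rep)
    return mapping


def translate_code_points(src: str, matching: str, replace: str) -> str:
    """Translate `src` one code point at a time; U+0000 as a target deletes."""
    mapping = build_dict(matching, replace)
    out = []
    for ch in src:
        if ch not in mapping:
            out.append(ch)            # unmapped code points pass through
        elif mapping[ch] != "\x00":
            out.append(mapping[ch])   # ordinary substitution
        # else: mapped to U+0000, so the code point is dropped
    return "".join(out)
```

Under this model, `translate_code_points("abc", "b", "\x00")` yields `"ac"`, and translating `"e\u0301"` with `from = "e"`, `to = "a"` replaces only the base letter and keeps the combining mark.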
Surfaced by the string-expressions audit in apache/datafusion-comet#4461.
## Steps to reproduce
```sql
-- (1) grapheme vs code point: combining mark
SELECT translate(concat('e', char(0x0301)), 'e', 'a');
-- (2) U+0000 deletion: expected to delete 'b' under Spark
SELECT translate('abc', 'b', char(0));
```
In the first query, Spark translates per code point while Comet translates per
grapheme, so the results diverge on the combining-mark input. In the second
query, Spark deletes the matched character; Comet substitutes a NUL character.
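
To illustrate why the combining-mark input in query (1) is sensitive to the unit of iteration, the sketch below compares a per-code-point replacement with a crude cluster-based one. It is only an approximation: `crude_clusters` is a hypothetical helper that attaches combining marks to the preceding base character and is not a full UAX #29 grapheme segmenter, and `translate_by_cluster` models grapheme-keyed matching in general rather than DataFusion's exact implementation.

```python
import unicodedata


def crude_clusters(s: str) -> list:
    """Hypothetical helper: group each base character with trailing combining marks."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


def translate_by_cluster(src: str, matching: str, replace: str) -> str:
    """Model of grapheme-keyed translation: whole clusters are the lookup keys."""
    keys = crude_clusters(matching)
    vals = crude_clusters(replace)
    mapping = {}
    for i, key in enumerate(keys):
        mapping.setdefault(key, vals[i] if i < len(vals) else "")
    return "".join(mapping.get(c, c) for c in crude_clusters(src))


src = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT, as in query (1)

# Per code point: the base letter 'e' matches and is replaced.
per_code_point = "".join("a" if ch == "e" else ch for ch in src)

# Per cluster: the key 'e' does not equal the cluster 'e' + U+0301.
per_cluster = translate_by_cluster(src, "e", "a")
```

Here `per_code_point` is `"a\u0301"` while `per_cluster` leaves the input unchanged, which is the kind of disagreement the first query is meant to expose.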
## Expected behavior
Match Spark behavior, or downgrade `translate` to `Incompatible(Some(...))`
so the non-ASCII path falls back unless explicitly enabled.
## Additional context
- Comet wiring: `QueryPlanSerde.scala` -> `classOf[StringTranslate] ->
CometScalarFunction("translate")`
- Spark reference: `UTF8String.translate(dict)` with
`StringTranslate.buildDict`
- DataFusion impl: `datafusion-functions::unicode::translate`