andygrove opened a new issue, #4121:
URL: https://github.com/apache/datafusion-comet/issues/4121
## Describe the bug
Spark allows storing arbitrary byte sequences in `STRING` columns, including
bytes that are not valid UTF-8 (for example via `CAST(X'C1' AS STRING)` and
`CAST(X'80' AS STRING)`). Comet's native Parquet scan rejects these rows with:
```
org.apache.comet.CometNativeException
Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data
```
This surfaces in Spark 4.1.1's `hll.sql` (newly added in 4.1) at query #10:
```sql
SELECT hll_sketch_estimate(hll_sketch_agg(s)) utf8_b FROM hll_string_test;
```
where `hll_string_test` is populated with `INSERT INTO hll_string_test
VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ...` to
exercise different collations, including values containing invalid Unicode bytes.
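The failing inputs are single bytes that no valid UTF-8 sequence can contain, which is why Arrow's UTF-8 validation rejects them while Spark, which treats `STRING` values as raw byte sequences, does not. A quick stdlib check (an illustration only, not Comet or Arrow code) shows that both bytes are invalid on their own:

```python
# 0xC1 is never a legal UTF-8 lead byte (it would encode an overlong
# two-byte sequence), and 0x80 is a continuation byte with no lead byte.
for raw in (b"\xc1", b"\x80"):
    try:
        raw.decode("utf-8")
        print(raw, "-> valid UTF-8")
    except UnicodeDecodeError:
        print(raw, "-> invalid UTF-8")
```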
## Steps to reproduce
Run Spark 4.1.1's SQL test suite with Comet enabled (the `Spark SQL Tests`
matrix entry for 4.1.1). The test `hll.sql` in `SQLQueryTestSuite` fails.
## Expected behavior
Comet should either (a) accept invalid-UTF-8 bytes in STRING columns the way
Spark does, or (b) fall back to Spark when reading STRING columns whose Parquet
data contains invalid UTF-8.
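For option (a), the bytes would need to survive the scan unmodified even though they are not valid UTF-8. Python's `surrogateescape` error handler demonstrates one lossless round-trip strategy for such bytes (purely a sketch of the requirement, not what Comet or Arrow would actually do; on the Arrow side this would more plausibly mean reading the column as binary rather than as a validated string type):

```python
raw = b"\xc1"
# Map the invalid byte to a lone surrogate so the value can be carried
# through string-typed machinery without raising a decode error...
s = raw.decode("utf-8", errors="surrogateescape")
# ...then recover the original byte sequence exactly on the way back out.
assert s.encode("utf-8", errors="surrogateescape") == raw
print("round-trip preserved:", raw)
```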
## Workaround
`hll.sql` is currently disabled when Comet is enabled via `--SET
spark.comet.enabled = false` at the top of the file in `dev/diffs/4.1.1.diff`
(introduced as part of the Spark 4.1.1 SQL test matrix landing).
## Additional context
PR #4093 enables Spark 4.1.1 in the `Spark SQL Tests` workflow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]