andygrove opened a new issue, #4121:
URL: https://github.com/apache/datafusion-comet/issues/4121
## Describe the bug
Spark allows storing arbitrary byte sequences in `STRING` columns, including
bytes that are not valid UTF-8 (for example via `CAST(X'C1' AS STRING)` and
`CAST(X'80' AS STRING)`). Comet's native Parquet scan rejects these rows with:
```
org.apache.comet.CometNativeException
Arrow: Parquet argument error: Parquet error: encountered non UTF-8 data
```
This surfaces in Spark 4.1.1's `hll.sql` (newly added in 4.1) at query #10:
```sql
SELECT hll_sketch_estimate(hll_sketch_agg(s)) utf8_b FROM hll_string_test;
```
where `hll_string_test` is populated with `INSERT INTO hll_string_test
VALUES (''), (' '), (CAST(X'C1' AS STRING)), (CAST(X'80' AS STRING)), ...` to
exercise different collations, including values containing invalid Unicode bytes.
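The failing inputs are single bytes that no valid UTF-8 sequence can contain, which is why Arrow's UTF-8 validation rejects them while Spark, which treats `STRING` values as raw byte sequences, does not. A quick stdlib check (an illustration only, not Comet or Arrow code) shows that both bytes are invalid on their own:

```python
# 0xC1 is never a legal UTF-8 lead byte (it would encode an overlong
# two-byte sequence), and 0x80 is a continuation byte with no lead byte.
for raw in (b"\xc1", b"\x80"):
    try:
        raw.decode("utf-8")
        print(raw, "-> valid UTF-8")
    except UnicodeDecodeError:
        print(raw, "-> invalid UTF-8")
```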
## Steps to reproduce
Run Spark 4.1.1's SQL test suite with Comet enabled (the `Spark SQL Tests`
matrix entry for 4.1.1). The test `hll.sql` in `SQLQueryTestSuite` fails.
## Expected behavior
Comet should either (a) accept invalid-UTF-8 bytes in STRING columns the way
Spark does, or (b) fall back to Spark when reading STRING columns whose Parquet
data contains invalid UTF-8.
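For option (a), the bytes would need to survive the scan unmodified even though they are not valid UTF-8. Python's `surrogateescape` error handler demonstrates one lossless round-trip strategy for such bytes (purely a sketch of the requirement, not what Comet or Arrow would actually do; on the Arrow side this would more plausibly mean reading the column as binary rather than as a validated string type):

```python
raw = b"\xc1"
# Map the invalid byte to a lone surrogate so the value can be carried
# through string-typed machinery without raising a decode error...
s = raw.decode("utf-8", errors="surrogateescape")
# ...then recover the original byte sequence exactly on the way back out.
assert s.encode("utf-8", errors="surrogateescape") == raw
print("round-trip preserved:", raw)
```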
## Workaround
`hll.sql` is currently disabled when Comet is enabled via `--SET
spark.comet.enabled = false` at the top of the file in `dev/diffs/4.1.1.diff`
(introduced as part of the Spark 4.1.1 SQL test matrix landing).
## Additional context
PR #4093 enables Spark 4.1.1 in the `Spark SQL Tests` workflow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]