andygrove opened a new pull request, #3238: URL: https://github.com/apache/datafusion-comet/pull/3238
## Summary

- Renames `spark.comet.scan.allowIncompatible` to `spark.comet.scan.unsignedSmallIntSafetyCheck`
- Changes default from `false` to `true` (safety check enabled by default)
- Removes `ByteType` from the safety check, leaving only `ShortType` subject to fallback

## Why ByteType is Safe

`ByteType` columns are **always safe** for native execution because:

1. **Parquet type mapping**: Spark's `ByteType` can only originate from signed `INT8` in Parquet. There is no unsigned 8-bit Parquet type (`UINT_8`) that maps to `ByteType`.
2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned `UINT_8` columns, Spark maps them to `ShortType` (16-bit), not `ByteType`. This is because `UINT_8` values (0-255) exceed the signed byte range (-128 to 127).
3. **Truncation preserves signed values**: When storing signed `INT8` in 8 bits, the truncation from any wider representation preserves the correct signed value due to two's complement representation.

## Why ShortType Needs the Safety Check

`ShortType` columns may be problematic because:

1. **Ambiguous origin**: `ShortType` can come from either signed `INT16` (safe) or unsigned `UINT_8` (potentially incompatible).
2. **Different reader behavior**: Arrow-based readers like DataFusion respect the unsigned `UINT_8` logical type and read data as unsigned, while Spark ignores the logical type and reads as signed. This can produce different results for values 128-255.
3. **No metadata available**: At scan time, Comet cannot determine whether a `ShortType` column originated from `INT16` or `UINT_8`, so the safety check conservatively falls back to Spark for all `ShortType` columns.

Users who know their data does not contain unsigned `UINT_8` columns can disable the safety check with `spark.comet.scan.unsignedSmallIntSafetyCheck=false`.
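The 128-255 discrepancy and the truncation argument can be illustrated with a small, self-contained Rust sketch. It is not Comet or DataFusion code: the function names are hypothetical, and it models only the bit-level interpretation described above (one stored byte read as signed versus unsigned), not the actual Parquet readers.

```rust
/// Hypothetical model of a reader that ignores the unsigned annotation:
/// the stored 8-bit pattern is reinterpreted as signed, then widened to 16 bits.
fn read_as_signed(raw: u8) -> i16 {
    (raw as i8) as i16
}

/// Hypothetical model of a reader that honors the unsigned annotation:
/// the stored 8-bit pattern is zero-extended to 16 bits.
fn read_as_unsigned(raw: u8) -> i16 {
    raw as i16
}

/// Truncate a wider signed value to 8 bits (two's complement).
fn truncate_to_i8(wide: i32) -> i8 {
    wide as i8
}

/// Count how many of the 256 possible byte patterns the two readers disagree on.
fn count_mismatches() -> usize {
    (0u8..=255)
        .filter(|&raw| read_as_signed(raw) != read_as_unsigned(raw))
        .count()
}

fn main() {
    // Every value in the signed byte range survives truncation from a wider
    // representation, which is the ByteType argument.
    for v in i8::MIN..=i8::MAX {
        assert_eq!(truncate_to_i8(v as i32), v);
    }

    // The stored byte 200 is read as 200 when unsigned, but as -56 when signed.
    println!(
        "raw=200: unsigned={}, signed={}",
        read_as_unsigned(200),
        read_as_signed(200)
    );

    // The two interpretations agree on 0-127 and disagree on 128-255.
    println!("mismatching byte patterns: {}", count_mismatches());
}
```

Under this model the two interpretations agree for every value in 0-127 and differ for every value in 128-255, which matches the range the description flags for `UINT_8` data surfaced as `ShortType`.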
## Test plan

- [x] Ran `CometExpressionSuite` - all 125 tests pass (123 succeeded, 2 canceled, 3 ignored)
- [ ] CI validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
