andygrove opened a new pull request, #3238:
URL: https://github.com/apache/datafusion-comet/pull/3238

   ## Summary
   
   - Renames `spark.comet.scan.allowIncompatible` to 
`spark.comet.scan.unsignedSmallIntSafetyCheck`
   - Changes default from `false` to `true` (safety check enabled by default)
   - Removes `ByteType` from the safety check, leaving only `ShortType` subject 
to fallback
   
   ## Why ByteType is Safe
   
   `ByteType` columns are **always safe** for native execution because:
   
   1. **Parquet type mapping**: Spark's `ByteType` can only originate from 
signed `INT8` in Parquet. The unsigned 8-bit Parquet type (`UINT_8`) never 
maps to `ByteType`.
   
   2. **UINT_8 maps to ShortType**: When Parquet files contain unsigned 
`UINT_8` columns, Spark maps them to `ShortType` (16-bit), not `ByteType`. This 
is because `UINT_8` values (0-255) exceed the signed byte range (-128 to 127).
   
   3. **Truncation preserves signed values**: Truncating any wider 
two's complement representation to 8 bits preserves the value of a signed 
`INT8`, because the low 8 bits of the sign-extended value are exactly its 
8-bit encoding.
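
The truncation argument can be checked exhaustively. The following is a minimal Python sketch, not Comet code: the helper names `sign_extend_to_32_bits` and `truncate_to_int8` are hypothetical, and a 32-bit intermediate width is assumed purely for illustration.

```python
def sign_extend_to_32_bits(value: int) -> int:
    """Return the 32-bit two's complement bit pattern of a signed 8-bit value."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in a signed 8-bit integer")
    return value & 0xFFFFFFFF


def truncate_to_int8(bits: int) -> int:
    """Keep the low 8 bits and reinterpret them as a signed 8-bit value."""
    low = bits & 0xFF
    return low - 0x100 if low >= 0x80 else low


# Round-trip every possible signed 8-bit value through the wider representation.
round_trip_ok = all(
    truncate_to_int8(sign_extend_to_32_bits(v)) == v for v in range(-128, 128)
)
```

Under these assumptions `round_trip_ok` is `True`: no signed 8-bit value is altered by widening and truncating back.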
   
   ## Why ShortType Needs the Safety Check
   
   `ShortType` columns may be problematic because:
   
   1. **Ambiguous origin**: `ShortType` can come from either signed `INT16` 
(safe) or unsigned `UINT_8` (potentially incompatible).
   
   2. **Different reader behavior**: Arrow-based readers like DataFusion 
respect the unsigned `UINT_8` logical type and read data as unsigned, while 
Spark ignores the logical type and reads as signed. This can produce different 
results for values 128-255.
   
   3. **No metadata available**: At scan time, Comet cannot determine whether a 
`ShortType` column originated from `INT16` or `UINT_8`, so the safety check 
conservatively falls back to Spark for all `ShortType` columns.
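
To make the 128-255 discrepancy concrete, here is a small Python sketch of how one 8-bit pattern yields different numbers under signed and unsigned interpretation. It illustrates only the general interpretation difference; it does not model the physical Parquet encoding or the actual code path of either reader, and the function names are hypothetical.

```python
import struct


def read_as_unsigned(raw: bytes) -> int:
    """Interpret a single byte as an unsigned 8-bit integer (0 to 255)."""
    return struct.unpack("<B", raw)[0]


def read_as_signed(raw: bytes) -> int:
    """Interpret the same byte as a signed 8-bit integer (-128 to 127)."""
    return struct.unpack("<b", raw)[0]


raw = bytes([200])  # bit pattern 0xC8
unsigned_value = read_as_unsigned(raw)
signed_value = read_as_signed(raw)

# The two interpretations agree only while the high bit is clear (0 to 127).
disagreeing = [
    v for v in range(256)
    if read_as_unsigned(bytes([v])) != read_as_signed(bytes([v]))
]
```

In this sketch `disagreeing` covers exactly the range 128 to 255, which matches the range the description flags as potentially incompatible.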
   
   Users who know their data does not contain unsigned `UINT_8` columns can 
disable the safety check with 
`spark.comet.scan.unsignedSmallIntSafetyCheck=false`.
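
The resulting fallback rule can be summarized as a small decision function. This is an illustrative Python sketch of the behavior described in this PR, not the actual Comet implementation; `should_fall_back_to_spark` and the string type names are hypothetical.

```python
def should_fall_back_to_spark(
    spark_type: str, unsigned_small_int_safety_check: bool = True
) -> bool:
    """Decide whether a column of the given Spark type triggers fallback.

    Mirrors the description: only ShortType is subject to the check, and the
    check is enabled by default.
    """
    return unsigned_small_int_safety_check and spark_type == "ShortType"


# Default configuration: ShortType falls back, ByteType does not.
default_short = should_fall_back_to_spark("ShortType")
default_byte = should_fall_back_to_spark("ByteType")

# With spark.comet.scan.unsignedSmallIntSafetyCheck=false, nothing falls back
# on account of this check.
disabled_short = should_fall_back_to_spark(
    "ShortType", unsigned_small_int_safety_check=False
)
```

Here `default_short` is `True`, while `default_byte` and `disabled_short` are `False`.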
   
   ## Test plan
   
   - [x] Ran `CometExpressionSuite`: no failures among the 125 tests run (123 
succeeded, 2 canceled); 3 ignored
   - [ ] CI validation
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

