mbutrovich opened a new issue, #4130:
URL: https://github.com/apache/datafusion-comet/issues/4130

   ## Describe the proposed change
   
   `supportedRangePartitioningDataType` in
   `spark/src/main/scala/org/apache/spark/sql/comet/execution/shuffle/CometShuffleExchangeExec.scala:344`
   admits `FloatType` and `DoubleType` unconditionally. It does not consult
   `spark.comet.exec.strictFloatingPoint`.
   
   Other ordering-dependent expressions already do:
   
   - `CometSortOrder` in `spark/src/main/scala/org/apache/comet/serde/CometSortOrder.scala:34`
   - `CometSortArray` in `spark/src/main/scala/org/apache/comet/serde/arrays.scala:150`
   
   Both return `Incompatible` when the ordered type contains Float/Double and
   `strictFloatingPoint=true`. `RangePartitioning` should follow the same pattern.
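
   A minimal, self-contained sketch of that pattern follows. It is not Comet's
   code: `SimpleType`, `Support`, `RangePartitioningCheck`, `containsFloatingPoint`
   and `supportLevel` are hypothetical stand-ins for Spark's data types and
   Comet's support levels, used only to show the shape of the check.

   ```scala
   // Hypothetical stand-ins for Spark's DataType hierarchy (not the real classes).
   sealed trait SimpleType
   case object IntT extends SimpleType
   case object StringT extends SimpleType
   case object FloatT extends SimpleType
   case object DoubleT extends SimpleType
   final case class ArrayT(element: SimpleType) extends SimpleType
   final case class StructT(fields: Seq[SimpleType]) extends SimpleType

   // Hypothetical stand-ins for the support levels named in this issue.
   sealed trait Support
   case object Compatible extends Support
   final case class Incompatible(reason: String) extends Support

   object RangePartitioningCheck {
     // True when the type is, or nests, a floating-point type.
     def containsFloatingPoint(t: SimpleType): Boolean = t match {
       case FloatT | DoubleT => true
       case ArrayT(element)  => containsFloatingPoint(element)
       case StructT(fields)  => fields.exists(containsFloatingPoint)
       case _                => false
     }

     // Report Incompatible only when strict mode is on and a key type
     // contains Float/Double; otherwise the key types are accepted.
     def supportLevel(keyTypes: Seq[SimpleType], strictFloatingPoint: Boolean): Support =
       if (strictFloatingPoint && keyTypes.exists(containsFloatingPoint))
         Incompatible("range partitioning on floating-point keys may order NaN and -0.0 differently")
       else
         Compatible
   }
   ```

   With this shape, `supportLevel(Seq(IntT, DoubleT), strictFloatingPoint = true)`
   yields `Incompatible`, while the same keys with `strictFloatingPoint = false`
   stay `Compatible`.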
   
   ## Rationale
   
   Range partitioning samples rows, sorts the samples, picks split points, then
   buckets rows by those split points. Arrow's float ordering differs from Spark's
   on NaN and `-0.0` vs `0.0`:
   
   - Spark's double ordering: NaN sorts largest, `-0.0 == 0.0`.
   - Arrow's `sort_to_indices`: `-0.0 < 0.0`, NaN at extremes.
   
   Split points chosen by Comet can therefore differ from those chosen by Spark,
   and rows containing NaN or `-0.0` can land in different partitions under
   Comet vs Spark. Users who care about strict Spark parity already set
   `strictFloatingPoint=true`, expecting Comet to fall back on ordering operations
   that are not bit-for-bit compatible. `RangePartitioning` currently ignores
   that contract.
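
   The effect on bucketing can be illustrated with a small sketch. It assumes
   two simplified orderings: `equalZeros` models the Spark-side behaviour
   described above (`-0.0 == 0.0`, NaN largest), and `totalOrder` uses
   `java.lang.Double.compare` as a stand-in for an ordering where `-0.0 < 0.0`.
   `RangeBucketDemo` and `partitionOf` are hypothetical names, and neither
   ordering is taken from Spark's or Arrow's actual implementation.

   ```scala
   object RangeBucketDemo {
     // Treats -0.0 and 0.0 as equal; NaN compares greater than every other value.
     val equalZeros: Ordering[Double] = new Ordering[Double] {
       def compare(x: Double, y: Double): Int =
         if (x == y) 0 else java.lang.Double.compare(x, y)
     }

     // java.lang.Double.compare: -0.0 sorts strictly before 0.0, NaN sorts last.
     val totalOrder: Ordering[Double] = new Ordering[Double] {
       def compare(x: Double, y: Double): Int = java.lang.Double.compare(x, y)
     }

     // Partition index = position of the first split point that is >= value.
     // Values greater than every split point go to the last partition.
     def partitionOf(value: Double, bounds: Seq[Double], ord: Ordering[Double]): Int = {
       val idx = bounds.indexWhere(b => ord.lteq(value, b))
       if (idx < 0) bounds.length else idx
     }
   }
   ```

   With split points `Seq(-0.0, 10.0)`, the row value `0.0` falls in partition 0
   under `equalZeros` but in partition 1 under `totalOrder`, which is the kind of
   divergence this issue is about.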
   
   ## Additional context
   
   Found while reviewing #4076 (MapSort support on Spark 4.0), which has the same
   gap for its own expression. That PR addresses it locally for `MapSort`. This
   issue tracks the equivalent fix for `RangePartitioning`.
   

