andygrove commented on issue #3362:
URL: 
https://github.com/apache/datafusion-comet/issues/3362#issuecomment-4846199823

   ## Analysis: how this applies to Comet
   
   Worth noting that Comet's situation is the inverse of the DataFusion EPIC 
(apache/datafusion#19144), which makes the fix here cheaper.
   
   DataFusion has to reimplement Spark's per-expression nullability rules in 
Rust (via `return_field_from_args`) because it has no Spark to ask. Comet does 
have Spark: `expr.nullable` is already computed correctly on the JVM side, 
including `nullIntolerant` propagation, before serialization. So the natural 
fix for Comet is to propagate that value through the proto rather than 
re-deriving it natively.
   
   ### Current state
   
   - Most native expressions conservatively report `nullable() -> true`.
   - The only expression that propagates Spark's real nullability today is 
`JvmScalarUdf`, via the `return_nullable` field on its proto message 
(`expr.proto`), consumed in `jvm_udf/mod.rs`.
   - A version-agnostic `isNullIntolerant` shim already exists 
(`CometExprTraitShim`, handling the 3.x trait and the 4.x boolean method), but 
it is currently only used by the codegen dispatcher's null short-circuit, not 
for reporting native output nullability.
   
   ### Why it matters
   
   The risk is directional:
   
   - Over-reporting (`nullable=true` when actually non-null) is safe. It only 
misses optimizations. This is Comet's current behavior.
   - Under-reporting (`nullable=false`, then producing a null) is a hard 
failure: Arrow `RecordBatch` validation throws "declared as non-nullable but 
contains null values" downstream in shuffle/sort.
   
   That second failure mode has already been fixed one expression at a time, 
for example #4523 (GetStructField), #4533 (CreateArray), #4237 (array_except), 
and #4554 (codegen dispatcher honoring `ev.isNull` for nullable NullIntolerant 
roots). A systematic mechanism would retire this recurring class of bugs.
   
   ### Suggested approach
   
   1. Generalize the `JvmScalarUdf` pattern into a `return_nullable` channel 
for general expressions (cleanest as a generic wrapper on the `Expr` proto 
message so every serde benefits, rather than per-message fields).
   2. Set it from `expr.nullable` in the serde layer.
   3. Consume it on the Rust side in `nullable()` and when building output 
`Field`s in `planner.rs`.
   
   ### One caveat
   
   Propagating `nullable=false` is only safe where Comet's native 
implementation provably matches Spark, otherwise a latent value divergence 
becomes a hard Arrow validation error the moment the field is marked 
non-nullable. Comet's current `nullable=true` default is the safe floor, so 
this likely wants to roll out incrementally and per-expression rather than as a 
global flip.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to