andygrove commented on issue #3362: URL: https://github.com/apache/datafusion-comet/issues/3362#issuecomment-4846199823
## Analysis: how this applies to Comet Worth noting that Comet's situation is the inverse of the DataFusion EPIC (apache/datafusion#19144), which makes the fix here cheaper. DataFusion has to reimplement Spark's per-expression nullability rules in Rust (via `return_field_from_args`) because it has no Spark to ask. Comet does have Spark: `expr.nullable` is already computed correctly on the JVM side, including `nullIntolerant` propagation, before serialization. So the natural fix for Comet is to propagate that value through the proto rather than re-deriving it natively. ### Current state - Most native expressions conservatively report `nullable() -> true`. - The only expression that propagates Spark's real nullability today is `JvmScalarUdf`, via the `return_nullable` field on its proto message (`expr.proto`), consumed in `jvm_udf/mod.rs`. - A version-agnostic `isNullIntolerant` shim already exists (`CometExprTraitShim`, handling the 3.x trait and the 4.x boolean method), but it is currently only used by the codegen dispatcher's null short-circuit, not for reporting native output nullability. ### Why it matters The risk is directional: - Over-reporting (`nullable=true` when actually non-null) is safe. It only misses optimizations. This is Comet's current behavior. - Under-reporting (`nullable=false`, then producing a null) is a hard failure: Arrow `RecordBatch` validation throws "declared as non-nullable but contains null values" downstream in shuffle/sort. That second failure mode has already been fixed one expression at a time, for example #4523 (GetStructField), #4533 (CreateArray), #4237 (array_except), and #4554 (codegen dispatcher honoring `ev.isNull` for nullable NullIntolerant roots). A systematic mechanism would retire this recurring class of bugs. ### Suggested approach 1. Generalize the `JvmScalarUdf` pattern into a `return_nullable` channel for general expressions (cleanest as a generic wrapper on the `Expr` proto message so every serde benefits, rather than per-message fields). 2. Set it from `expr.nullable` in the serde layer. 3. Consume it on the Rust side in `nullable()` and when building output `Field`s in `planner.rs`. ### One caveat Propagating `nullable=false` is only safe where Comet's native implementation provably matches Spark, otherwise a latent value divergence becomes a hard Arrow validation error the moment the field is marked non-nullable. Comet's current `nullable=true` default is the safe floor, so this likely wants to roll out incrementally and per-expression rather than as a global flip. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
