samuelcolvin commented on issue #11162:
URL: https://github.com/apache/datafusion/issues/11162#issuecomment-2212326030

   See #11314 as a demonstration of the problem for both dense and sparse unions.
   
   After a bit of investigation, the issue lies in the first instance with:
   
   
https://github.com/apache/datafusion/blob/08c5345e932f1c5c948751e0d06b1fd99e174efa/datafusion/physical-expr/src/expressions/is_null.rs#L74-L84
   
   Then with [this code](https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-arith/src/boolean.rs#L314-L332) in `arrow-rs`:
   
   ```rs
   /// Returns a non-null [BooleanArray] with whether each value of the array is null.
   /// # Error
   /// This function never errors.
   /// # Example
   /// ...
   pub fn is_null(input: &dyn Array) -> Result<BooleanArray, ArrowError> {
       let values = match input.logical_nulls() {
           None => BooleanBuffer::new_unset(input.len()),
           Some(nulls) => !nulls.inner(),
       };
   
       Ok(BooleanArray::new(values, None))
   }
   ```
   
   And then with [this code](https://github.com/apache/arrow-rs/blob/b9562b9550b8ff4aa7be9859e56e467b1a3b3de6/arrow-array/src/array/union_array.rs#L482-L486):
   
   ```rs
       /// Union types always return non null as there is no validity buffer.
       /// To check validity correctly you must check the underlying vector.
       fn is_null(&self, _index: usize) -> bool {
           false
       }
   ```
   
   Ultimately with [the spec](https://github.com/apache/arrow/blob/674e70891d1b3bc82b025d9c434d8ff1aa4c877e/docs/source/format/Columnar.rst?plain=1#L862-L864):
   
   > Unlike other data types, unions do not have their own validity bitmap. Instead,
   > the nullness of each slot is determined exclusively by the child arrays which
   > are composed to create the union.
   
   ---
   
   Basically, arrow is saying "we're not going to tell you if a union is null, you need to look in the child arrays", but datafusion isn't listening and is just asking the union if it's null in the naive way.
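
To make the failure mode concrete, here is a minimal std-only sketch of the naive path. `ToyArray` and `naive_is_null` are hypothetical names invented for illustration, not arrow-rs APIs; the function only mirrors the shape of the `is_null` kernel quoted above, assuming the union reports no null information of its own.

```rust
/// Hypothetical stand-in for an array that may or may not own a validity bitmap.
pub struct ToyArray {
    pub len: usize,
    /// `None` models a union: it has no validity bitmap of its own.
    /// `Some` holds one flag per slot (true = valid, false = null).
    pub validity: Option<Vec<bool>>,
}

/// Same shape as the quoted kernel: no null information means "nothing is null".
pub fn naive_is_null(array: &ToyArray) -> Vec<bool> {
    match &array.validity {
        // No bitmap: every slot is reported as not null.
        None => vec![false; array.len],
        // Bitmap present: a slot is null exactly when it is not valid.
        Some(valid) => valid.iter().map(|v| !v).collect(),
    }
}
```

With `validity: None`, every slot is reported as not null regardless of what the children contain, which matches the behaviour described above.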
   
   Two options to move forward as far as I can tell:
   1. Decide unions in DF can never be null, in which case I'll need to abandon unions in `datafusion-functions-json` and just return strings everywhere
   2. Have custom logic for unions that looks up the child array to determine if the value is null
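
As a rough illustration of the second option, the child lookup could look something like the std-only sketch below. `ToyChild`, `ToyUnion` and `union_is_null` are hypothetical types and names, not arrow-rs or DataFusion APIs, and the model is simplified: type ids are used directly as child indices, whereas real Arrow unions map `i8` type ids to fields.

```rust
/// Hypothetical stand-in for a child array: one validity flag per slot
/// (true = valid, false = null). Real Arrow children also carry values.
pub struct ToyChild {
    pub validity: Vec<bool>,
}

/// Hypothetical stand-in for a union array.
pub struct ToyUnion {
    /// For each slot, the index of the child that holds its value.
    pub type_ids: Vec<usize>,
    /// `Some` for a dense union (per-slot index into the selected child),
    /// `None` for a sparse union (the child is indexed by the slot itself).
    pub offsets: Option<Vec<usize>>,
    pub children: Vec<ToyChild>,
}

/// Returns, per slot, whether the selected child holds a null there.
pub fn union_is_null(u: &ToyUnion) -> Vec<bool> {
    u.type_ids
        .iter()
        .enumerate()
        .map(|(slot, &type_id)| {
            // Dense unions store an explicit offset; sparse unions reuse the slot index.
            let child_index = match &u.offsets {
                Some(offsets) => offsets[slot],
                None => slot,
            };
            !u.children[type_id].validity[child_index]
        })
        .collect()
}
```

The only difference between the dense and sparse cases in this model is how the index into the child is chosen; the null-ness itself always comes from the child.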
   
   If (as I hope) we go for the second option, there's also the issue (as demonstrated by #11314) that the representation of "null" union items doesn't match other types: it shows `{A=}` instead of an empty string.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

