Hello,

The Scalar base class has a `bool is_valid` member that is used to represent null scalars for all types (including the null type).

A UnionScalar, since it inherits from Scalar, has the following de facto structure:

{
  // whether the scalar is null
  bool is_valid;
  // the underlying union value (only set if `is_valid` is true)
  std::shared_ptr<Scalar> value;
}

However, union arrays don't have a top-level validity bitmap. A null in an union array is always encoded as a "valid" top-level entry with a null entry in the chosen child array.

Therefore, there are two possible conventions for representing null union scalars:

1) set `scalar.is_valid` to false and `scalar.value` to nullptr

2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for one the union's child types

Advantages / drawbacks of convention 1
--------------------------------------

+ consistency: `MakeNullScalar(union type)` always returns a scalar with `is_valid` set to false

- `union_array.GetScalar(i)` can return a null scalar even when `union_array.IsNull(i)` would return false

Advantages / drawbacks of convention 2
--------------------------------------

+ more congruent with the union datatype definition

+ `union_array.GetScalar(i)` always returns a valid scalar, just like `union_array.IsNull(i)` always returns false

- `MakeNullScalar(union type)` returns a valid scalar (or should it return an error??)

- breaks compatibility, since the current behaviour follows convention 1 (but UnionScalar is probably not widely used...)


This came in the context of https://github.com/apache/arrow/pull/10817, but is unrelated to the changes in this PR.

Thoughts? Opinions?

Regards

Antoine.

Reply via email to