Hello,
The Scalar base class has a `bool is_valid` member that is used to
represent null scalars for all types (including the null type).
A UnionScalar, since it inherits from Scalar, has the following de facto
structure:
{
// whether the scalar is null
bool is_valid;
// the underlying union value (only set if `is_valid` is true)
std::shared_ptr<Scalar> value;
}
However, union arrays don't have a top-level validity bitmap. A null in
an union array is always encoded as a "valid" top-level entry with a
null entry in the chosen child array.
Therefore, there are two possible conventions for representing null
union scalars:
1) set `scalar.is_valid` to false and `scalar.value` to nullptr
2) set `scalar.is_valid` to true and `scalar.value` to a null scalar for
one the union's child types
Advantages / drawbacks of convention 1
--------------------------------------
+ consistency: `MakeNullScalar(union type)` always returns a scalar with
`is_valid` set to false
- `union_array.GetScalar(i)` can return a null scalar even when
`union_array.IsNull(i)` would return false
Advantages / drawbacks of convention 2
--------------------------------------
+ more congruent with the union datatype definition
+ `union_array.GetScalar(i)` always returns a valid scalar, just like
`union_array.IsNull(i)` always returns false
- `MakeNullScalar(union type)` returns a valid scalar (or should it
return an error??)
- breaks compatibility, since the current behaviour follows convention 1
(but UnionScalar is probably not widely used...)
This came in the context of https://github.com/apache/arrow/pull/10817,
but is unrelated to the changes in this PR.
Thoughts? Opinions?
Regards
Antoine.