In this scenario, option A (include a child array for each child type, even
if that type is not observed) seems like the clearly correct choice to me.
It yields a more intuitive layout for the union array and incurs no runtime
overhead (since the absent children are empty/null arrays).
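
For concreteness, here is a minimal sketch of option A in plain Python. It
is not Arrow's API: build_dense_union and get are hypothetical helpers, the
lists stand in for buffers, and the id table is the one from Wes's example
quoted below (with Python types standing in for the Arrow types).

```python
# Static type ids from the quoted example (BOOLEAN, INT32, DOUBLE, STRING).
# Python types are used as stand-ins; this mapping is an assumption.
TYPE_IDS = {bool: 0, int: 1, float: 2, str: 3}


def build_dense_union(values):
    """Build a toy dense union as (type_ids, offsets, children)."""
    # Option A: allocate one child per known type, observed or not.
    children = [[] for _ in TYPE_IDS]
    type_ids, offsets = [], []
    for value in values:
        type_id = TYPE_IDS[type(value)]
        type_ids.append(type_id)
        # The offset is the position of the value within its child.
        offsets.append(len(children[type_id]))
        children[type_id].append(value)
    return type_ids, offsets, children


def get(union, i):
    """Read slot i: the type id is used directly as the child index."""
    type_ids, offsets, children = union
    return children[type_ids[i]][offsets[i]]


# Two chunks built independently (e.g. on different threads) agree on the
# meaning of every type id without any coordination.
chunk_a = build_dense_union([1, "x", 2])
chunk_b = build_dense_union([3.5, True])
```

Both chunks have four children, so they share one union type; the children
for unobserved types are simply empty.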

> why not allow them to be flexible in this regard?

I would say that if code doesn't add anything except cognitive overhead
then it's worthwhile to remove it.
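
To illustrate the overhead I mean, here is a rough sketch of the two lookups
described in Ben's original message. The function names are hypothetical and
plain lists stand in for the typeIds property; this is not the C++
implementation.

```python
def child_index_direct(type_id):
    # Without typeIds: the type id stored in the buffer is the child index.
    return type_id


def child_index_via_type_ids(type_id, type_ids_property):
    # With typeIds: search the property for the id; the position at which
    # it is found is the index of the corresponding child array.
    return type_ids_property.index(type_id)


# A union that kept only the INT32 (1) and STRING (3) children.
sparse_type_ids = [1, 3]
string_child = child_index_via_type_ids(3, sparse_type_ids)

# The same lookup when every child is present.
string_child_full = child_index_direct(3)
```

Every reader of a union array has to carry that extra mapping around, which
is the cost I'd weigh against the saved empty children.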

On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <[email protected]> wrote:

> hi Ben,
>
> Some applications use static type ids for various data types. Let's
> consider one possibility:
>
> BOOLEAN: 0
> INT32: 1
> DOUBLE: 2
> STRING (UTF8): 3
>
> If you were parsing JSON and constructing unions while parsing, you
> might encounter some types, but not all. So if we _don't_ have the
> option of having type ids in the metadata then we are left with some
> unsatisfactory options:
>
> A: Include all types in the resulting union, even if they are unobserved,
> or
> B: Assign type id dynamically to types when they are observed
>
> Option B is potentially bad because it does not parallelize across
> threads or nodes.
>
> So I do think the feature is useful. It does make the implementations
> of unions more complex, though, so it does not come without cost. But
> unions are already the most complex tool we have in our nested data
> toolbox, so why not allow them to be flexible in this regard?
>
> In any case I'm -0 on making changes, but would be interested in
> feedback of others if there is strong sentiment about deprecating the
> feature.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <[email protected]>
> wrote:
> >
> > The Union.typeIds property is confusing and its utility is unclear. I'd
> > like to remove it (or at least document it better). Unless anyone knows a
> > real advantage for keeping it I plan to assemble a PR to drop it from the
> > format and the C++ implementation.
> >
> > ARROW-257 (resolved by pull request
> > https://github.com/apache/arrow/pull/143) extended Unions with an
> > optional typeIds property (in the C++ implementation, this is
> > UnionType::type_codes). Prior to that pull request each element (int8) in
> > the type_ids (second) buffer of a union array was the index of a child
> > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > array began with a value from child_data[5]. After that change to
> > interpret a type_id of 5 one must look through the typeIds property and
> > the index at which a 5 is found is the index of the corresponding child
> > array.
> >
> > The change was made to allow unused child arrays to be dropped; for
> > example if a union type were predefined with 64 members then an array of
> > nearly identical type containing only int32 and utf8 values would only be
> > required to have two child arrays. Note: the union types are not exactly
> > identical even though they contain identical members; their typeIds
> > properties will differ.
> >
> > However unused child arrays can be replaced by null arrays (which are
> > almost equally lightweight as they require no heap allocation). I'm also
> > unaware of a use case for predefined type_ids; if they are application
> > specific then I think it's out of scope for arrow to maintain a
> > child_index <-> type_id mapping. It seems that the optimization has
> > questionable merit and does not warrant the added complexity.
>
