I was also supportive of this pattern. We definitely have used it before to
optimize in certain cases.

On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <[email protected]> wrote:

> On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <[email protected]>
> wrote:
> >
> > In this scenario option A (include child arrays for each child type, even
> > if that type is not observed) seems like the clearly correct choice to
> me.
> > It yiedls a more intuitive layout for the union array and incurs no
> runtime
> > overhead (since the absent children are empty/null arrays).
>
> I am not sure this is right. The child arrays still occupy memory in
> the Sparse Union case (where all child arrays have the same length).
> In order to satisfy the requirements of the IPC protocol, the child
> arrays need to be of the same type as the types in the union. In the
> Dense Union case, the not-present children will have length 0.
>
> >
> > > why not allow them to be flexible in this regard?
> >
> > I would say that if code doesn't add anything except cognitive overhead
> > then it's worthwhile to remove it.
>
> The cognitive overhead comes for the Arrow library implementer --
> users of the libraries aren't required to deal with this detail
> necessarily. The type ids are optional, after all. Even if it is
> removed, you still have ids, so whether it's
>
> type 0, id=0
> type 1, id=1
> type 2, id=2
>
> or
>
> type 0, id=3
> type 1, id=7
> type 2, id=10
>
> the difference is in the second case, you have to look up the code
> corresponding to each type rather than assuming that the type's
> position and its code are the same.
>
> In processing, branching should occur at the Type level, so a function
> to process a child looks like
>
> ProcessChild(child, child_id, ...)
>
> In either case you have to match a child with its id that appears in the
> data.
>
> Anyway, since Julien and I are responsible for introducing this
> concept in the early stages of the project I'm interested to hear more
> from others. Note that this doesn't serve to resolve the
> Union-of-Nested-Types problem that has prevented the development of
> integration tests between Java and C++.
>
> >
> > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <[email protected]>
> wrote:
> >
> > > hi Ben,
> > >
> > > Some applications use static type ids for various data types. Let's
> > > consider one possibility:
> > >
> > > BOOLEAN: 0
> > > INT32: 1
> > > DOUBLE: 2
> > > STRING (UTF8): 3
> > >
> > > If you were parsing JSON and constructing unions while parsing, you
> > > might encounter some types, but not all. So if we _don't_ have the
> > > option of having type ids in the metadata then we are left with some
> > > unsatisfactory options:
> > >
> > > A: Include all types in the resulting union, even if they are
> unobserved,
> > > or
> > > B: Assign type id dynamically to types when they are observed
> > >
> > > Option B: is potentially bad because it does not parallelize across
> > > threads or nodes.
> > >
> > > So I do think the feature is useful. It does make the implementations
> > > of unions more complex, though, so it does not come without cost. But
> > > unions are already the most complex tool we have in our nested data
> > > toolbox, so why not allow them to be flexible in this regard?
> > >
> > > In any case I'm -0 on making changes, but would be interested in
> > > feedback of others if there is strong sentiment about deprecating the
> > > feature.
> > >
> > > - Wes
> > >
> > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <[email protected]
> >
> > > wrote:
> > > >
> > > > The Union.typeIds property is confusing and its utility is unclear.
> I'd
> > > > like to remove it (or at least document it better). Unless anyone
> knows a
> > > > real advantage for keeping it I plan to assemble a PR to drop it
> from the
> > > > format and the C++ implementation.
> > > >
> > > > ARROW-257 ( resolved by pull request
> > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > optional
> > > > typeIds property (in the C++ implementation, this is
> > > > UnionType::type_codes). Prior to that pull request each element
> (int8) in
> > > > the type_ids (second) buffer of a union array was the index of a
> child
> > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> union
> > > > array began with a value from child_data[5]. After that change to
> > > interpret
> > > > a type_id of 5 one must look through the typeIds property and the
> index
> > > at
> > > > which a 5 is found is the index of the corresponding child array.
> > > >
> > > > The change was made to allow unused child arrays to be dropped; for
> > > example
> > > > if a union type were predefined with 64 members then an array of
> nearly
> > > > identical type containing only int32 and utf8 values would only be
> > > required
> > > > to have two child arrays. Note: the union types are not exactly
> identical
> > > > even though they contain identical members; their typeIds properties
> will
> > > > differ.
> > > >
> > > > However unused child arrays can be replaced by null arrays (which are
> > > > almost equally lightweight as they require no heap allocation). I'm
> also
> > > > unaware of a use case for predefined type_ids; if they are
> application
> > > > specific then I think it's out of scope for arrow to maintain a
> > > child_index
> > > > <-> type_id mapping. It seems that the optimization has questionable
> > > merit
> > > > and does not warrant the added complexity.
> > >
>

Reply via email to