I was also supportive of this pattern. We definitely have used it before to optimize in certain cases.
On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <[email protected]> wrote: > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <[email protected]> > wrote: > > > > In this scenario option A (include child arrays for each child type, even > > if that type is not observed) seems like the clearly correct choice to > me. > > It yiedls a more intuitive layout for the union array and incurs no > runtime > > overhead (since the absent children are empty/null arrays). > > I am not sure this is right. The child arrays still occupy memory in > the Sparse Union case (where all child arrays have the same length). > In order to satisfy the requirements of the IPC protocol, the child > arrays need to be of the same type as the types in the union. In the > Dense Union case, the not-present children will have length 0. > > > > > > why not allow them to be flexible in this regard? > > > > I would say that if code doesn't add anything except cognitive overhead > > then it's worthwhile to remove it. > > The cognitive overhead comes for the Arrow library implementer -- > users of the libraries aren't required to deal with this detail > necessarily. The type ids are optional, after all. Even if it is > removed, you still have ids, so whether it's > > type 0, id=0 > type 1, id=1 > type 2, id=2 > > or > > type 0, id=3 > type 1, id=7 > type 2, id=10 > > the difference is in the second case, you have to look up the code > corresponding to each type rather than assuming that the type's > position and its code are the same. > > In processing, branching should occur at the Type level, so a function > to process a child looks like > > ProcessChild(child, child_id, ...) > > In either case you have to match a child with its id that appears in the > data. > > Anyway, since Julien and I are responsible for introducing this > concept in the early stages of the project I'm interested to hear more > from others. Note that this doesn't serve to resolve the > Union-of-Nested-Types problem that has prevented the development of > integration tests between Java and C++. > > > > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <[email protected]> > wrote: > > > > > hi Ben, > > > > > > Some applications use static type ids for various data types. Let's > > > consider one possibility: > > > > > > BOOLEAN: 0 > > > INT32: 1 > > > DOUBLE: 2 > > > STRING (UTF8): 3 > > > > > > If you were parsing JSON and constructing unions while parsing, you > > > might encounter some types, but not all. So if we _don't_ have the > > > option of having type ids in the metadata then we are left with some > > > unsatisfactory options: > > > > > > A: Include all types in the resulting union, even if they are > unobserved, > > > or > > > B: Assign type id dynamically to types when they are observed > > > > > > Option B: is potentially bad because it does not parallelize across > > > threads or nodes. > > > > > > So I do think the feature is useful. It does make the implementations > > > of unions more complex, though, so it does not come without cost. But > > > unions are already the most complex tool we have in our nested data > > > toolbox, so why not allow them to be flexible in this regard? > > > > > > In any case I'm -0 on making changes, but would be interested in > > > feedback of others if there is strong sentiment about deprecating the > > > feature. > > > > > > - Wes > > > > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <[email protected] > > > > > wrote: > > > > > > > > The Union.typeIds property is confusing and its utility is unclear. > I'd > > > > like to remove it (or at least document it better). Unless anyone > knows a > > > > real advantage for keeping it I plan to assemble a PR to drop it > from the > > > > format and the C++ implementation. > > > > > > > > ARROW-257 ( resolved by pull request > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an > > > optional > > > > typeIds property (in the C++ implementation, this is > > > > UnionType::type_codes). Prior to that pull request each element > (int8) in > > > > the type_ids (second) buffer of a union array was the index of a > child > > > > array. Thus a type_ids buffer beginning with 5 indicated that the > union > > > > array began with a value from child_data[5]. After that change to > > > interpret > > > > a type_id of 5 one must look through the typeIds property and the > index > > > at > > > > which a 5 is found is the index of the corresponding child array. > > > > > > > > The change was made to allow unused child arrays to be dropped; for > > > example > > > > if a union type were predefined with 64 members then an array of > nearly > > > > identical type containing only int32 and utf8 values would only be > > > required > > > > to have two child arrays. Note: the union types are not exactly > identical > > > > even though they contain identical members; their typeIds properties > will > > > > differ. > > > > > > > > However unused child arrays can be replaced by null arrays (which are > > > > almost equally lightweight as they require no heap allocation). I'm > also > > > > unaware of a use case for predefined type_ids; if they are > application > > > > specific then I think it's out of scope for arrow to maintain a > > > child_index > > > > <-> type_id mapping. It seems that the optimization has questionable > > > merit > > > > and does not warrant the added complexity. > > > >
