In this scenario option A (include child arrays for each child type, even if that type is not observed) seems like the clearly correct choice to me. It yiedls a more intuitive layout for the union array and incurs no runtime overhead (since the absent children are empty/null arrays).
> why not allow them to be flexible in this regard? I would say that if code doesn't add anything except cognitive overhead then it's worthwhile to remove it. On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <[email protected]> wrote: > hi Ben, > > Some applications use static type ids for various data types. Let's > consider one possibility: > > BOOLEAN: 0 > INT32: 1 > DOUBLE: 2 > STRING (UTF8): 3 > > If you were parsing JSON and constructing unions while parsing, you > might encounter some types, but not all. So if we _don't_ have the > option of having type ids in the metadata then we are left with some > unsatisfactory options: > > A: Include all types in the resulting union, even if they are unobserved, > or > B: Assign type id dynamically to types when they are observed > > Option B: is potentially bad because it does not parallelize across > threads or nodes. > > So I do think the feature is useful. It does make the implementations > of unions more complex, though, so it does not come without cost. But > unions are already the most complex tool we have in our nested data > toolbox, so why not allow them to be flexible in this regard? > > In any case I'm -0 on making changes, but would be interested in > feedback of others if there is strong sentiment about deprecating the > feature. > > - Wes > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <[email protected]> > wrote: > > > > The Union.typeIds property is confusing and its utility is unclear. I'd > > like to remove it (or at least document it better). Unless anyone knows a > > real advantage for keeping it I plan to assemble a PR to drop it from the > > format and the C++ implementation. > > > > ARROW-257 ( resolved by pull request > > https://github.com/apache/arrow/pull/143 ) extended Unions with an > optional > > typeIds property (in the C++ implementation, this is > > UnionType::type_codes). Prior to that pull request each element (int8) in > > the type_ids (second) buffer of a union array was the index of a child > > array. Thus a type_ids buffer beginning with 5 indicated that the union > > array began with a value from child_data[5]. After that change to > interpret > > a type_id of 5 one must look through the typeIds property and the index > at > > which a 5 is found is the index of the corresponding child array. > > > > The change was made to allow unused child arrays to be dropped; for > example > > if a union type were predefined with 64 members then an array of nearly > > identical type containing only int32 and utf8 values would only be > required > > to have two child arrays. Note: the union types are not exactly identical > > even though they contain identical members; their typeIds properties will > > differ. > > > > However unused child arrays can be replaced by null arrays (which are > > almost equally lightweight as they require no heap allocation). I'm also > > unaware of a use case for predefined type_ids; if they are application > > specific then I think it's out of scope for arrow to maintain a > child_index > > <-> type_id mapping. It seems that the optimization has questionable > merit > > and does not warrant the added complexity. >
