The Union.typeIds property is confusing and its utility is unclear. I'd
like to remove it (or at least document it better). Unless anyone knows of
a real advantage to keeping it, I plan to assemble a PR to drop it from the
format and the C++ implementation.

ARROW-257 (resolved by pull request
https://github.com/apache/arrow/pull/143) extended Unions with an optional
typeIds property (in the C++ implementation, this is
UnionType::type_codes). Prior to that pull request, each element (int8) in
the type_ids (second) buffer of a union array was the index of a child
array. Thus a type_ids buffer beginning with 5 indicated that the union
array began with a value from child_data[5]. After that change, to interpret
a type_id of 5 one must search the typeIds property for the value 5; the
index at which it is found is the index of the corresponding child array.
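
To make the difference concrete, here is a minimal sketch of the two
interpretations. It uses plain Python lists rather than any Arrow API; the
function names and the typeIds value [5, 7] are made up for illustration.

```python
def child_index_direct(type_id):
    # Before ARROW-257: the type_id is itself the index of the child array.
    return type_id


def child_index_via_type_codes(type_id, type_codes):
    # After ARROW-257: the child index is the position at which the
    # type_id appears in the typeIds property (a linear search).
    return type_codes.index(type_id)


# Hypothetical typeIds for a union with two children.
type_codes = [5, 7]
# type_ids buffer of a three-element union array.
type_ids = [5, 7, 5]

children = [child_index_via_type_codes(t, type_codes) for t in type_ids]
```

With these made-up values, a type_id of 5 resolves to child 0 rather than
child 5.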

The change was made to allow unused child arrays to be dropped; for example,
if a union type were predefined with 64 members, then an array of nearly
identical type containing only int32 and utf8 values would only be required
to have two child arrays. Note: the union types are not exactly identical
even though they contain identical members; their typeIds properties will
differ.
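
As a rough sketch of that scenario (again plain Python, not Arrow code; the
member names, the positions 3 and 17, and the compact helper are all
hypothetical):

```python
# Hypothetical predefined union with 64 members; names are placeholders.
members = [f"type{i}" for i in range(64)]
members[3] = "int32"
members[17] = "utf8"


def compact(members, used):
    """Return (typeIds, child types), keeping only the members in `used`."""
    type_codes = [i for i, m in enumerate(members) if m in used]
    return type_codes, [members[i] for i in type_codes]


# Dropping the 62 unused children leaves two child arrays whose typeIds
# record their positions in the predefined union.
type_codes, child_types = compact(members, {"int32", "utf8"})

# A union declared directly over int32 and utf8 has the same children
# but different typeIds, so the two types are not identical.
other_codes, other_types = compact(["int32", "utf8"], {"int32", "utf8"})
```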

However, unused child arrays can be replaced by null arrays (which are
almost as lightweight, since they require no heap allocation). I'm also
unaware of a use case for predefined type_ids; if they are application
specific then I think it's out of scope for Arrow to maintain a child_index
<-> type_id mapping. It seems that the optimization has questionable merit
and does not warrant the added complexity.
