The Union.typeIds property is confusing and its utility is unclear. I'd like to remove it (or at least document it better). Unless anyone knows of a real advantage to keeping it, I plan to assemble a PR that drops it from the format and from the C++ implementation.
ARROW-257 (resolved by pull request https://github.com/apache/arrow/pull/143) extended Unions with an optional typeIds property (in the C++ implementation, this is UnionType::type_codes). Prior to that pull request, each element (int8) in the type_ids (second) buffer of a union array was the index of a child array. Thus a type_ids buffer beginning with 5 indicated that the union array began with a value from child_data[5]. After that change, interpreting a type_id of 5 requires searching the typeIds property: the index at which 5 is found is the index of the corresponding child array.
The change was made to allow unused child arrays to be dropped. For example, if a union type were predefined with 64 members, then an array of nearly identical type containing only int32 and utf8 values would only be required to have two child arrays. Note: the two union types are not exactly identical even though they contain identical members; their typeIds properties will differ.
However, unused child arrays can be replaced by null arrays, which are almost equally lightweight as they require no heap allocation. I'm also unaware of a use case for predefined type_ids; if they are application specific, then I think it's out of scope for Arrow to maintain a child_index <-> type_id mapping. The optimization seems to have questionable merit and does not warrant the added complexity.
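To make the difference between the two interpretations concrete, here is a minimal sketch in plain Python. It does not use the Arrow libraries; the function names and the sample values ([5, 9] for typeIds, the short type_ids buffer) are made up for illustration and are not taken from the format or from any implementation.

```python
def child_index_before(type_id):
    """Pre-ARROW-257 interpretation: the type_id is itself the child index."""
    return type_id


def child_index_after(type_id, type_codes):
    """Post-ARROW-257 interpretation: the child index is the position at
    which type_id appears in the typeIds (type_codes) property."""
    return type_codes.index(type_id)


# Hypothetical union that keeps only two children (say int32 and utf8)
# but retains the ids 5 and 9 from a larger predefined union.
type_codes = [5, 9]

# Hypothetical type_ids buffer for a three-element union array.
type_ids_buffer = [5, 9, 5]

# Each element is resolved through the typeIds mapping.
resolved = [child_index_after(t, type_codes) for t in type_ids_buffer]
print(resolved)  # child indices into the two-element child list
```

Under the old interpretation the same buffer would require at least ten child arrays, since child_index_before(9) is 9; that is the case where the unused slots could instead be filled with null arrays.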