jorisvandenbossche commented on issue #14736:
URL: https://github.com/apache/arrow/issues/14736#issuecomment-1415879965

   To summarize this issue (see also the discussion on this closed 
PR: https://github.com/apache/arrow/pull/14838/files#r1039544789): when 
selecting a field out of a StructArray (`StructArray::GetFlattenedField` in 
C++), the top-level validity bitmap of the struct array needs to be combined 
with the validity bitmap of the child array. For most types this means 
combining the two bitmaps (`BitmapAnd`) and setting the result on the returned 
field array. However, when the child field is a UnionArray, this is more 
complicated, because a union array itself doesn't have a validity bitmap; only 
each of its children has one 
(https://arrow.apache.org/docs/dev/format/Columnar.html#union-layout). So to 
combine the parent bitmap with the union field, it has to be combined with the 
bitmap of each of the union's child arrays. 
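
As an illustration of the two cases described above, here is a minimal pure-Python sketch that models validity bitmaps as lists of booleans. The helper names (`and_bitmaps`, `flatten_plain_field`, `flatten_sparse_union_field`) are hypothetical and not part of Arrow, and this is not necessarily how a fix would be implemented in C++. It only covers sparse unions, where every child has the same length as the union; dense unions would additionally need to go through the offsets.

```python
from typing import List, Optional, Tuple

# A validity bitmap is modelled as a list of booleans (True = valid).
# None stands for "no bitmap present", i.e. every slot is valid.
Bitmap = Optional[List[bool]]


def and_bitmaps(parent: Bitmap, child: Bitmap) -> Bitmap:
    """Combine two validity bitmaps slot by slot (hypothetical helper)."""
    if parent is None:
        return None if child is None else list(child)
    if child is None:
        return list(parent)
    return [p and c for p, c in zip(parent, child)]


def flatten_plain_field(parent: Bitmap, child: Bitmap) -> Bitmap:
    """Ordinary child: the AND-ed bitmap is set on the field itself."""
    return and_bitmaps(parent, child)


def flatten_sparse_union_field(
    parent: Bitmap, union_children: List[Bitmap]
) -> Tuple[Bitmap, List[Bitmap]]:
    """Sparse union child: the union keeps no bitmap of its own, so the
    parent validity is AND-ed into each of the union's children instead."""
    return None, [and_bitmaps(parent, child) for child in union_children]


parent = [True, False, True]
plain = flatten_plain_field(parent, [True, True, False])
union_bitmap, children = flatten_sparse_union_field(
    parent, [None, [False, True, True]]
)
```

In this model the union's own bitmap stays absent (`None`), which matches the layout expectation for unions, while a null in the parent struct still shows up as a null in every union child at that position.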
   Currently, the code in `GetFlattenedField` just sets the and-ed bitmaps on 
the returned array:
   
   
https://github.com/apache/arrow/blob/54ff2d8777717ea5bb811f3653deeb12fc93452e/cpp/src/arrow/array/array_nested.cc#L661
   
   In case `flattened_data` is the data for a UnionArray, this violates the 
expectation that the first buffer (validity bitmap) is always null for unions, 
and this causes a crash.
   
   A minimal reproducer to get the crash:
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   binary = pa.array([b'a', b' ', b'b', b'c', b' ', b' ', b'd'], type='binary')
   int64 = pa.array([0, 1, 0, 0, 2, 3, 0], type='int64')
   types = pa.array([0, 1, 0, 0, 1, 1, 0], type='int8')
   union_array = pa.UnionArray.from_sparse(types, [binary, int64], ['bin', 'int'])
   
   int_array = pa.array(range(7))
   # struct array with union array child and a validity bitmap that is present
   struct_array = pa.StructArray.from_arrays(
       [int_array, union_array], names=["int", "union"],
       mask=pa.array([False]*7)
   )
   struct_array.type
   # StructType(struct<int: int64, union: sparse_union<bin: binary=0, int: int64=1>>)
   
   # using struct_field() kernel to select a field -> works for int field
   pc.struct_field(struct_array, ["int"])
   # crashes for union field
   pc.struct_field(struct_array, ["union"])
   ```

