Jim Pivarski created ARROW-9556:
-----------------------------------

             Summary: Segfaults in UnionArray with null values
                 Key: ARROW-9556
                 URL: https://issues.apache.org/jira/browse/ARROW-9556
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.0
         Environment: Conda, but pyarrow was installed using pip (in the conda 
environment)
            Reporter: Jim Pivarski


Two operations segfault in pyarrow 1.0.0: extracting a null value from a 
UnionArray whose children contain nulls, and constructing a UnionArray with a 
validity bitmask via pyarrow.Array.from_buffers. I also have an environment 
with pyarrow 0.17.0, and all of the following run correctly there, without 
segfaults.

Here's a UnionArray that works (because there are no nulls):

 
{code:java}
import numpy
import pyarrow

# GOOD
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]),
 pyarrow.array([True, True, False, True, False]),
 ],
)
a.to_pylist(){code}
 

Here's one that fails when you try a.to_pylist() or even just a[2], because one 
of the children has a null at 2:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
 pyarrow.array([True, True, False, True, False]),
 ],
)
a.to_pylist() # also just a[2] causes a segfault{code}
 

Here's another that fails because both children have nulls; the segfault occurs 
at both positions with nulls:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
 pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
 [
 pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
 pyarrow.array([True, None, False, True, False]),
 ],
)
a.to_pylist() # also a[1] and a[2] cause segfaults{code}
 

Here's one that succeeds, but it's dense, rather than sparse:

 
{code:java}
# GOOD
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist(){code}
 

Here's a dense that fails because one child has a null:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

Here's a dense that fails in two positions because both children have a null:

 
{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
 pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
 pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
 [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])],
)
a.to_pylist() # also a[3] and a[5] cause segfaults{code}
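
To see why the failing positions are exactly the ones listed, it may help to model how a union picks its elements. The following is a minimal pure-Python sketch, not pyarrow code: the helper names are hypothetical, and the resulting lists are only what I would presume a correct to_pylist() returns (nulls surfacing as None).

```python
# Pure-Python model of union element lookup (illustrative only, not pyarrow).

def sparse_union_to_pylist(type_ids, children):
    # Sparse: every child is as long as the union itself;
    # element i is children[type_ids[i]][i].
    return [children[t][i] for i, t in enumerate(type_ids)]


def dense_union_to_pylist(type_ids, offsets, children):
    # Dense: children are compact; element i is children[type_ids[i]][offsets[i]].
    return [children[t][o] for t, o in zip(type_ids, offsets)]


# Same inputs as the two "both children have nulls" examples above.
sparse_expected = sparse_union_to_pylist(
    [0, 1, 0, 0, 1],
    [[0.0, 1.1, None, 3.3, 4.4], [True, None, False, True, False]],
)
dense_expected = dense_union_to_pylist(
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 2, 3, 1, 2],
    [[0.0, 1.1, None, 3.3], [True, None, False]],
)

# Positions where the selected child value is null.
sparse_nulls = [i for i, v in enumerate(sparse_expected) if v is None]
dense_nulls = [i for i, v in enumerate(dense_expected) if v is None]
```

Under this model the nulls land at positions 1 and 2 for the sparse case and at 3 and 5 for the dense case, which are the same positions that segfault.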
 

In all of the above, we created the UnionArray using its from_sparse or 
from_dense method. We could instead create it with pyarrow.Array.from_buffers. 
If created with content0 and content1 that have no nulls, it's fine, but if 
created with nulls in the content, it segfaults as soon as you view the null 
value.

 
{code:java}
# GOOD (no nulls in the children)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
# SEGFAULT (use these two lines instead: content0 has a null)
content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "sparse",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 5,
 [
 None,
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
 ],
 children=[content0, content1],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

Similarly for a dense union.

 
{code:java}
# GOOD (no nulls in the children)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
content1 = pyarrow.array([True, True, False])
# SEGFAULT (use these two lines instead: content0 has a null)
content0 = pyarrow.array([0.0, 1.1, None, 3.3])
content1 = pyarrow.array([True, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "dense",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 7,
 [
 None,
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
 pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
 ],
 children=[content0, content1],
)
a.to_pylist() # also just a[3] causes a segfault{code}
 

The next segfaults are different: instead of putting the null values in the 
content, we put the null value in the UnionArray itself. This time, it 
segfaults when it is being created. It also prints some output (all of the 
above were silent segfaults).

 
{code:java}
# SEGFAULT (even to create)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "sparse",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 5,
 [
 pyarrow.py_buffer(numpy.array([251], numpy.uint8)), # (11111011)
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
 # expect null here -----^
 # None <--- placeholder required in pyarrow 0.17.0, not 1.0.0
 ],
 children=[content0, content1],
)
# /arrow/cpp/src/arrow/array/array_nested.cc:617: Check failed: (data_->buffers[0]) == (nullptr)
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7feea9937938]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7feea993814d]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArray7SetDataESt10shared_ptrINS_9ArrayDataEE+0x144)[0x7feea9a869a4]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArrayC1ESt10shared_ptrINS_9ArrayDataEE+0x5a)[0x7feea9a86a2a]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0x9fc)[0x7feea9a5145c]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7feea9a2698f]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7feeaa998853]
# python(+0x13af9e)[0x56146ee77f9e]
# python(_PyObject_MakeTpCall+0x3bf)[0x56146ee6d30f]
# python(_PyEval_EvalFrameDefault+0x5452)[0x56146ef20602]
# python(_PyEval_EvalCodeWithName+0x260)[0x56146ef06190]
# python(PyEval_EvalCode+0x23)[0x56146ef07a03]
# python(+0x23e2f2)[0x56146ef7b2f2]
# python(+0x251082)[0x56146ef8e082]
# python(+0x1063b9)[0x56146ee433b9]
# python(PyRun_InteractiveLoopFlags+0xea)[0x56146ee43559]
# python(+0x1065f3)[0x56146ee435f3]
# python(+0x106817)[0x56146ee43817]
# python(Py_BytesMain+0x39)[0x56146ef91a19]
# /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7feeac198b97]
# python(+0x1f8807)[0x56146ef35807]
# Aborted (core dumped)
{code}
 

And similarly for dense.

 
{code:java}
# SEGFAULT (even to create)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
content1 = pyarrow.array([True, True, False])
types = pyarrow.union(
 [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
 "dense",
 [0, 1],
)
a = pyarrow.Array.from_buffers(
 types,
 7,
 [
 pyarrow.py_buffer(numpy.array([251], numpy.uint8)), # (11111011)
 pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
 pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
 # expect null here -----^
 ],
 children=[content0, content1],
)
# /arrow/cpp/src/arrow/array/array_nested.cc:627: Check failed: (data_->buffers[0]) == (nullptr)
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7f2fb6ad7938]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f2fb6ad814d]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x174)[0x7f2fb6c274a4]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x44)[0x7f2fb6c27524]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0xb14)[0x7f2fb6bf1574]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7f2fb6bc698f]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7f2fb7b38853]
# python(+0x13af9e)[0x558cf09edf9e]
# python(_PyObject_MakeTpCall+0x3bf)[0x558cf09e330f]
# python(_PyEval_EvalFrameDefault+0x5452)[0x558cf0a96602]
# python(_PyEval_EvalCodeWithName+0x260)[0x558cf0a7c190]
# python(PyEval_EvalCode+0x23)[0x558cf0a7da03]
# python(+0x23e2f2)[0x558cf0af12f2]
# python(+0x251082)[0x558cf0b04082]
# python(+0x1063b9)[0x558cf09b93b9]
# python(PyRun_InteractiveLoopFlags+0xea)[0x558cf09b9559]
# python(+0x1065f3)[0x558cf09b95f3]
# python(+0x106817)[0x558cf09b9817]
# python(Py_BytesMain+0x39)[0x558cf0b07a19]
# /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f2fb9338b97]
# python(+0x1f8807)[0x558cf0aab807]
# Aborted (core dumped){code}
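
For reference, the validity byte 251 (11111011) passed as the first buffer in both from_buffers calls marks position 2 as null, because Arrow validity bitmaps are read least-significant bit first. A small pure-Python sketch of that decoding follows; decode_validity is a hypothetical helper, not a pyarrow function.

```python
def decode_validity(bitmap_bytes, length):
    # Element i is valid when bit (i % 8) of byte (i // 8) is set,
    # counting from the least-significant bit.
    return [bool((bitmap_bytes[i // 8] >> (i % 8)) & 1) for i in range(length)]


# Length 5 for the sparse example, length 7 for the dense one.
sparse_validity = decode_validity(bytes([251]), 5)
dense_validity = decode_validity(bytes([251]), 7)

sparse_null_positions = [i for i, ok in enumerate(sparse_validity) if not ok]
dense_null_positions = [i for i, ok in enumerate(dense_validity) if not ok]
```

In both cases the only cleared bit within the array length is bit 2, so only element 2 is intended to be null.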
 

It might be two distinct bugs, but they're both related to UnionArrays and 
nulls, and they're both newer than 0.17.0.
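
Since each failing case takes down the whole interpreter, one way to run the cases in a batch is to execute each in a child process and inspect its exit status. This is only a sketch under my own assumptions: run_isolated is a hypothetical helper, and the snippets passed to it here are stand-ins rather than the pyarrow reproducers.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=60):
    """Run a Python snippet in a child interpreter; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX, a negative return code means the child was killed by a signal
    # (for example -11 for SIGSEGV or -6 for SIGABRT).
    return proc.returncode, proc.stderr


# Stand-in snippets: one that exits cleanly and one that aborts.
code_ok, _ = run_isolated("print('fine')")
code_abort, _ = run_isolated("import os; os.abort()")
```

Each reproducer above could be passed as the snippet string (with its imports included), so a crash in one case does not prevent the others from running.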



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
