Jim Pivarski created ARROW-9556:
-----------------------------------

             Summary: Segfaults in UnionArray with null values
                 Key: ARROW-9556
                 URL: https://issues.apache.org/jira/browse/ARROW-9556
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.0
         Environment: Conda, but pyarrow was installed using pip (in the conda environment)
            Reporter: Jim Pivarski

Extracting null values from a UnionArray containing nulls, and constructing a UnionArray with a bitmask in pyarrow.Array.from_buffers, both cause segfaults in pyarrow 1.0.0. I have an environment with pyarrow 0.17.0, and all of the following run correctly, without segfaults, in the older version.

Here's a UnionArray that works (because there are no nulls):

{code:java}
import pyarrow  # needed by every example below

# GOOD
a = pyarrow.UnionArray.from_sparse(
    pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
    [
        pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]),
        pyarrow.array([True, True, False, True, False]),
    ],
)
a.to_pylist()
{code}

Here's one that fails when you try a.to_pylist() or even just a[2], because one of the children has a null at 2:

{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
    pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
    [
        pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
        pyarrow.array([True, True, False, True, False]),
    ],
)
a.to_pylist()  # also just a[2] causes a segfault
{code}

Here's another that fails because both children have nulls; the segfault occurs at both positions with nulls:

{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_sparse(
    pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
    [
        pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
        pyarrow.array([True, None, False, True, False]),
    ],
)
a.to_pylist()  # also a[1] and a[2] cause segfaults
{code}

Here's one that succeeds, but it's dense, rather than sparse:

{code:java}
# GOOD
a = pyarrow.UnionArray.from_dense(
    pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
    pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
    [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist()
{code}

Here's a dense that fails because one child has a null:

{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
    pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
    pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
    [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])],
)
a.to_pylist()  # also just a[3] causes a segfault
{code}

Here's a dense that fails in two positions because both children have a null:

{code:java}
# SEGFAULT
a = pyarrow.UnionArray.from_dense(
    pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
    pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
    [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])],
)
a.to_pylist()  # also a[3] and a[5] cause segfaults
{code}

In all of the above, we created the UnionArray using its from_sparse or from_dense method. We could instead create it with pyarrow.Array.from_buffers. If created with content0 and content1 that have no nulls, it's fine, but if created with nulls in the content, it segfaults as soon as you view the null value.

{code:java}
import numpy  # needed by the from_buffers examples

# GOOD
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])

# SEGFAULT
content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4])
content1 = pyarrow.array([True, True, False, True, False])

types = pyarrow.union(
    [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
    "sparse",
    [0, 1],
)

a = pyarrow.Array.from_buffers(
    types,
    5,
    [
        None,
        pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
    ],
    children=[content0, content1],
)
a.to_pylist()  # also just a[3] causes a segfault
{code}

Similarly for a dense union.

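As a side note on why those particular positions are the ones that fail: the sketch below resolves each logical slot of a union to the child value it points at, using plain Python lists in place of Arrow arrays. It is only an illustration of the indexing, not pyarrow's implementation. The helper names resolve_sparse and resolve_dense are hypothetical, and the sketch assumes the type ids coincide with the child positions, as they do in every example here.

```python
def resolve_sparse(type_ids, children):
    # Sparse union: slot i reads children[type_ids[i]][i].
    return [children[t][i] for i, t in enumerate(type_ids)]


def resolve_dense(type_ids, offsets, children):
    # Dense union: slot i reads children[type_ids[i]][offsets[i]].
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def null_positions(values):
    # Logical positions whose resolved value is null (None).
    return [i for i, v in enumerate(values) if v is None]


# Same data as the sparse example with a null in both children.
sparse_values = resolve_sparse(
    [0, 1, 0, 0, 1],
    [[0.0, 1.1, None, 3.3, 4.4], [True, None, False, True, False]],
)

# Same data as the dense example with a null in both children.
dense_values = resolve_dense(
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 2, 3, 1, 2],
    [[0.0, 1.1, None, 3.3], [True, None, False]],
)

sparse_nulls = null_positions(sparse_values)
dense_nulls = null_positions(dense_values)
```

Under those assumptions, the null slots come out at positions 1 and 2 for the sparse data and at 3 and 5 for the dense data, which are the positions reported as crashing. The dense from_buffers example that follows uses the same type ids and offsets, so a[3] is again the slot that lands on the null in content0.
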
{code:java}
# GOOD
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
content1 = pyarrow.array([True, True, False])

# SEGFAULT
content0 = pyarrow.array([0.0, 1.1, None, 3.3])
content1 = pyarrow.array([True, True, False])

types = pyarrow.union(
    [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
    "dense",
    [0, 1],
)

a = pyarrow.Array.from_buffers(
    types,
    7,
    [
        None,
        pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
        pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
    ],
    children=[content0, content1],
)
a.to_pylist()  # also just a[3] causes a segfault
{code}

The next segfaults are different: instead of putting the null values in the content, we put the null value in the UnionArray itself. This time, it segfaults when it is being created. It also prints some output (all of the above were silent segfaults).

{code:java}
# SEGFAULT (even to create)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
content1 = pyarrow.array([True, True, False, True, False])

types = pyarrow.union(
    [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
    "sparse",
    [0, 1],
)

a = pyarrow.Array.from_buffers(
    types,
    5,
    [
        pyarrow.py_buffer(numpy.array([251], numpy.uint8)),  # (11111011)
        pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
        # expect null at index 2 (the one zero bit in 11111011)
        # None  <--- placeholder required in pyarrow 0.17.0, not 1.0.0
    ],
    children=[content0, content1],
)

# /arrow/cpp/src/arrow/array/array_nested.cc:617: Check failed: (data_->buffers[0]) == (nullptr)
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7feea9937938]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7feea993814d]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArray7SetDataESt10shared_ptrINS_9ArrayDataEE+0x144)[0x7feea9a869a4]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArrayC1ESt10shared_ptrINS_9ArrayDataEE+0x5a)[0x7feea9a86a2a]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0x9fc)[0x7feea9a5145c]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7feea9a2698f]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7feeaa998853]
# python(+0x13af9e)[0x56146ee77f9e]
# python(_PyObject_MakeTpCall+0x3bf)[0x56146ee6d30f]
# python(_PyEval_EvalFrameDefault+0x5452)[0x56146ef20602]
# python(_PyEval_EvalCodeWithName+0x260)[0x56146ef06190]
# python(PyEval_EvalCode+0x23)[0x56146ef07a03]
# python(+0x23e2f2)[0x56146ef7b2f2]
# python(+0x251082)[0x56146ef8e082]
# python(+0x1063b9)[0x56146ee433b9]
# python(PyRun_InteractiveLoopFlags+0xea)[0x56146ee43559]
# python(+0x1065f3)[0x56146ee435f3]
# python(+0x106817)[0x56146ee43817]
# python(Py_BytesMain+0x39)[0x56146ef91a19]
# /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7feeac198b97]
# python(+0x1f8807)[0x56146ef35807]
# Aborted (core dumped)
{code}

And similarly for dense.

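A note on the bitmask in these two examples: Arrow validity bitmaps are read least-significant bit first, so the single byte 251 (binary 11111011) marks logical index 2 as null and every other index as valid. A minimal pure-Python sketch of that decoding follows; the helper name decode_validity is hypothetical and is not a pyarrow function.

```python
def decode_validity(bitmap: bytes, length: int) -> list:
    # Bit (i % 8) of byte (i // 8) is 1 when slot i is valid, 0 when it is null.
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# The bitmask byte used in the examples, for the 5-slot sparse union
# and the 7-slot dense union.
sparse_validity = decode_validity(bytes([251]), 5)
dense_validity = decode_validity(bytes([251]), 7)

sparse_null_slots = [i for i, ok in enumerate(sparse_validity) if not ok]
dense_null_slots = [i for i, ok in enumerate(dense_validity) if not ok]
```

Both lists of null slots contain only index 2. The dense example that follows passes the same single-byte bitmask.
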
{code:java}
# SEGFAULT (even to create)
content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
content1 = pyarrow.array([True, True, False])

types = pyarrow.union(
    [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
    "dense",
    [0, 1],
)

a = pyarrow.Array.from_buffers(
    types,
    7,
    [
        pyarrow.py_buffer(numpy.array([251], numpy.uint8)),  # (11111011)
        pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
        pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
        # expect null at index 2 (the one zero bit in 11111011)
    ],
    children=[content0, content1],
)

# /arrow/cpp/src/arrow/array/array_nested.cc:627: Check failed: (data_->buffers[0]) == (nullptr)
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7f2fb6ad7938]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f2fb6ad814d]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x174)[0x7f2fb6c274a4]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x44)[0x7f2fb6c27524]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0xb14)[0x7f2fb6bf1574]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7f2fb6bc698f]
# /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7f2fb7b38853]
# python(+0x13af9e)[0x558cf09edf9e]
# python(_PyObject_MakeTpCall+0x3bf)[0x558cf09e330f]
# python(_PyEval_EvalFrameDefault+0x5452)[0x558cf0a96602]
# python(_PyEval_EvalCodeWithName+0x260)[0x558cf0a7c190]
# python(PyEval_EvalCode+0x23)[0x558cf0a7da03]
# python(+0x23e2f2)[0x558cf0af12f2]
# python(+0x251082)[0x558cf0b04082]
# python(+0x1063b9)[0x558cf09b93b9]
# python(PyRun_InteractiveLoopFlags+0xea)[0x558cf09b9559]
# python(+0x1065f3)[0x558cf09b95f3]
# python(+0x106817)[0x558cf09b9817]
# python(Py_BytesMain+0x39)[0x558cf0b07a19]
# /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f2fb9338b97]
# python(+0x1f8807)[0x558cf0aab807]
# Aborted (core dumped)
{code}

It might be two distinct bugs, but they're both related to UnionArrays and nulls, and they're both newer than 0.17.0.

--
This message was sent by Atlassian Jira (v8.3.4#803005)