[ https://issues.apache.org/jira/browse/ARROW-9556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164722#comment-17164722 ]
Jim Pivarski commented on ARROW-9556:
-------------------------------------

The second case (segfault on construction) is due to top-level nulls, but in the first case (segfault on get-item), the nulls are on the leaf nodes. I'll take a look at the revised format specification, but top-level nulls have only been removed from unions, right? (Top-level vs not top-level isn't distinguishable on unions, but it would be visible on records or lists.)

> Segfaults in UnionArray with null values
> ----------------------------------------
>
> Key: ARROW-9556
> URL: https://issues.apache.org/jira/browse/ARROW-9556
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 1.0.0
> Environment: Conda, but pyarrow was installed using pip (in the conda environment)
> Reporter: Jim Pivarski
> Priority: Major
>
> Extracting null values from a UnionArray containing nulls and constructing a
> UnionArray with a bitmask in pyarrow.Array.from_buffers causes segfaults in
> pyarrow 1.0.0. I have an environment with pyarrow 0.17.0 and all of the
> following run correctly without segfaults in the older version.
>
> Here's a UnionArray that works (because there are no nulls):
>
> {code:java}
> import pyarrow
>
> # GOOD
> a = pyarrow.UnionArray.from_sparse(
>     pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>     [
>         pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4]),
>         pyarrow.array([True, True, False, True, False]),
>     ],
> )
> a.to_pylist(){code}
>
> Here's one that fails when you try a.to_pylist() or even just a[2], because
> one of the children has a null at 2:
>
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_sparse(
>     pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>     [
>         pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
>         pyarrow.array([True, True, False, True, False]),
>     ],
> )
> a.to_pylist()  # also just a[2] causes a segfault{code}
>
> Here's another that fails because both children have nulls; the segfault
> occurs at both positions with nulls:
>
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_sparse(
>     pyarrow.array([0, 1, 0, 0, 1], type=pyarrow.int8()),
>     [
>         pyarrow.array([0.0, 1.1, None, 3.3, 4.4]),
>         pyarrow.array([True, None, False, True, False]),
>     ],
> )
> a.to_pylist()  # also a[1] and a[2] cause segfaults{code}
>
> Here's one that succeeds, but it's dense, rather than sparse:
>
> {code:java}
> # GOOD
> a = pyarrow.UnionArray.from_dense(
>     pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>     pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>     [pyarrow.array([0.0, 1.1, 2.2, 3.3]), pyarrow.array([True, True, False])],
> )
> a.to_pylist(){code}
>
> Here's a dense that fails because one child has a null:
>
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_dense(
>     pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>     pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>     [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, True, False])],
> )
> a.to_pylist()  # also just a[3] causes a segfault{code}
>
> Here's a dense that fails in two positions because both children have a null:
>
> {code:java}
> # SEGFAULT
> a = pyarrow.UnionArray.from_dense(
>     pyarrow.array([0, 1, 0, 0, 0, 1, 1], type=pyarrow.int8()),
>     pyarrow.array([0, 0, 1, 2, 3, 1, 2], type=pyarrow.int32()),
>     [pyarrow.array([0.0, 1.1, None, 3.3]), pyarrow.array([True, None, False])],
> )
> a.to_pylist()  # also a[3] and a[5] cause segfaults{code}
>
> In all of the above, we created the UnionArray using its from_sparse or
> from_dense method. We could instead create it with pyarrow.Array.from_buffers.
> If created with content0 and content1 that have no nulls, it's fine, but if
> created with nulls in the content, it segfaults as soon as you view the null
> value.
>
> {code:java}
> import numpy
>
> # GOOD
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
> content1 = pyarrow.array([True, True, False, True, False])
> # SEGFAULT
> content0 = pyarrow.array([0.0, 1.1, 2.2, None, 4.4])
> content1 = pyarrow.array([True, True, False, True, False])
> types = pyarrow.union(
>     [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>     "sparse",
>     [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>     types,
>     5,
>     [
>         None,
>         pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
>     ],
>     children=[content0, content1],
> )
> a.to_pylist()  # also just a[3] causes a segfault{code}
>
> Similarly for a dense union.
>
> {code:java}
> # GOOD
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
> content1 = pyarrow.array([True, True, False])
> # SEGFAULT
> content0 = pyarrow.array([0.0, 1.1, None, 3.3])
> content1 = pyarrow.array([True, True, False])
> types = pyarrow.union(
>     [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>     "dense",
>     [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>     types,
>     7,
>     [
>         None,
>         pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
>         pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
>     ],
>     children=[content0, content1],
> )
> a.to_pylist()  # also just a[3] causes a segfault{code}
>
> The next segfaults are different: instead of putting the null values in the
> content, we put the null value in the UnionArray itself. This time, it
> segfaults when it is being created. It also prints some output (all of the
> above were silent segfaults).
>
> {code:java}
> # SEGFAULT (even to create)
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3, 4.4])
> content1 = pyarrow.array([True, True, False, True, False])
> types = pyarrow.union(
>     [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>     "sparse",
>     [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>     types,
>     5,
>     [
>         pyarrow.py_buffer(numpy.array([251], numpy.uint8)),  # (11111011)
>         pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 1], numpy.int8)),
>         #              expect null here -----^
>         # None  <--- placeholder required in pyarrow 0.17.0, not 1.0.0
>     ],
>     children=[content0, content1],
> )
> # /arrow/cpp/src/arrow/array/array_nested.cc:617: Check failed: (data_->buffers[0]) == (nullptr)
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7feea9937938]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7feea993814d]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArray7SetDataESt10shared_ptrINS_9ArrayDataEE+0x144)[0x7feea9a869a4]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow16SparseUnionArrayC1ESt10shared_ptrINS_9ArrayDataEE+0x5a)[0x7feea9a86a2a]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0x9fc)[0x7feea9a5145c]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7feea9a2698f]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7feeaa998853]
> # python(+0x13af9e)[0x56146ee77f9e]
> # python(_PyObject_MakeTpCall+0x3bf)[0x56146ee6d30f]
> # python(_PyEval_EvalFrameDefault+0x5452)[0x56146ef20602]
> # python(_PyEval_EvalCodeWithName+0x260)[0x56146ef06190]
> # python(PyEval_EvalCode+0x23)[0x56146ef07a03]
> # python(+0x23e2f2)[0x56146ef7b2f2]
> # python(+0x251082)[0x56146ef8e082]
> # python(+0x1063b9)[0x56146ee433b9]
> # python(PyRun_InteractiveLoopFlags+0xea)[0x56146ee43559]
> # python(+0x1065f3)[0x56146ee435f3]
> # python(+0x106817)[0x56146ee43817]
> # python(Py_BytesMain+0x39)[0x56146ef91a19]
> # /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7feeac198b97]
> # python(+0x1f8807)[0x56146ef35807]
> # Aborted (core dumped)
> {code}
>
> And similarly for dense.
>
> {code:java}
> # SEGFAULT (even to create)
> content0 = pyarrow.array([0.0, 1.1, 2.2, 3.3])
> content1 = pyarrow.array([True, True, False])
> types = pyarrow.union(
>     [pyarrow.field("0", content0.type), pyarrow.field("1", content1.type)],
>     "dense",
>     [0, 1],
> )
> a = pyarrow.Array.from_buffers(
>     types,
>     7,
>     [
>         pyarrow.py_buffer(numpy.array([251], numpy.uint8)),  # (11111011)
>         pyarrow.py_buffer(numpy.array([0, 1, 0, 0, 0, 1, 1], numpy.int8)),
>         pyarrow.py_buffer(numpy.array([0, 0, 1, 2, 3, 1, 2], numpy.int32)),
>         #              expect null here -----^
>     ],
>     children=[content0, content1],
> )
> # /arrow/cpp/src/arrow/array/array_nested.cc:627: Check failed: (data_->buffers[0]) == (nullptr)
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(+0x4e9938)[0x7f2fb6ad7938]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow4util8ArrowLogD1Ev+0xdd)[0x7f2fb6ad814d]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArray7SetDataERKSt10shared_ptrINS_9ArrayDataEE+0x174)[0x7f2fb6c274a4]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15DenseUnionArrayC2ERKSt10shared_ptrINS_9ArrayDataEE+0x44)[0x7f2fb6c27524]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow15VisitTypeInlineINS_8internal16ArrayDataWrapperEEENS_6StatusERKNS_8DataTypeEPT_+0xb14)[0x7f2fb6bf1574]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/libarrow.so.100(_ZN5arrow9MakeArrayERKSt10shared_ptrINS_9ArrayDataEE+0x3f)[0x7f2fb6bc698f]
> # /home/pivarski/miniconda3/envs/test-arrow/lib/python3.8/site-packages/pyarrow/lib.cpython-38-x86_64-linux-gnu.so(+0x1c7853)[0x7f2fb7b38853]
> # python(+0x13af9e)[0x558cf09edf9e]
> # python(_PyObject_MakeTpCall+0x3bf)[0x558cf09e330f]
> # python(_PyEval_EvalFrameDefault+0x5452)[0x558cf0a96602]
> # python(_PyEval_EvalCodeWithName+0x260)[0x558cf0a7c190]
> # python(PyEval_EvalCode+0x23)[0x558cf0a7da03]
> # python(+0x23e2f2)[0x558cf0af12f2]
> # python(+0x251082)[0x558cf0b04082]
> # python(+0x1063b9)[0x558cf09b93b9]
> # python(PyRun_InteractiveLoopFlags+0xea)[0x558cf09b9559]
> # python(+0x1065f3)[0x558cf09b95f3]
> # python(+0x106817)[0x558cf09b9817]
> # python(Py_BytesMain+0x39)[0x558cf0b07a19]
> # /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f2fb9338b97]
> # python(+0x1f8807)[0x558cf0aab807]
> # Aborted (core dumped){code}
>
> It might be two distinct bugs, but they're both related to UnionArrays and
> nulls, and they're both newer than 0.17.0.
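
For anyone checking which indices the report says should come back as null, here is a minimal pure-Python sketch of the index arithmetic. It does not call pyarrow at all; the helper names (resolve_sparse, resolve_dense, null_positions, decode_validity) are hypothetical and exist only for illustration. It assumes the usual Arrow layout: a sparse union reads child type_ids[i] at position i, a dense union reads child type_ids[i] at position offsets[i], and validity bitmaps are least-significant-bit first.

```python
# Illustrative helpers only: these are NOT pyarrow functions.

def resolve_sparse(type_ids, children):
    """Sparse union: slot i takes its value from children[type_ids[i]][i]."""
    return [children[t][i] for i, t in enumerate(type_ids)]


def resolve_dense(type_ids, offsets, children):
    """Dense union: slot i takes its value from children[type_ids[i]][offsets[i]]."""
    return [children[t][o] for t, o in zip(type_ids, offsets)]


def null_positions(values):
    """Indices of the logical array whose resolved value is null (None)."""
    return [i for i, v in enumerate(values) if v is None]


def decode_validity(bitmap, length):
    """Decode an LSB-first validity bitmap: slot i is bit (i % 8) of byte (i // 8)."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


# Sparse example from the report in which both children contain a null.
sparse_values = resolve_sparse(
    [0, 1, 0, 0, 1],
    [[0.0, 1.1, None, 3.3, 4.4], [True, None, False, True, False]],
)
print(null_positions(sparse_values))  # [1, 2]

# Dense example from the report in which both children contain a null.
dense_values = resolve_dense(
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 2, 3, 1, 2],
    [[0.0, 1.1, None, 3.3], [True, None, False]],
)
print(null_positions(dense_values))  # [3, 5]

# The bitmask byte 251 (0b11111011) marks only slot 2 as null.
print(decode_validity(bytes([251]), 5))  # [True, True, False, True, True]
```

The printed positions match the indices named in the comments of the report (a[1] and a[2] for the sparse case, a[3] and a[5] for the dense case, and slot 2 for the top-level bitmask).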