[ https://issues.apache.org/jira/browse/ARROW-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siyuan Zhuang reassigned ARROW-3765:
------------------------------------

    Assignee: Siyuan Zhuang

> [Gandiva] Segfault when the validity bitmap has not been allocated
> ------------------------------------------------------------------
>
>                 Key: ARROW-3765
>                 URL: https://issues.apache.org/jira/browse/ARROW-3765
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Gandiva
>            Reporter: Siyuan Zhuang
>            Assignee: Siyuan Zhuang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The segfault occurs because the validity buffer of a column can be `None`:
> {code}
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10)))
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [None, <pyarrow.lib.Buffer object at 0x110c1a228>]
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0)
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [<pyarrow.lib.Buffer object at 0x11a2b3030>, <pyarrow.lib.Buffer object at 0x11a2b3228>]{code}
> Gandiva does not handle this case yet, so it dereferences a nullptr:
> {code}
> void Annotator::PrepareBuffersForField(const FieldDescriptor& desc,
>                                        const arrow::ArrayData& array_data,
>                                        EvalBatch* eval_batch) {
>   int buffer_idx = 0;
>   // TODO:
>   // - validity is optional
>   uint8_t* validity_buf = const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data());
>   eval_batch->SetBuffer(desc.validity_idx(), validity_buf);
>   ++buffer_idx;
> {code}
>
> Code to reproduce:
> {code:java}
> frame_data = np.random.randint(0, 100, size=(2**22, 10))
> df = pd.DataFrame(frame_data)
> table = pa.Table.from_pandas(df)
> filt = ...  # Create any gandiva filter
> r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool())  # segfault{code}
> Backtrace:
> {code:java}
> * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
>   * frame #0: 0x00000001060184fc libarrow.12.dylib`arrow::Buffer::data(this=0x0000000000000000) const at buffer.h:162
>     frame #1: 0x0000000106fbed78 libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x0000000100624dc8, desc=0x000000010101e138, array_data=0x000000010061f8e8, eval_batch=0x0000000100796848) at annotator.cc:65
>     frame #2: 0x0000000106fbf4ed libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x0000000100624dc8, record_batch=0x00000001007a45b8, out_vector=size=1) at annotator.cc:94
>     frame #3: 0x00000001071449b7 libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x0000000100624da0, record_batch=0x00000001007a45b8, output_vector=size=1) at llvm_generator.cc:102
>     frame #4: 0x0000000107059a4f libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x000000010079c668, batch=0x00000001007a45b8, out_selection=std::__1::shared_ptr<gandiva::SelectionVector>::element_type @ 0x00000001007a43e8 strong=2 weak=1) at filter.cc:106
>     frame #5: 0x000000010948e002 gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*, _object*, _object*) + 1986
>     frame #6: 0x0000000100140e8b Python`_PyCFunction_FastCallDict + 475
>     frame #7: 0x00000001001d28ca Python`call_function + 602
>     frame #8: 0x00000001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>     frame #9: 0x00000001001d3cf9 Python`fast_function + 569
>     frame #10: 0x00000001001d2899 Python`call_function + 553
>     frame #11: 0x00000001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>     frame #12: 0x00000001001d34c6 Python`_PyEval_EvalCodeWithName + 2902
>     frame #13: 0x00000001001c96e0 Python`PyEval_EvalCode + 48
>     frame #14: 0x00000001002029ae Python`PyRun_FileExFlags + 174
>     frame #15: 0x0000000100201f75 Python`PyRun_SimpleFileExFlags + 277
>     frame #16: 0x000000010021ef46 Python`Py_Main + 3558
>     frame #17: 0x0000000100000e08 Python`___lldb_unnamed_symbol1$$Python + 248
>     frame #18: 0x00007fff6ea72085 libdyld.dylib`start + 1{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
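
For illustration only, the missing guard described in the issue (checking the validity buffer before calling `data()` on it) might look roughly like the sketch below. `Buffer` and `ArrayData` here are simplified, hypothetical stand-ins rather than the real `arrow::Buffer` and `arrow::ArrayData`, and `ValidityOrNull` is a made-up helper name. How Gandiva's generated code should then treat a null validity pointer is not covered by the issue text, so the sketch stops at the null check.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::Buffer: owns bytes and exposes data().
struct Buffer {
  std::vector<uint8_t> bytes;
  const uint8_t* data() const { return bytes.data(); }
};

// Hypothetical stand-in for arrow::ArrayData: buffers[0] is the validity
// bitmap and may be a null shared_ptr when no bitmap was allocated.
struct ArrayData {
  std::vector<std::shared_ptr<Buffer>> buffers;
};

// Hypothetical helper: returns the validity bitmap pointer, or nullptr when
// the bitmap is absent, instead of dereferencing a null buffer.
const uint8_t* ValidityOrNull(const ArrayData& array_data) {
  if (array_data.buffers.empty() || array_data.buffers[0] == nullptr) {
    return nullptr;
  }
  return array_data.buffers[0]->data();
}
```

A caller would pass the returned pointer on (for example to something like `SetBuffer`) and would need to make sure downstream code accepts a null validity pointer.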