[jira] [Created] (ARROW-8081) Fix memory size when using huge pages in plasma; other code cleanups
Siyuan Zhuang created ARROW-8081: Summary: Fix memory size when using huge pages in plasma; other code cleanups Key: ARROW-8081 URL: https://issues.apache.org/jira/browse/ARROW-8081 Project: Apache Arrow Issue Type: Bug Components: C++ - Plasma Reporter: Siyuan Zhuang Assignee: Siyuan Zhuang In the original code, 'PlasmaAllocator::SetFootprintLimit' is called before huge pages are handled, even though huge-page handling can change the memory limit. I also found other untidy code that should be rewritten more cleanly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8072) Add const constraint when parsing data
Siyuan Zhuang created ARROW-8072: Summary: Add const constraint when parsing data Key: ARROW-8072 URL: https://issues.apache.org/jira/browse/ARROW-8072 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Siyuan Zhuang Assignee: Siyuan Zhuang Input data for plasma protocol.h/protocol.cc should be const. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8030) Fix inconsistent comment style in plasma
Siyuan Zhuang created ARROW-8030: Summary: Fix inconsistent comment style in plasma Key: ARROW-8030 URL: https://issues.apache.org/jira/browse/ARROW-8030 Project: Apache Arrow Issue Type: Improvement Components: C++ - Plasma Reporter: Siyuan Zhuang Assignee: Siyuan Zhuang The comments in plasma are a mixture of the '@param' and '\param' styles. Reviewers asked me to unify the style while I was adding Windows support; I think it is better to address this in a separate PR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
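For the record, this kind of style mixture is easy to detect mechanically. A minimal sketch in Python — the helper name and regexes are illustrative, not part of any Arrow tooling — that flags a source snippet mixing Doxygen's '@param' and '\param' command styles:

```python
import re

# Doxygen accepts both '@command' and '\command'; a file using both
# styles at once is what this issue wants to clean up.
AT_STYLE = re.compile(r"@(param|return|brief)\b")
BACKSLASH_STYLE = re.compile(r"\\(param|return|brief)\b")

def mixes_comment_styles(source: str) -> bool:
    """Return True if the snippet uses both Doxygen command styles."""
    return bool(AT_STYLE.search(source)) and bool(BACKSLASH_STYLE.search(source))

mixed = "/// \\brief Connect to the store.\n/// @param addr socket address\n"
uniform = "/// \\brief Connect.\n/// \\param addr socket address\n"
print(mixes_comment_styles(mixed))    # True
print(mixes_comment_styles(uniform))  # False
```

Running a check like this over the plasma sources would enumerate the files a unification PR needs to touch.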
[jira] [Assigned] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang reassigned ARROW-4418: Assignee: Siyuan Zhuang > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Zhijun Fu >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770772#comment-16770772 ] Siyuan Zhuang edited comment on ARROW-4418 at 2/18/19 4:41 AM: --- I will try to create a PR soon. The only problem is that standalone asio does not include boost::bind, which might be used in our implementation (we have already used boost::bind in a similar case in the Ray project, and the official asio examples also use it). If bind turns out to be unavoidable, I will try std::bind first. was (Author: suquark): I will try to create a PR soon. The only problem is the standalone asio does not include boost::bind, which might be used in our implementation (We have already used boost::bind in a similar case in Ray project, and the official asio examples also use it). > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Zhijun Fu >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770772#comment-16770772 ] Siyuan Zhuang commented on ARROW-4418: -- I will try to create a PR soon. The only problem is the standalone asio does not include boost::bind, which might be used in our implementation (We have already used boost::bind in a similar case in Ray project, and the official asio examples also use it). > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Zhijun Fu >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756914#comment-16756914 ] Siyuan Zhuang commented on ARROW-4418: -- [~zhijunfu] I wonder if we could just move "client_connection" from Ray to Arrow, so we can share some common functions. > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2759) Export notification socket of Plasma
[ https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang reassigned ARROW-2759: Assignee: Siyuan Zhuang > Export notification socket of Plasma > > > Key: ARROW-2759 > URL: https://issues.apache.org/jira/browse/ARROW-2759 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++), Python >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Currently, I am implementing an async interface for Ray. The implementation > needs a message polling method like `get_next_notification`. > Unfortunately, I find that `get_next_notification` in > `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` > is blocking, which is an impediment to implementing async utilities. Also, it's > hard to check the status of the socket (it could be closed or broken). So I > suggest exporting the notification socket to allow more flexibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2759) Export notification socket of Plasma
[ https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-2759: - Description: Currently, I am implementing an async interface for Ray. The implementation needs a message polling method like `get_next_notification`. Unfortunately, I find that `get_next_notification` in `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` is blocking, which is an impediment to implementing async utilities. Also, it's hard to check the status of the socket (it could be closed or broken). So I suggest exporting the notification socket to allow more flexibility. was: Currently, I am implementing an async interface for Ray. The implementation needs some kind of message polling methods like `get_next_notification`. Unfortunately, I find `get_next_notification` in `https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` blocking, which is an impediment to implementing async utilities. So I suggest adding some parameters like `timeout`. It could be done by operating its underlying socket. > Export notification socket of Plasma > > > Key: ARROW-2759 > URL: https://issues.apache.org/jira/browse/ARROW-2759 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++), Python >Reporter: Siyuan Zhuang >Priority: Major > > Currently, I am implementing an async interface for Ray. The implementation > needs a message polling method like `get_next_notification`. > Unfortunately, I find that `get_next_notification` in > `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` > is blocking, which is an impediment to implementing async utilities. Also, it's > hard to check the status of the socket (it could be closed or broken). So I > suggest exporting the notification socket to allow more flexibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2759) Export notification socket of Plasma
[ https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-2759: - Summary: Export notification socket of Plasma (was: Timeout for `get_next_notification()` in Plasma) > Export notification socket of Plasma > > > Key: ARROW-2759 > URL: https://issues.apache.org/jira/browse/ARROW-2759 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++), Python >Reporter: Siyuan Zhuang >Priority: Major > > Currently, I am implementing an async interface for Ray. The implementation > needs some kind of message polling methods like `get_next_notification`. > Unfortunately, I find `get_next_notification` in > `https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` > blocking, which is an impediment to implementing async utilities. So I > suggest adding some parameters like `timeout`. It could be done by operating > its underlying socket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3799) Improve `make_in_expression`
[ https://issues.apache.org/jira/browse/ARROW-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-3799: - Description: The `make_in_expression` in gandiva was not implemented correctly. Although [ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] has fixed part of it, further improvement is still necessary. See `test_in_expr_todo` in [python/pyarrow/tests/test_gandiva.py|https://github.com/apache/arrow/pull/2936/files#diff-9ab0e0dc1f329321ff4555b043ee0f41] for details. (was: The `make_in_expression` in gandiva was not implemented correctly. Although [ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] has fixed part of it, further improvement is still necessary.) > Improve `make_in_expression` > > > Key: ARROW-3799 > URL: https://issues.apache.org/jira/browse/ARROW-3799 > Project: Apache Arrow > Issue Type: Improvement > Components: Gandiva >Reporter: Siyuan Zhuang >Priority: Major > > The `make_in_expression` in gandiva was not implemented correctly. Although > [ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] > has fixed part of it, further improvement is still necessary. See > `test_in_expr_todo` in > [python/pyarrow/tests/test_gandiva.py|https://github.com/apache/arrow/pull/2936/files#diff-9ab0e0dc1f329321ff4555b043ee0f41] > for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3799) Improve `make_in_expression`
Siyuan Zhuang created ARROW-3799: Summary: Improve `make_in_expression` Key: ARROW-3799 URL: https://issues.apache.org/jira/browse/ARROW-3799 Project: Apache Arrow Issue Type: Improvement Components: Gandiva Reporter: Siyuan Zhuang The `make_in_expression` in gandiva was not implemented correctly. Although [ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] has fixed part of it, further improvement is still necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3765) [Gandiva] Segfault when the validity bitmap has not been allocated
[ https://issues.apache.org/jira/browse/ARROW-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang reassigned ARROW-3765: Assignee: Siyuan Zhuang > [Gandiva] Segfault when the validity bitmap has not been allocated > -- > > Key: ARROW-3765 > URL: https://issues.apache.org/jira/browse/ARROW-3765 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Gandiva >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This is because the `validity buffer` could be `None`: > {code} > >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))) > >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() > [None, ] > >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0) > >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() > [, 0x11a2b3228>]{code} > But Gandiva does not handle this case yet, and thus dereferences a nullptr: > {code} > void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const > arrow::ArrayData& array_data, EvalBatch* eval_batch) { > int buffer_idx = 0; > // TODO: > // - validity is optional > uint8_t* validity_buf = > const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data()); > eval_batch->SetBuffer(desc.validity_idx(), validity_buf); > ++buffer_idx; > {code} > > Reproduce code: > {code:java} > df = pd.DataFrame(np.random.randint(0, 100, size=(2**22, 10))) > table = pa.Table.from_pandas(df) > filt = ... 
# Create any gandiva filter > r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # > segfault{code} > Backtrace: > {code:java} > * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x10) > * frame #0: 0x0001060184fc > libarrow.12.dylib`arrow::Buffer::data(this=0x) const at > buffer.h:162 > frame #1: 0x000106fbed78 > libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8, > desc=0x00010101e138, array_data=0x00010061f8e8, > eval_batch=0x000100796848) at annotator.cc:65 > frame #2: 0x000106fbf4ed > libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8, > record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94 > frame #3: 0x0001071449b7 > libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, > record_batch=0x0001007a45b8, output_vector=size=1) at > llvm_generator.cc:102 > frame #4: 0x000107059a4f > libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, > batch=0x0001007a45b8, > out_selection=std::__1::shared_ptr::element_type @ > 0x0001007a43e8 strong=2 weak=1) at filter.cc:106 > frame #5: 0x00010948e002 > gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*, > _object*, _object*) + 1986 > frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475 > frame #7: 0x0001001d28ca Python`call_function + 602 > frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 > frame #9: 0x0001001d3cf9 Python`fast_function + 569 > frame #10: 0x0001001d2899 Python`call_function + 553 > frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 > frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902 > frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48 > frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174 > frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277 > frame #16: 0x00010021ef46 Python`Py_Main + 3558 > frame #17: 0x00010e08 
Python`___lldb_unnamed_symbol1$$Python + 248 > frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3765) [Gandiva] Segfault when the validity bitmap has not been allocated
[ https://issues.apache.org/jira/browse/ARROW-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-3765: - Summary: [Gandiva] Segfault when the validity bitmap has not been allocated (was: [Gandiva] Segfault when validity bitmap has not been allocated) > [Gandiva] Segfault when the validity bitmap has not been allocated > -- > > Key: ARROW-3765 > URL: https://issues.apache.org/jira/browse/ARROW-3765 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Gandiva >Reporter: Siyuan Zhuang >Priority: Major > Labels: pull-request-available > > This is because the `validity buffer` could be `None`: > {code} > >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))) > >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() > [None, ] > >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0) > >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() > [, 0x11a2b3228>]{code} > But Gandiva does not handle this case yet, and thus dereferences a nullptr: > {code} > void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const > arrow::ArrayData& array_data, EvalBatch* eval_batch) { > int buffer_idx = 0; > // TODO: > // - validity is optional > uint8_t* validity_buf = > const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data()); > eval_batch->SetBuffer(desc.validity_idx(), validity_buf); > ++buffer_idx; > {code} > > Reproduce code: > {code:java} > df = pd.DataFrame(np.random.randint(0, 100, size=(2**22, 10))) > table = pa.Table.from_pandas(df) > filt = ... 
# Create any gandiva filter > r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # > segfault{code} > Backtrace: > {code:java} > * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS > (code=1, address=0x10) > * frame #0: 0x0001060184fc > libarrow.12.dylib`arrow::Buffer::data(this=0x) const at > buffer.h:162 > frame #1: 0x000106fbed78 > libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8, > desc=0x00010101e138, array_data=0x00010061f8e8, > eval_batch=0x000100796848) at annotator.cc:65 > frame #2: 0x000106fbf4ed > libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8, > record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94 > frame #3: 0x0001071449b7 > libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, > record_batch=0x0001007a45b8, output_vector=size=1) at > llvm_generator.cc:102 > frame #4: 0x000107059a4f > libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, > batch=0x0001007a45b8, > out_selection=std::__1::shared_ptr::element_type @ > 0x0001007a43e8 strong=2 weak=1) at filter.cc:106 > frame #5: 0x00010948e002 > gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*, > _object*, _object*) + 1986 > frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475 > frame #7: 0x0001001d28ca Python`call_function + 602 > frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 > frame #9: 0x0001001d3cf9 Python`fast_function + 569 > frame #10: 0x0001001d2899 Python`call_function + 553 > frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 > frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902 > frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48 > frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174 > frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277 > frame #16: 0x00010021ef46 Python`Py_Main + 3558 > frame #17: 0x00010e08 
Python`___lldb_unnamed_symbol1$$Python + 248 > frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3765) Gandiva segfault when using int64 recordbatch as its input
Siyuan Zhuang created ARROW-3765: Summary: Gandiva segfault when using int64 recordbatch as its input Key: ARROW-3765 URL: https://issues.apache.org/jira/browse/ARROW-3765 Project: Apache Arrow Issue Type: Bug Components: C++, Gandiva Reporter: Siyuan Zhuang This is because the `validity buffer` could be `None`: {code} >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))) >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() [None, ] >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0) >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers() [, ]{code} But Gandiva does not handle this case yet, and thus dereferences a nullptr: {code} void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const arrow::ArrayData& array_data, EvalBatch* eval_batch) { int buffer_idx = 0; // TODO: // - validity is optional uint8_t* validity_buf = const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data()); eval_batch->SetBuffer(desc.validity_idx(), validity_buf); ++buffer_idx; {code} Reproduce code: {code:java} df = pd.DataFrame(np.random.randint(0, 100, size=(2**22, 10))) table = pa.Table.from_pandas(df) filt = ... 
# Create any gandiva filter r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # segfault{code} Backtrace: {code:java} * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10) * frame #0: 0x0001060184fc libarrow.12.dylib`arrow::Buffer::data(this=0x) const at buffer.h:162 frame #1: 0x000106fbed78 libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8, desc=0x00010101e138, array_data=0x00010061f8e8, eval_batch=0x000100796848) at annotator.cc:65 frame #2: 0x000106fbf4ed libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8, record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94 frame #3: 0x0001071449b7 libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, record_batch=0x0001007a45b8, output_vector=size=1) at llvm_generator.cc:102 frame #4: 0x000107059a4f libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, batch=0x0001007a45b8, out_selection=std::__1::shared_ptr::element_type @ 0x0001007a43e8 strong=2 weak=1) at filter.cc:106 frame #5: 0x00010948e002 gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*, _object*, _object*) + 1986 frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475 frame #7: 0x0001001d28ca Python`call_function + 602 frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 frame #9: 0x0001001d3cf9 Python`fast_function + 569 frame #10: 0x0001001d2899 Python`call_function + 553 frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616 frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902 frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48 frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174 frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277 frame #16: 0x00010021ef46 Python`Py_Main + 3558 frame #17: 0x00010e08 Python`___lldb_unnamed_symbol1$$Python + 248 frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code} -- This 
message was sent by Atlassian JIRA (v7.6.3#76005)
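The fix amounts to a null check before the validity buffer is dereferenced. A Python transliteration of the missing guard — the function name echoes the C++ `PrepareBuffersForField` logic but this sketch is purely illustrative, not Arrow code:

```python
def prepare_validity_buffer(buffers):
    """Return the validity bitmap, or None when the column is all-valid.

    Arrow leaves buffers[0] as None when no validity bitmap was
    allocated; the C++ code quoted above dereferences it
    unconditionally, which is the crash reported in this issue.
    """
    validity = buffers[0]
    if validity is None:
        return None  # all values valid: no bitmap to bind
    return validity

print(prepare_validity_buffer([None, b"\x01\x02"]))     # None
print(prepare_validity_buffer([b"\xff", b"\x01\x02"]))  # b'\xff'
```

In the C++ code the same shape is a `nullptr` check on `array_data.buffers[buffer_idx]` before calling `data()` on it.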
[jira] [Created] (ARROW-3751) Add more cython bindings for gandiva
Siyuan Zhuang created ARROW-3751: Summary: Add more cython bindings for gandiva Key: ARROW-3751 URL: https://issues.apache.org/jira/browse/ARROW-3751 Project: Apache Arrow Issue Type: Improvement Components: Gandiva, Python Reporter: Siyuan Zhuang Assignee: Siyuan Zhuang Some cython bindings were left out of ARROW-3602 (MakeAdd, MakeOr, MakeIn). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3742) Fix pyarrow.types & gandiva cython bindings
[ https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-3742: - Component/s: Python Gandiva > Fix pyarrow.types & gandiva cython bindings > --- > > Key: ARROW-3742 > URL: https://issues.apache.org/jira/browse/ARROW-3742 > Project: Apache Arrow > Issue Type: Bug > Components: Gandiva, Python >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > > 1. 'types.py' didn't export `_as_type`, causing failures in certain > cython/python combinations. I am surprised to see that the CI didn't fail. > 2. After updating the gandiva cpp part (ARROW-3587), the cython bindings > (ARROW-3602) are not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3742) Fix pyarrow.types & gandiva cython bindings
[ https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-3742: - Description: 1. 'types.py' didn't export `_as_type`, causing failures in certain cython/python combinations. I am surprised to see that the CI didn't fail. 2. After updating the gandiva cpp part (ARROW-3587), the cython bindings (ARROW-3602) are not consistent. was:After updating the gandiva cpp part (ARROW-3587), the cython bindings (ARROW-3602) are not consistent. Summary: Fix pyarrow.types & gandiva cython bindings (was: Fix gandiva cython bindings) > Fix pyarrow.types & gandiva cython bindings > --- > > Key: ARROW-3742 > URL: https://issues.apache.org/jira/browse/ARROW-3742 > Project: Apache Arrow > Issue Type: Bug >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > > 1. 'types.py' didn't export `_as_type`, causing failures in certain > cython/python combinations. I am surprised to see that the CI didn't fail. > 2. After updating the gandiva cpp part (ARROW-3587), the cython bindings > (ARROW-3602) are not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3742) Fix gandiva cython bindings
Siyuan Zhuang created ARROW-3742: Summary: Fix gandiva cython bindings Key: ARROW-3742 URL: https://issues.apache.org/jira/browse/ARROW-3742 Project: Apache Arrow Issue Type: Bug Reporter: Siyuan Zhuang After updating the gandiva cpp part (ARROW-3587), the cython bindings (ARROW-3602) are not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3742) Fix gandiva cython bindings
[ https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang reassigned ARROW-3742: Assignee: Siyuan Zhuang > Fix gandiva cython bindings > --- > > Key: ARROW-3742 > URL: https://issues.apache.org/jira/browse/ARROW-3742 > Project: Apache Arrow > Issue Type: Bug >Reporter: Siyuan Zhuang >Assignee: Siyuan Zhuang >Priority: Major > > After updating the gandiva cpp part (ARROW-3587), the cython bindings > (ARROW-3602) are not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3698) Segmentation fault when using a large table in Gandiva
[ https://issues.apache.org/jira/browse/ARROW-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyuan Zhuang updated ARROW-3698: - Summary: Segmentation fault when using a large table in Gandiva (was: Segmentation fault when using large table in Gandiva) > Segmentation fault when using a large table in Gandiva > -- > > Key: ARROW-3698 > URL: https://issues.apache.org/jira/browse/ARROW-3698 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Gandiva >Reporter: Siyuan Zhuang >Priority: Major > > {code} > >>> import pyarrow as pa > Registry has 519 pre-compiled functions > >>> import pandas as pd > >>> import numpy as np > >>> import pyarrow.gandiva as gandiva > >>> import timeit > >>> > >>> from matplotlib import pyplot as plt > >>> for scale in range(25, 26): > ... frame_data = 1.0 * np.random.randint(0, 100, size=(2**scale, 2)) > ... df = pd.DataFrame(frame_data).add_prefix("col") > ... table = pa.Table.from_pandas(df) > ... > >>> > >>> def float64_add(table): > ... builder = gandiva.TreeExprBuilder() > ... node_a = builder.make_field(table.schema.field_by_name("col0")) > ... node_b = builder.make_field(table.schema.field_by_name("col1")) > ... sum = builder.make_function(b"add", [node_a, node_b], pa.float64()) > ... field_result = pa.field("c", pa.float64()) > ... expr = builder.make_expression(sum, field_result) > ... projector = gandiva.make_projector(table.schema, [expr], > pa.default_memory_pool()) > ... return projector > ... > >>> projector = float64_add(table) > >>> projector.evaluate(table.to_batches()[0]) > [1] 36393 segmentation fault python{code} > It is because there is an integer overflow in Gandiva: > [https://github.com/apache/arrow/blob/1a6545aa51f5f41f0233ee0a11ef87d21127c5ed/cpp/src/gandiva/projector.cc#L141] > It should be `int64_t` instead of `int`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3698) Segmentation fault when using large table in Gandiva
Siyuan Zhuang created ARROW-3698: Summary: Segmentation fault when using large table in Gandiva Key: ARROW-3698 URL: https://issues.apache.org/jira/browse/ARROW-3698 Project: Apache Arrow Issue Type: Bug Components: C++, Gandiva Reporter: Siyuan Zhuang {code} >>> import pyarrow as pa Registry has 519 pre-compiled functions >>> import pandas as pd >>> import numpy as np >>> import pyarrow.gandiva as gandiva >>> import timeit >>> >>> from matplotlib import pyplot as plt >>> for scale in range(25, 26): ... frame_data = 1.0 * np.random.randint(0, 100, size=(2**scale, 2)) ... df = pd.DataFrame(frame_data).add_prefix("col") ... table = pa.Table.from_pandas(df) ... >>> >>> def float64_add(table): ... builder = gandiva.TreeExprBuilder() ... node_a = builder.make_field(table.schema.field_by_name("col0")) ... node_b = builder.make_field(table.schema.field_by_name("col1")) ... sum = builder.make_function(b"add", [node_a, node_b], pa.float64()) ... field_result = pa.field("c", pa.float64()) ... expr = builder.make_expression(sum, field_result) ... projector = gandiva.make_projector(table.schema, [expr], pa.default_memory_pool()) ... return projector ... >>> projector = float64_add(table) >>> projector.evaluate(table.to_batches()[0]) [1] 36393 segmentation fault python{code} It is because there is an integer overflow in Gandiva: [https://github.com/apache/arrow/blob/1a6545aa51f5f41f0233ee0a11ef87d21127c5ed/cpp/src/gandiva/projector.cc#L141] It should be `int64_t` instead of `int`. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
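The overflow in the size computation is easy to reproduce arithmetically. A hedged illustration in Python, where `ctypes` stands in for the C++ integer widths; the function names are hypothetical, not Gandiva's:

```python
import ctypes

def buffer_size_int32(num_records: int, bytes_per_value: int) -> int:
    """Output-buffer size computed in a 32-bit signed int, as in the bug."""
    return ctypes.c_int32(num_records * bytes_per_value).value

def buffer_size_int64(num_records: int, bytes_per_value: int) -> int:
    """The proposed fix: compute the size in int64_t instead."""
    return ctypes.c_int64(num_records * bytes_per_value).value

# 2**28 float64 values need 2**31 bytes, one past INT32_MAX,
# so the 32-bit computation wraps to a negative size.
print(buffer_size_int32(2**28, 8))  # -2147483648
print(buffer_size_int64(2**28, 8))  # 2147483648
```

A negative (or wrapped) size passed to an allocator or pointer arithmetic explains the segmentation fault once the table crosses the 2 GiB boundary.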
[jira] [Created] (ARROW-3587) Efficient serialization for Arrow Objects (array, table, tensor, etc)
Siyuan Zhuang created ARROW-3587: Summary: Efficient serialization for Arrow Objects (array, table, tensor, etc) Key: ARROW-3587 URL: https://issues.apache.org/jira/browse/ARROW-3587 Project: Apache Arrow Issue Type: Improvement Components: C++, Plasma (C++), Python Reporter: Siyuan Zhuang Currently, Arrow seems to have poor serialization support for its own objects. For example, {code} import pyarrow arr = pyarrow.array([1, 2, 3, 4]) pyarrow.serialize(arr) {code} Traceback (most recent call last): File "", line 1, in File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize File "pyarrow/serialization.pxi", line 136, in pyarrow.lib.SerializationContext._serialize_callback pyarrow.lib.SerializationCallbackError: pyarrow does not know how to serialize objects of type . I am working on the Ray & Modin projects, using plasma to store Arrow objects. The lack of direct serialization support hurts performance, so I would like to submit a PR to fix this problem. I wonder whether that would be welcome, or whether someone else is already working on it? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2759) Timeout for `get_next_notification()` in Plasma
Siyuan Zhuang created ARROW-2759: Summary: Timeout for `get_next_notification()` in Plasma Key: ARROW-2759 URL: https://issues.apache.org/jira/browse/ARROW-2759 Project: Apache Arrow Issue Type: Improvement Components: Plasma (C++), Python Reporter: Siyuan Zhuang Currently, I am implementing an async interface for Ray. The implementation needs a message polling method like `get_next_notification`. Unfortunately, I find that `get_next_notification` in `https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` is blocking, which is an impediment to implementing async utilities. So I suggest adding a parameter like `timeout`. It could be done by operating on the underlying socket. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
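Operating on the underlying socket is straightforward once it is exposed. A minimal sketch using the Python stdlib — the function name echoes `get_next_notification` for readability, but this is an illustration against a plain socket, not the pyarrow API:

```python
import selectors
import socket

def get_next_notification(sock, timeout):
    """Wait up to `timeout` seconds for data on the notification
    socket; return the bytes read, or None on timeout."""
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ)
    try:
        if not sel.select(timeout):
            return None          # timed out: no notification pending
        return sock.recv(4096)   # a real client would frame its messages
    finally:
        sel.close()

# A local socket pair stands in for the plasma notification socket.
store_side, client_side = socket.socketpair()
print(get_next_notification(client_side, 0.05))  # None: nothing sent yet
store_side.sendall(b"object-created")
print(get_next_notification(client_side, 1.0))   # b'object-created'
```

The same readiness check also answers the status concern from the later revision of this issue: a readable socket that returns zero bytes indicates the peer closed the connection.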