[jira] [Created] (ARROW-8081) Fix memory size when using huge pages in plasma; other code cleanups

2020-03-11 Thread Siyuan Zhuang (Jira)
Siyuan Zhuang created ARROW-8081:


 Summary: Fix memory size when using huge pages in plasma; other 
code cleanups
 Key: ARROW-8081
 URL: https://issues.apache.org/jira/browse/ARROW-8081
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Plasma
Reporter: Siyuan Zhuang
Assignee: Siyuan Zhuang


In the original code, `PlasmaAllocator::SetFootprintLimit` is called before the 
huge-page handling, and that handling can change the memory limit. I also found 
some other untidy code that should be rewritten in a cleaner way.
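To make the ordering concrete, here is a minimal, self-contained C++ sketch; apart 
from `PlasmaAllocator::SetFootprintLimit`, every name here (the stub struct body, 
`AdjustForHugePages`, the constants) is a hypothetical stand-in rather than the 
actual plasma_store setup code:
{code}
#include <cstdint>

// Hypothetical stand-ins for the real plasma_store setup; only the call
// ordering matters here.
struct PlasmaAllocator {
  static void SetFootprintLimit(int64_t bytes) { /* record the limit */ }
};

int64_t AdjustForHugePages(int64_t bytes) {
  const int64_t kHugePageSize = 2 * 1024 * 1024;   // assume 2 MiB huge pages
  return (bytes / kHugePageSize) * kHugePageSize;  // e.g. round down to a page multiple
}

int main() {
  const bool hugepages_enabled = true;
  int64_t memory_bytes = 1000000000;
  if (hugepages_enabled) {
    // Huge-page handling may change the usable size, so it has to run first.
    memory_bytes = AdjustForHugePages(memory_bytes);
  }
  // The footprint limit is set only once the final size is known.
  PlasmaAllocator::SetFootprintLimit(memory_bytes);
  return 0;
}
{code}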



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8072) Add const constraint when parsing data

2020-03-11 Thread Siyuan Zhuang (Jira)
Siyuan Zhuang created ARROW-8072:


 Summary: Add const constraint when parsing data
 Key: ARROW-8072
 URL: https://issues.apache.org/jira/browse/ARROW-8072
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Siyuan Zhuang
Assignee: Siyuan Zhuang


The input data passed to the parsing functions in plasma's protocol.h/protocol.cc 
should be declared const.
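As a hedged illustration of the intent (the request type, function, and field 
below are hypothetical, not the actual protocol.h entry points), a const-correct 
parser only reads its input buffer:
{code}
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical request type and parser; the point is that the input buffer is
// passed as const because parsing never mutates it.
struct ExampleRequest {
  int64_t object_count;
};

bool ReadExampleRequest(const uint8_t* data, size_t size, ExampleRequest* out) {
  if (data == nullptr || out == nullptr || size < sizeof(int64_t)) {
    return false;
  }
  std::memcpy(&out->object_count, data, sizeof(int64_t));
  return true;
}
{code}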



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8030) Fix inconsistent comment style in plasma

2020-03-07 Thread Siyuan Zhuang (Jira)
Siyuan Zhuang created ARROW-8030:


 Summary: Fix inconsistent comment style in plasma
 Key: ARROW-8030
 URL: https://issues.apache.org/jira/browse/ARROW-8030
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++ - Plasma
Reporter: Siyuan Zhuang
Assignee: Siyuan Zhuang


The comments in plasma are a mixture of the '@param' and '\param' Doxygen styles. 
Reviewers asked me to unify the style while I was trying to add Windows support; 
I think it is better to address that in a separate PR.
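For reference, the two spellings in question look like this; the function shown 
is only an illustration, and both tags are equivalent in Doxygen, so the issue is 
purely about using one of them consistently:
{code}
// Backslash style:
/// \brief Seal an object in the object store.
/// \param object_id The ID of the object to seal.
/// \return The return status.

// At-sign style (equivalent in Doxygen):
/// @brief Seal an object in the object store.
/// @param object_id The ID of the object to seal.
/// @return The return status.
Status Seal(const ObjectID& object_id);  // illustrative declaration only
{code}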



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-02-17 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang reassigned ARROW-4418:


Assignee: Siyuan Zhuang

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Zhijun Fu
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move the plasma store from the current event loop to 
> boost::asio, both to modernize the code and, more importantly, to benefit from 
> the functionality provided by asio, which I think also opens up opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-02-17 Thread Siyuan Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770772#comment-16770772
 ] 

Siyuan Zhuang edited comment on ARROW-4418 at 2/18/19 4:41 AM:
---

I will try to create a PR soon. The only problem is that the standalone asio does 
not include boost::bind, which might be needed in our implementation (we have 
already used boost::bind in a similar case in the Ray project, and the official 
asio examples also use it). If a bind is unavoidable, I will try std::bind first.


was (Author: suquark):
I will try to create a PR soon. The only problem is that the standalone asio does 
not include boost::bind, which might be needed in our implementation (we have 
already used boost::bind in a similar case in the Ray project, and the official 
asio examples also use it).

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Zhijun Fu
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move the plasma store from the current event loop to 
> boost::asio, both to modernize the code and, more importantly, to benefit from 
> the functionality provided by asio, which I think also opens up opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-02-17 Thread Siyuan Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16770772#comment-16770772
 ] 

Siyuan Zhuang commented on ARROW-4418:
--

I will try to create a PR soon. The only problem is that the standalone asio does 
not include boost::bind, which might be needed in our implementation (we have 
already used boost::bind in a similar case in the Ray project, and the official 
asio examples also use it).
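For what it's worth, here is a minimal sketch of the std::bind route with 
standalone asio, using a local stream socket the way plasma does; the socket 
path and handler are illustrative only, not the eventual PR:
{code}
#include <asio.hpp>

#include <cstdio>
#include <functional>
#include <iostream>

using asio::local::stream_protocol;

// Accept handler with extra state attached via std::bind instead of boost::bind.
void HandleAccept(stream_protocol::socket* socket, const asio::error_code& error) {
  if (!error) {
    std::cout << "client connected" << std::endl;
    socket->close();  // demo only: drop the connection immediately
  }
}

int main() {
  const char* kSocketPath = "/tmp/plasma_asio_demo";
  std::remove(kSocketPath);  // remove a stale socket file, if any

  asio::io_context io;
  stream_protocol::acceptor acceptor(io, stream_protocol::endpoint(kSocketPath));
  stream_protocol::socket socket(io);

  // std::bind replaces boost::bind for binding the socket to the handler.
  acceptor.async_accept(
      socket, std::bind(&HandleAccept, &socket, std::placeholders::_1));

  io.run();  // returns once the single pending accept has completed
  return 0;
}
{code}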

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Plasma
>Reporter: Zhijun Fu
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move the plasma store from the current event loop to 
> boost::asio, both to modernize the code and, more importantly, to benefit from 
> the functionality provided by asio, which I think also opens up opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-30 Thread Siyuan Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16756914#comment-16756914
 ] 

Siyuan Zhuang commented on ARROW-4418:
--

[~zhijunfu] I wonder if we could just move "client_connection" from Ray to 
Arrow, so we can share some common functions.

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move the plasma store from the current event loop to 
> boost::asio, both to modernize the code and, more importantly, to benefit from 
> the functionality provided by asio, which I think also opens up opportunities 
> for performance improvement.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2759) Export notification socket of Plasma

2018-11-21 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang reassigned ARROW-2759:


Assignee: Siyuan Zhuang

> Export notification socket of Plasma
> 
>
> Key: ARROW-2759
> URL: https://issues.apache.org/jira/browse/ARROW-2759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, I am implementing an async interface for Ray. The implementation 
> needs some kind of message-polling method like `get_next_notification`.
> Unfortunately, I find `get_next_notification` in 
> `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` 
> blocking, which is an impediment to implementing async utilities. It is also 
> hard to check the status of the socket (it could be closed or broken). So I 
> suggest exporting the notification socket to allow more flexibility.
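As a rough illustration of the flexibility a raw socket gives, here is a hedged 
C++ sketch that waits on the notification fd with a timeout via poll(); it 
assumes the C++ client's `Subscribe(int*)` and `GetNotification(...)` signatures 
and the usual `plasma/client.h` header, so treat it as a sketch rather than a 
tested snippet:
{code}
#include <poll.h>

#include <cstdint>
#include <iostream>

#include "arrow/util/logging.h"  // ARROW_CHECK_OK
#include "plasma/client.h"       // Subscribe()/GetNotification() assumed from the C++ client API

int main() {
  plasma::PlasmaClient client;
  ARROW_CHECK_OK(client.Connect("/tmp/plasma", ""));

  int notification_fd = -1;
  ARROW_CHECK_OK(client.Subscribe(&notification_fd));

  // Wait for a sealed-object notification, but give up after 100 ms instead of
  // blocking indefinitely the way get_next_notification does today.
  pollfd pfd{notification_fd, POLLIN, 0};
  const int ready = poll(&pfd, 1, /*timeout_ms=*/100);
  if (ready > 0 && (pfd.revents & POLLIN)) {
    plasma::ObjectID object_id;
    int64_t data_size = 0;
    int64_t metadata_size = 0;
    ARROW_CHECK_OK(client.GetNotification(notification_fd, &object_id,
                                          &data_size, &metadata_size));
    std::cout << "object sealed, data size " << data_size << std::endl;
  } else {
    std::cout << "no notification within 100 ms" << std::endl;
  }
  ARROW_CHECK_OK(client.Disconnect());
  return 0;
}
{code}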



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2759) Export notification socket of Plasma

2018-11-21 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-2759:
-
Description: 
Currently, I am implementing an async interface for Ray. The implementation 
needs some kind of message-polling method like `get_next_notification`.
Unfortunately, I find `get_next_notification` in 
`[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` 
blocking, which is an impediment to implementing async utilities. It is also 
hard to check the status of the socket (it could be closed or broken). So I 
suggest exporting the notification socket to allow more flexibility.

  was:
Currently, I am implementing an async interface for Ray. The implementation 
needs some kind of message-polling method like `get_next_notification`.
Unfortunately, I find `get_next_notification` in 
`https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` 
blocking, which is an impediment to implementing async utilities. So I suggest 
adding a parameter like `timeout`. It could be done by operating on the 
underlying socket.



> Export notification socket of Plasma
> 
>
> Key: ARROW-2759
> URL: https://issues.apache.org/jira/browse/ARROW-2759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Priority: Major
>
> Currently, I am implementing an async interface for Ray. The implementation 
> needs some kind of message-polling method like `get_next_notification`.
> Unfortunately, I find `get_next_notification` in 
> `[https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx]` 
> blocking, which is an impediment to implementing async utilities. It is also 
> hard to check the status of the socket (it could be closed or broken). So I 
> suggest exporting the notification socket to allow more flexibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2759) Export notification socket of Plasma

2018-11-21 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-2759:
-
Summary: Export notification socket of Plasma  (was: Timeout for 
`get_next_notification()` in Plasma)

> Export notification socket of Plasma
> 
>
> Key: ARROW-2759
> URL: https://issues.apache.org/jira/browse/ARROW-2759
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Priority: Major
>
> Currently, I am implementing an async interface for Ray. The implementation 
> needs some kind of message-polling method like `get_next_notification`.
> Unfortunately, I find `get_next_notification` in 
> `https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` 
> blocking, which is an impediment to implementing async utilities. So I 
> suggest adding a parameter like `timeout`. It could be done by operating on 
> the underlying socket.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3799) Improve `make_in_expression`

2018-11-14 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-3799:
-
Description: The `make_in_expression` in gandiva was not implemented 
correctly. Although 
[ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] 
has fixed part of it, further improvement is still necessary. See 
`test_in_expr_todo` in 
[python/pyarrow/tests/test_gandiva.py|https://github.com/apache/arrow/pull/2936/files#diff-9ab0e0dc1f329321ff4555b043ee0f41]
 for details.  (was: The `make_in_expression` in gandiva was not implemented 
correctly. Although 
[ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] 
has fixed part of it, further improvement is still necessary.)

> Improve `make_in_expression`
> 
>
> Key: ARROW-3799
> URL: https://issues.apache.org/jira/browse/ARROW-3799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Gandiva
>Reporter: Siyuan Zhuang
>Priority: Major
>
> The `make_in_expression` in gandiva was not implemented correctly. Although 
> [ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] 
> has fixed part of it, further improvement is still necessary. See 
> `test_in_expr_todo` in 
> [python/pyarrow/tests/test_gandiva.py|https://github.com/apache/arrow/pull/2936/files#diff-9ab0e0dc1f329321ff4555b043ee0f41]
>  for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3799) Improve `make_in_expression`

2018-11-14 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3799:


 Summary: Improve `make_in_expression`
 Key: ARROW-3799
 URL: https://issues.apache.org/jira/browse/ARROW-3799
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Gandiva
Reporter: Siyuan Zhuang


The `make_in_expression` in gandiva was not implemented correctly. Although 
[ARROW-3751|https://issues.apache.org/jira/projects/ARROW/issues/ARROW-3751] 
has fixed part of it, further improvement is still necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3765) [Gandiva] Segfault when the validity bitmap has not been allocated

2018-11-14 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang reassigned ARROW-3765:


Assignee: Siyuan Zhuang

> [Gandiva] Segfault when the validity bitmap has not been allocated
> --
>
> Key: ARROW-3765
> URL: https://issues.apache.org/jira/browse/ARROW-3765
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Gandiva
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is because the `validity buffer` could be `None`:
> {code}
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10)))
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [None, <pyarrow.lib.Buffer object at 0x...>]
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0)
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [<pyarrow.lib.Buffer object at 0x...>, <pyarrow.lib.Buffer object at 0x11a2b3228>]{code}
> But Gandiva does not handle this case yet, so it ends up dereferencing a nullptr:
> {code}
> void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const 
> arrow::ArrayData& array_data, EvalBatch* eval_batch) { 
> int buffer_idx = 0;
> // TODO:  
> // - validity is optional 
> uint8_t* validity_buf = 
> const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data());
> eval_batch->SetBuffer(desc.validity_idx(), validity_buf);
> ++buffer_idx;
> {code}
>  
> Reproduce code:
> {code:java}
> frame_data = np.random.randint(0, 100, size=(2**22, 10))
> df = pd.DataFrame(frame_data)
> table = pa.Table.from_pandas(df)
> filt = ...  # Create any gandiva filter
> r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # 
> segfault{code}
>  Backtrace:
> {code:java}
> * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x10)
>  * frame #0: 0x0001060184fc 
> libarrow.12.dylib`arrow::Buffer::data(this=0x) const at 
> buffer.h:162
>  frame #1: 0x000106fbed78 
> libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8,
>  desc=0x00010101e138, array_data=0x00010061f8e8, 
> eval_batch=0x000100796848) at annotator.cc:65
>  frame #2: 0x000106fbf4ed 
> libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8,
>  record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94
>  frame #3: 0x0001071449b7 
> libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, 
> record_batch=0x0001007a45b8, output_vector=size=1) at 
> llvm_generator.cc:102
>  frame #4: 0x000107059a4f 
> libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, 
> batch=0x0001007a45b8, 
> out_selection=std::__1::shared_ptr::element_type @ 
> 0x0001007a43e8 strong=2 weak=1) at filter.cc:106
>  frame #5: 0x00010948e002 
> gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*,
>  _object*, _object*) + 1986
>  frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475
>  frame #7: 0x0001001d28ca Python`call_function + 602
>  frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>  frame #9: 0x0001001d3cf9 Python`fast_function + 569
>  frame #10: 0x0001001d2899 Python`call_function + 553
>  frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>  frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902
>  frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48
>  frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174
>  frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277
>  frame #16: 0x00010021ef46 Python`Py_Main + 3558
>  frame #17: 0x00010e08 Python`___lldb_unnamed_symbol1$$Python + 248
>  frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3765) [Gandiva] Segfault when the validity bitmap has not been allocated

2018-11-14 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-3765:
-
Summary: [Gandiva] Segfault when the validity bitmap has not been allocated 
 (was: [Gandiva] Segfault when validity bitmap has not been allocated)

> [Gandiva] Segfault when the validity bitmap has not been allocated
> --
>
> Key: ARROW-3765
> URL: https://issues.apache.org/jira/browse/ARROW-3765
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Gandiva
>Reporter: Siyuan Zhuang
>Priority: Major
>  Labels: pull-request-available
>
> This is because the `validity buffer` could be `None`:
> {code}
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10)))
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [None, <pyarrow.lib.Buffer object at 0x...>]
> >>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0)
> >>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
> [<pyarrow.lib.Buffer object at 0x...>, <pyarrow.lib.Buffer object at 0x11a2b3228>]{code}
> But Gandiva does not handle this case yet, so it ends up dereferencing a nullptr:
> {code}
> void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const 
> arrow::ArrayData& array_data, EvalBatch* eval_batch) { 
> int buffer_idx = 0;
> // TODO:  
> // - validity is optional 
> uint8_t* validity_buf = 
> const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data());
> eval_batch->SetBuffer(desc.validity_idx(), validity_buf);
> ++buffer_idx;
> {code}
>  
> Reproduce code:
> {code:java}
> frame_data = np.random.randint(0, 100, size=(2**22, 10))
> df = pd.DataFrame(frame_data)
> table = pa.Table.from_pandas(df)
> filt = ...  # Create any gandiva filter
> r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # 
> segfault{code}
>  Backtrace:
> {code:java}
> * thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
> (code=1, address=0x10)
>  * frame #0: 0x0001060184fc 
> libarrow.12.dylib`arrow::Buffer::data(this=0x) const at 
> buffer.h:162
>  frame #1: 0x000106fbed78 
> libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8,
>  desc=0x00010101e138, array_data=0x00010061f8e8, 
> eval_batch=0x000100796848) at annotator.cc:65
>  frame #2: 0x000106fbf4ed 
> libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8,
>  record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94
>  frame #3: 0x0001071449b7 
> libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, 
> record_batch=0x0001007a45b8, output_vector=size=1) at 
> llvm_generator.cc:102
>  frame #4: 0x000107059a4f 
> libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, 
> batch=0x0001007a45b8, 
> out_selection=std::__1::shared_ptr::element_type @ 
> 0x0001007a43e8 strong=2 weak=1) at filter.cc:106
>  frame #5: 0x00010948e002 
> gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*,
>  _object*, _object*) + 1986
>  frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475
>  frame #7: 0x0001001d28ca Python`call_function + 602
>  frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>  frame #9: 0x0001001d3cf9 Python`fast_function + 569
>  frame #10: 0x0001001d2899 Python`call_function + 553
>  frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
>  frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902
>  frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48
>  frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174
>  frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277
>  frame #16: 0x00010021ef46 Python`Py_Main + 3558
>  frame #17: 0x00010e08 Python`___lldb_unnamed_symbol1$$Python + 248
>  frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3765) Gandiva segfault when using int64 recordbatch as its input

2018-11-11 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3765:


 Summary: Gandiva segfault when using int64 recordbatch as its input
 Key: ARROW-3765
 URL: https://issues.apache.org/jira/browse/ARROW-3765
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Gandiva
Reporter: Siyuan Zhuang


This is because the `validity buffer` could be `None`:
{code}
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10)))
>>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
[None, <pyarrow.lib.Buffer object at 0x...>]
>>> df = pd.DataFrame(np.random.randint(0, 100, size=(2**12, 10))*1.0)
>>> pa.Table.from_pandas(df).to_batches()[0].column(0).buffers()
[<pyarrow.lib.Buffer object at 0x...>, <pyarrow.lib.Buffer object at 0x...>]{code}
But Gandiva does not handle this case yet, so it ends up dereferencing a nullptr:
{code}
void Annotator::PrepareBuffersForField(const FieldDescriptor& desc, const 
arrow::ArrayData& array_data, EvalBatch* eval_batch) { 
int buffer_idx = 0;
// TODO:  
// - validity is optional 
uint8_t* validity_buf = 
const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data());
eval_batch->SetBuffer(desc.validity_idx(), validity_buf);
++buffer_idx;
{code}
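A null check along these lines would avoid the dereference; this is a sketch of 
the idea only, not necessarily the fix that will land:
{code}
uint8_t* validity_buf = nullptr;
if (array_data.buffers[buffer_idx] != nullptr) {
  // Only dereference the validity buffer when it was actually allocated.
  validity_buf = const_cast<uint8_t*>(array_data.buffers[buffer_idx]->data());
}
eval_batch->SetBuffer(desc.validity_idx(), validity_buf);
++buffer_idx;
{code}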
 

Reproduce code:
{code:java}
frame_data = np.random.randint(0, 100, size=(2**22, 10))
df = pd.DataFrame(frame_data)
table = pa.Table.from_pandas(df)
filt = ...  # Create any gandiva filter
r = filt.evaluate(table.to_batches()[0], pa.default_memory_pool()) # 
segfault{code}
 Backtrace:
{code:java}
* thread #2, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS 
(code=1, address=0x10)
 * frame #0: 0x0001060184fc 
libarrow.12.dylib`arrow::Buffer::data(this=0x) const at 
buffer.h:162
 frame #1: 0x000106fbed78 
libgandiva.12.dylib`gandiva::Annotator::PrepareBuffersForField(this=0x000100624dc8,
 desc=0x00010101e138, array_data=0x00010061f8e8, 
eval_batch=0x000100796848) at annotator.cc:65
 frame #2: 0x000106fbf4ed 
libgandiva.12.dylib`gandiva::Annotator::PrepareEvalBatch(this=0x000100624dc8,
 record_batch=0x0001007a45b8, out_vector=size=1) at annotator.cc:94
 frame #3: 0x0001071449b7 
libgandiva.12.dylib`gandiva::LLVMGenerator::Execute(this=0x000100624da0, 
record_batch=0x0001007a45b8, output_vector=size=1) at llvm_generator.cc:102
 frame #4: 0x000107059a4f 
libgandiva.12.dylib`gandiva::Filter::Evaluate(this=0x00010079c668, 
batch=0x0001007a45b8, 
out_selection=std::__1::shared_ptr::element_type @ 
0x0001007a43e8 strong=2 weak=1) at filter.cc:106
 frame #5: 0x00010948e002 
gandiva.cpython-36m-darwin.so`__pyx_pw_7pyarrow_7gandiva_6Filter_3evaluate(_object*,
 _object*, _object*) + 1986
 frame #6: 0x000100140e8b Python`_PyCFunction_FastCallDict + 475
 frame #7: 0x0001001d28ca Python`call_function + 602
 frame #8: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
 frame #9: 0x0001001d3cf9 Python`fast_function + 569
 frame #10: 0x0001001d2899 Python`call_function + 553
 frame #11: 0x0001001cf798 Python`_PyEval_EvalFrameDefault + 24616
 frame #12: 0x0001001d34c6 Python`_PyEval_EvalCodeWithName + 2902
 frame #13: 0x0001001c96e0 Python`PyEval_EvalCode + 48
 frame #14: 0x0001002029ae Python`PyRun_FileExFlags + 174
 frame #15: 0x000100201f75 Python`PyRun_SimpleFileExFlags + 277
 frame #16: 0x00010021ef46 Python`Py_Main + 3558
 frame #17: 0x00010e08 Python`___lldb_unnamed_symbol1$$Python + 248
 frame #18: 0x7fff6ea72085 libdyld.dylib`start + 1{code}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3751) Add more cython bindings for gandiva

2018-11-10 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3751:


 Summary: Add more cython bindings for gandiva
 Key: ARROW-3751
 URL: https://issues.apache.org/jira/browse/ARROW-3751
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Gandiva, Python
Reporter: Siyuan Zhuang
Assignee: Siyuan Zhuang


Some cython bindings are missing from ARROW-3602 (MakeAdd, MakeOr, MakeIn). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3742) Fix pyarrow.types & gandiva cython bindings

2018-11-09 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-3742:
-
Component/s: Python
 Gandiva

> Fix pyarrow.types & gandiva cython bindings
> ---
>
> Key: ARROW-3742
> URL: https://issues.apache.org/jira/browse/ARROW-3742
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva, Python
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>
> 1. 'types.py' didn't export `_as_type`, causing failures in certain 
> cython/python combinations. I am surprised to see that the CI didn't fail.
> 2. After updating the gandiva cpp part (ARROW-3587), the cython bindings 
> (ARROW-3602) are no longer consistent with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3742) Fix pyarrow.types & gandiva cython bindings

2018-11-09 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-3742:
-
Description: 
1. 'types.py' didn't export `_as_type`, causing failures in certain 
cython/python combinations. I am surprised to see that the CI didn't fail.
2. After updating the gandiva cpp part (ARROW-3587), the cython bindings 
(ARROW-3602) are no longer consistent with it.

  was:After updating the gandiva cpp part (ARROW-3587), the cython bindings 
(ARROW-3602) are no longer consistent with it.

Summary: Fix pyarrow.types & gandiva cython bindings  (was: Fix gandiva 
cython bindings)

> Fix pyarrow.types & gandiva cython bindings
> ---
>
> Key: ARROW-3742
> URL: https://issues.apache.org/jira/browse/ARROW-3742
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>
> 1. 'types.py' didn't export `_as_type`, causing failures in certain 
> cython/python combinations. I am surprised to see that the CI didn't fail.
> 2. After updating the gandiva cpp part (ARROW-3587), the cython bindings 
> (ARROW-3602) are no longer consistent with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3742) Fix gandiva cython bindings

2018-11-09 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3742:


 Summary: Fix gandiva cython bindings
 Key: ARROW-3742
 URL: https://issues.apache.org/jira/browse/ARROW-3742
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siyuan Zhuang


After updating the gandiva cpp part (ARROW-3587), the cython bindings 
(ARROW-3602) are no longer consistent with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3742) Fix gandiva cython bindings

2018-11-09 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang reassigned ARROW-3742:


Assignee: Siyuan Zhuang

> Fix gandiva cython bindings
> ---
>
> Key: ARROW-3742
> URL: https://issues.apache.org/jira/browse/ARROW-3742
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Siyuan Zhuang
>Assignee: Siyuan Zhuang
>Priority: Major
>
> After updating the gandiva cpp part (ARROW-3587), the cython bindings 
> (ARROW-3602) are no longer consistent with it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3698) Segmentation fault when using a large table in Gandiva

2018-11-03 Thread Siyuan Zhuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siyuan Zhuang updated ARROW-3698:
-
Summary: Segmentation fault when using a large table in Gandiva  (was: 
Segmentation fault when using large table in Gandiva)

> Segmentation fault when using a large table in Gandiva
> --
>
> Key: ARROW-3698
> URL: https://issues.apache.org/jira/browse/ARROW-3698
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Gandiva
>Reporter: Siyuan Zhuang
>Priority: Major
>
> {code}
> >>> import pyarrow as pa
> Registry has 519 pre-compiled functions
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow.gandiva as gandiva
> >>> import timeit
> >>>
> >>> from matplotlib import pyplot as plt
> >>> for scale in range(25, 26):
> ... frame_data = 1.0 * np.random.randint(0, 100, size=(2**scale, 2))
> ... df = pd.DataFrame(frame_data).add_prefix("col")
> ... table = pa.Table.from_pandas(df)
> ...
> >>>
> >>> def float64_add(table):
> ... builder = gandiva.TreeExprBuilder()
> ... node_a = builder.make_field(table.schema.field_by_name("col0"))
> ... node_b = builder.make_field(table.schema.field_by_name("col1"))
> ... sum = builder.make_function(b"add", [node_a, node_b], pa.float64())
> ... field_result = pa.field("c", pa.float64())
> ... expr = builder.make_expression(sum, field_result)
> ... projector = gandiva.make_projector(table.schema, [expr], 
> pa.default_memory_pool())
> ... return projector
> ...
> >>> projector = float64_add(table)
> >>> projector.evaluate(table.to_batches()[0])
> [1] 36393 segmentation fault python{code}
> This is caused by an integer overflow in Gandiva:
> [https://github.com/apache/arrow/blob/1a6545aa51f5f41f0233ee0a11ef87d21127c5ed/cpp/src/gandiva/projector.cc#L141]
> The type there should be `int64_t` instead of `int`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3698) Segmentation fault when using large table in Gandiva

2018-11-03 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3698:


 Summary: Segmentation fault when using large table in Gandiva
 Key: ARROW-3698
 URL: https://issues.apache.org/jira/browse/ARROW-3698
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Gandiva
Reporter: Siyuan Zhuang


{code}
>>> import pyarrow as pa
Registry has 519 pre-compiled functions
>>> import pandas as pd
>>> import numpy as np
>>> import pyarrow.gandiva as gandiva
>>> import timeit
>>>
>>> from matplotlib import pyplot as plt
>>> for scale in range(25, 26):
... frame_data = 1.0 * np.random.randint(0, 100, size=(2**scale, 2))
... df = pd.DataFrame(frame_data).add_prefix("col")
... table = pa.Table.from_pandas(df)
...
>>>
>>> def float64_add(table):
... builder = gandiva.TreeExprBuilder()
... node_a = builder.make_field(table.schema.field_by_name("col0"))
... node_b = builder.make_field(table.schema.field_by_name("col1"))
... sum = builder.make_function(b"add", [node_a, node_b], pa.float64())
... field_result = pa.field("c", pa.float64())
... expr = builder.make_expression(sum, field_result)
... projector = gandiva.make_projector(table.schema, [expr], 
pa.default_memory_pool())
... return projector
...
>>> projector = float64_add(table)
>>> projector.evaluate(table.to_batches()[0])
[1] 36393 segmentation fault python{code}
This is caused by an integer overflow in Gandiva:
[https://github.com/apache/arrow/blob/1a6545aa51f5f41f0233ee0a11ef87d21127c5ed/cpp/src/gandiva/projector.cc#L141]

The type there should be `int64_t` instead of `int`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3587) Efficient serialization for Arrow Objects (array, table, tensor, etc)

2018-10-22 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3587:


 Summary: Efficient serialization for Arrow Objects (array, table, 
tensor, etc)
 Key: ARROW-3587
 URL: https://issues.apache.org/jira/browse/ARROW-3587
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Plasma (C++), Python
Reporter: Siyuan Zhuang


Currently, Arrow seems to have poor serialization support for its own objects.

For example,
  
{code}
import pyarrow 
arr = pyarrow.array([1, 2, 3, 4]) 
pyarrow.serialize(arr)
{code}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize
 File "pyarrow/serialization.pxi", line 136, in 
pyarrow.lib.SerializationContext._serialize_callback
 pyarrow.lib.SerializationCallbackError: pyarrow does not know how to serialize 
objects of type <class 'pyarrow.lib.Int64Array'>.

I am working on the Ray and Modin projects, which use plasma to store Arrow 
objects. The lack of direct serialization support hurts performance, so I would 
like to submit a PR to fix this problem.
I wonder whether that would be welcome, or whether someone else is already 
working on it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2759) Timeout for `get_next_notification()` in Plasma

2018-06-28 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-2759:


 Summary: Timeout for `get_next_notification()` in Plasma
 Key: ARROW-2759
 URL: https://issues.apache.org/jira/browse/ARROW-2759
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++), Python
Reporter: Siyuan Zhuang


Currently, I am implementing an async interface for Ray. The implementation 
needs some kind of message-polling method like `get_next_notification`.
Unfortunately, I find `get_next_notification` in 
`https://github.com/apache/arrow/blob/master/python/pyarrow/_plasma.pyx` 
blocking, which is an impediment to implementing async utilities. So I suggest 
adding a parameter like `timeout`. It could be done by operating on the 
underlying socket.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)