[jira] [Updated] (ARROW-11354) [Rust] Speed-up casts of dates and times

2021-01-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11354:
---
Labels: pull-request-available  (was: )

> [Rust] Speed-up casts of dates and times
> 
>
> Key: ARROW-11354
> URL: https://issues.apache.org/jira/browse/ARROW-11354
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11354) [Rust] Speed-up casts of dates and times

2021-01-22 Thread Jira
Jorge Leitão created ARROW-11354:


 Summary: [Rust] Speed-up casts of dates and times
 Key: ARROW-11354
 URL: https://issues.apache.org/jira/browse/ARROW-11354
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Jorge Leitão
Assignee: Jorge Leitão








[jira] [Commented] (ARROW-10605) [C++][Gandiva] Support Decimal256 type in gandiva computation.

2021-01-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270546#comment-17270546
 ] 

Micah Kornfield commented on ARROW-10605:
-

Sorry for the delayed reply. [~klykov] this mostly involves looking into which 
operations Gandiva currently supports and replicating them for Decimal256 
(there are still some basic math/logic operations that aren't supported). 
There are probably a few sub-work items here.

> [C++][Gandiva] Support Decimal256 type in gandiva computation.
> --
>
> Key: ARROW-10605
> URL: https://issues.apache.org/jira/browse/ARROW-10605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: Micah Kornfield
>Priority: Major
>
> There might be a lot of work here, so sub-jiras might be added once scope is 
> determined.





[jira] [Commented] (ARROW-11179) [Format] Make comments in fb files friendly to rust doc

2021-01-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270543#comment-17270543
 ] 

Micah Kornfield commented on ARROW-11179:
-

It is OK with me if you want to open a PR. I don't think we rely on the 
formatting for other languages, but I could be wrong. [~uwe] or [~apitrou] 
might know better.

> [Format] Make comments in fb files friendly to rust doc
> ---
>
> Key: ARROW-11179
> URL: https://issues.apache.org/jira/browse/ARROW-11179
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Qingyou Meng
>Priority: Trivial
> Attachments: format-0ed34c83.patch
>
>
> Currently, comments in flatbuffer files are directly copied to rust and c++ 
> source codes.
> That's great but there are some problems with cargo doc:
>  * array element abc[1] or link label [smith2017knl] causes `broken intra doc 
> links` warning
>  * example code/figure blocks are flatten into one line, see example [arrow 
> 2.0.0 
> doc|https://docs.rs/arrow/2.0.0/arrow/ipc/gen/SparseTensor/struct.SparseTensorIndexCSF.html#method.indptrType]
> After flatc generating, those ipc files have to be updated manually to fix 
> the above problems.
> So I'm suggesting updating the flatbuffer comments to address these problems:
>  * Escape inline code with ` and `
>  * Escape text blocks with ```text and ```
>  * Change [smith2017knl]: http://shaden.io/pub-files/smith2017knl.pdf to 
> [smith2017knl](http://shaden.io/pub-files/smith2017knl.pdf)
> The attachment file *format-0ed34c83.patch* was created 
> with the git command
> {code:java}
> git diff 0ed34c83 -p format > format-0ed34c83.patch{code}
> where *0ed34c83* is this commit: 
> {noformat}
> 0ed34c83c ARROW-9400: [Python] Do not depend on conda-forge static libraries 
> in Windows wheel builds{noformat}
> [~emkornfield] may I create a pull request for this?
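The suggested escaping rules could be sketched as a small post-processing pass over the generated doc comments. This is a hypothetical helper for illustration, not the attached patch; the function name and regexes are assumptions:

```python
import re

def escape_flatbuffer_comment(comment: str) -> str:
    """Rewrite a flatbuffer doc comment so cargo doc renders it cleanly.

    - Turns reference-style labels like "[smith2017knl]: <url>" into
      inline Markdown links, so rustdoc does not emit `broken intra doc
      links` warnings.
    - Wraps bare bracketed tokens such as abc[1] in backticks.
    """
    # [label]: http://url  ->  [label](http://url)
    comment = re.sub(
        r"\[([^\]\s]+)\]:\s*(https?://\S+)",
        r"[\1](\2)",
        comment,
    )
    # foo[1] -> `foo[1]` (skip anything already wrapped in backticks)
    comment = re.sub(r"(?<!`)\b(\w+\[\d+\])(?!`)", r"`\1`", comment)
    return comment

print(escape_flatbuffer_comment(
    "See [smith2017knl]: http://shaden.io/pub-files/smith2017knl.pdf for abc[1]"
))
```

A pass like this would still have to run after flatc, which is why patching the comments in the .fbs files themselves (as the attachment does) avoids the manual fix-up step.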





[jira] [Created] (ARROW-11353) [C++][Python][Parquet] We should allow for overriding to large types by providing a schema

2021-01-22 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-11353:
---

 Summary: [C++][Python][Parquet] We should allow for overriding to 
large types by providing a schema
 Key: ARROW-11353
 URL: https://issues.apache.org/jira/browse/ARROW-11353
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Micah Kornfield


The following shouldn't throw:

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.dataset as ds
>>> pa.__version__
'2.0.0'
>>> schema = pa.schema([pa.field("utf8", pa.utf8())])
>>> table = pa.Table.from_pydict({"utf8": ["foo", "bar"]}, schema)
>>> pq.write_table(table, "/tmp/example.parquet")
>>> large_schema = pa.schema([pa.field("utf8", pa.large_utf8())])
>>> ds.dataset("/tmp/example.parquet", schema=large_schema,
... format="parquet").to_table()
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/_dataset.pyx", line 405, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2262, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: fields had matching names but differing types.
From: utf8: string To: utf8: large_string
{code}





[jira] [Updated] (ARROW-11066) [Java] Is there a bug in flight AddWritableBuffer

2021-01-22 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-11066:
-
Fix Version/s: 4.0.0

> [Java] Is there a bug in flight AddWritableBuffer
> -
>
> Key: ARROW-11066
> URL: https://issues.apache.org/jira/browse/ARROW-11066
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 1.0.0
>Reporter: Kangping Huang
>Assignee: David Li
>Priority: Major
> Fix For: 4.0.0
>
>
> [https://github.com/apache/arrow/blob/9bab12f03ac486bb8270f031b83f0a0411766b3e/java/flight/flight-core/src/main/java/org/apache/arrow/flight/grpc/AddWritableBuffer.java#L94]
> buf.readBytes(stream, buf.readableBytes());
> Is this line redundant?
> In my perf.svg, this line copies the data from buf to the OutputStream, which 
> prevents zero-copy.





[jira] [Commented] (ARROW-11066) [Java] Is there a bug in flight AddWritableBuffer

2021-01-22 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270500#comment-17270500
 ] 

David Li commented on ARROW-11066:
--

Indeed, you seem to be right, and furthermore, that line seems to defeat the 
optimization the method purports to implement in the first place! The error 
seems to have been present since the original Flight implementation. I'd 
surmise it was a bad refactor or a half-completed attempt at making 
{{AddWritableBuffer#add}} handle the fallback path for you.

> [Java] Is there a bug in flight AddWritableBuffer
> -
>
> Key: ARROW-11066
> URL: https://issues.apache.org/jira/browse/ARROW-11066
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 1.0.0
>Reporter: Kangping Huang
>Priority: Major
>
> [https://github.com/apache/arrow/blob/9bab12f03ac486bb8270f031b83f0a0411766b3e/java/flight/flight-core/src/main/java/org/apache/arrow/flight/grpc/AddWritableBuffer.java#L94]
> buf.readBytes(stream, buf.readableBytes());
> Is this line redundant?
> In my perf.svg, this line copies the data from buf to the OutputStream, which 
> prevents zero-copy.





[jira] [Assigned] (ARROW-11066) [Java] Is there a bug in flight AddWritableBuffer

2021-01-22 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-11066:


Assignee: David Li

> [Java] Is there a bug in flight AddWritableBuffer
> -
>
> Key: ARROW-11066
> URL: https://issues.apache.org/jira/browse/ARROW-11066
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Java
>Affects Versions: 1.0.0
>Reporter: Kangping Huang
>Assignee: David Li
>Priority: Major
>
> [https://github.com/apache/arrow/blob/9bab12f03ac486bb8270f031b83f0a0411766b3e/java/flight/flight-core/src/main/java/org/apache/arrow/flight/grpc/AddWritableBuffer.java#L94]
> buf.readBytes(stream, buf.readableBytes());
> Is this line redundant?
> In my perf.svg, this line copies the data from buf to the OutputStream, which 
> prevents zero-copy.





[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270480#comment-17270480
 ] 

Brian Hulette commented on ARROW-11347:
---

Ah, you mean when accessing a Row, e.g. {{table.get(0)}}.

I _think_ the choice of Map was for code-reuse between Struct vectors and Map 
vectors ([~paul.e.taylor] wrote this, he could comment more certainly). Note I 
also added the ability to access the fields in a row view "by attribute" in 
Python parlance in https://github.com/apache/arrow/pull/2197. So if you have a 
table with a "foo" field you can access it in a Row view with either 
table.get(0)["foo"] or table.get(0).foo. I'm pretty sure I actually added that 
in response to a perf measurement from Jeff back in 2018.
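The dual access pattern Brian describes (row["foo"] and row.foo resolving to the same field) can be sketched in a few lines. This is a Python analogy only, with hypothetical names; the actual Arrow JS row views are implemented in TypeScript:

```python
class Row:
    """Minimal sketch of a row view supporting both row["foo"] and row.foo."""

    def __init__(self, fields, values):
        # Column name -> value mapping backing both lookup styles.
        self._data = dict(zip(fields, values))

    def __getitem__(self, name):
        # Keyed access: row["foo"]
        return self._data[name]

    def __getattr__(self, name):
        # Attribute access: row.foo (only called when normal lookup fails).
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name) from None

row = Row(["foo", "bar"], [1, 2])
print(row["foo"], row.foo)  # both resolve to the same underlying value
```

In JavaScript the attribute style requires either defineProperty or a Proxy on the row object, which is where the performance question in ARROW-11351 comes from.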

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 





[jira] [Created] (ARROW-11352) Implementation status?

2021-01-22 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11352:
--

 Summary: Implementation status?
 Key: ARROW-11352
 URL: https://issues.apache.org/jira/browse/ARROW-11352
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Dominik Moritz


https://arrow.apache.org/docs/status.html says that the Rust implementation 
doesn't support anything except CSV R/W. Is that true? 





[jira] [Updated] (ARROW-11351) Reconsider proxy objects instead of defineProperty

2021-01-22 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-11351:
---
Description: I was wondering why Arrow uses Proxy objects instead of 
defineProperty, which was a bit faster in the experiments at 
https://observablehq.com/@jheer/from-apache-arrow-to-javascript-objects. I 
don't know whether a change makes sense but I would love to know the design 
rationale since I couldn't find anything in the issues or on GitHub about it.   
(was: Related to https://issues.apache.org/jira/browse/ARROW-11347

I was wondering why Arrow uses Proxy objects instead of defineProperty, which 
was a bit faster in the experiments at 
https://observablehq.com/@jheer/from-apache-arrow-to-javascript-objects. I 
don't know whether a change makes sense but I would love to know the design 
rationale since I couldn't find anything in the issues or on GitHub about it. )

> Reconsider proxy objects instead of defineProperty
> --
>
> Key: ARROW-11351
> URL: https://issues.apache.org/jira/browse/ARROW-11351
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>
> I was wondering why Arrow uses Proxy objects instead of defineProperty, which 
> was a bit faster in the experiments at 
> https://observablehq.com/@jheer/from-apache-arrow-to-javascript-objects. I 
> don't know whether a change makes sense but I would love to know the design 
> rationale since I couldn't find anything in the issues or on GitHub about it. 





[jira] [Created] (ARROW-11351) Reconsider proxy objects instead of defineProperty

2021-01-22 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11351:
--

 Summary: Reconsider proxy objects instead of defineProperty
 Key: ARROW-11351
 URL: https://issues.apache.org/jira/browse/ARROW-11351
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Dominik Moritz


Related to https://issues.apache.org/jira/browse/ARROW-11347

I was wondering why Arrow uses Proxy objects instead of defineProperty, which 
was a bit faster in the experiments at 
https://observablehq.com/@jheer/from-apache-arrow-to-javascript-objects. I 
don't know whether a change makes sense but I would love to know the design 
rationale since I couldn't find anything in the issues or on GitHub about it. 





[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Dominik Moritz (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270450#comment-17270450
 ] 

Dominik Moritz commented on ARROW-11347:


Yes, when accessing an element of an array (e.g. after `toArray()`). 

Before making the change, someone needs to look closer into the performance 
benefits and usability. Jeff created his own parser for Arquero, which can make 
some simplifying assumptions and is less general but also almost twice as fast 
(https://github.com/uwdata/arquero-arrow/tree/main/perf). It would be good to 
figure out why. I think Maps are generally nicer than Objects for users so 
maybe it's worth the performance difference. It would be great if you could 
share how you decided on Maps in the first place. 

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 





[jira] [Updated] (ARROW-11350) [C++] Bump dependency versions

2021-01-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11350:
---
Labels: pull-request-available  (was: )

> [C++] Bump dependency versions
> --
>
> Key: ARROW-11350
> URL: https://issues.apache.org/jira/browse/ARROW-11350
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-11350) [C++] Bump dependency versions

2021-01-22 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-11350:
---

 Summary: [C++] Bump dependency versions
 Key: ARROW-11350
 URL: https://issues.apache.org/jira/browse/ARROW-11350
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 4.0.0








[jira] [Commented] (ARROW-11075) [Python] Getting reference not found with OCR enabled pyarrow

2021-01-22 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270419#comment-17270419
 ] 

Uwe Korn commented on ARROW-11075:
--

The latest ORC release supports shared linkage, and the conda toolchain has 
been reworked to link dynamically: 
https://github.com/conda-forge/arrow-cpp-feedstock/blob/1.0.x/recipe/meta.yaml. 
The major issue here is probably that ORC 0.6.2 is built as part of the Arrow 
thirdparty toolchain but the 0.6.6 headers are used during the build. I'm not 
sure how this links, but that feels like the most likely issue to me.

> [Python] Getting reference not found with OCR enabled pyarrow
> -
>
> Key: ARROW-11075
> URL: https://issues.apache.org/jira/browse/ARROW-11075
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: PPC64LE
>Reporter: Kandarpa
>Priority: Major
> Attachments: arrow_cpp_build.log, arrow_python_build.log, 
> conda_list.txt
>
>
> Generated pyarrow with ORC enabled on Power using the following steps:
> {code:java}
> export ARROW_HOME=$CONDA_PREFIX
> mkdir cpp/build
> cd cpp/build
> cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
>       -DCMAKE_INSTALL_LIBDIR=lib \
>       -DARROW_WITH_BZ2=ON \
>       -DARROW_WITH_ZLIB=ON \
>       -DARROW_WITH_ZSTD=ON \
>       -DARROW_WITH_LZ4=ON \
>       -DARROW_WITH_SNAPPY=ON \
>       -DARROW_WITH_BROTLI=ON \
>       -DARROW_PARQUET=ON \
>       -DARROW_PYTHON=ON \
>       -DARROW_BUILD_TESTS=ON \
>       -DARROW_CUDA=ON \
>       -DCUDA_CUDA_LIBRARY=/usr/local/cuda/lib64/stubs/libcuda.so \
>       -DARROW_ORC=ON \
>   ..
> make -j
> make install
> cd ../../python
> python setup.py build_ext --bundle-arrow-cpp --with-orc --with-cuda 
> --with-parquet bdist_wheel
> {code}
>  
>  
> With the generated whl package installed, ran CUDF tests and observed 
> following error:
> *_ERROR cudf - ImportError: 
> /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so:
>  undefined symbol: _ZN5arrow8adapters3orc13OR..._*
> Please find the whole error log below:
> 
>  ERRORS 
> 
>   ERROR 
> collecting test session 
> _
>  /conda/envs/rmm/lib/python3.7/importlib/__init__.py:127: in import_module
>      return _bootstrap._gcd_import(name[level:], package, level)
>  :1006: in _gcd_import
>      ???
>  :983: in _find_and_load
>      ???
>  :953: in _find_and_load_unlocked
>      ???
>  :219: in _call_with_frames_removed
>      ???
>  :1006: in _gcd_import
>      ???
>  :983: in _find_and_load
>      ???
>  :953: in _find_and_load_unlocked
>      ???
>  :219: in _call_with_frames_removed
>      ???
>  :1006: in _gcd_import
>      ???
>  :983: in _find_and_load
>      ???
>  :967: in _find_and_load_unlocked
>      ???
>  :677: in _load_unlocked
>      ???
>  :728: in exec_module
>      ???
>  :219: in _call_with_frames_removed
>      ???
>  cudf/cudf/__init__.py:60: in 
>      from cudf.io import (
>  cudf/cudf/io/__init__.py:8: in 
>      from cudf.io.orc import read_orc, read_orc_metadata, to_orc
>  cudf/cudf/io/orc.py:6: in 
>      from pyarrow import orc as orc
>  /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/orc.py:24: in 
>      import pyarrow._orc as _orc
>  {color:#de350b}E   ImportError: 
> /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so:
>  undefined symbol: 
> _ZN5arrow8adapters3orc13ORCFileReader4ReadEPSt10shared_ptrINS_5TableEE{color}
>  === 
> short test summary info 
> 
>  *_ERROR cudf - ImportError: 
> /conda/envs/rmm/lib/python3.7/site-packages/pyarrow/_orc.cpython-37m-powerpc64le-linux-gnu.so:
>  undefined symbol: _ZN5arrow8adapters3orc13OR..._*
>   
> Interrupted: 1 error during collection 
> 
>  === 
> 1 error in 1.54s 
> ===
>  Fatal Python error: Segmentation fault





[jira] [Commented] (ARROW-11075) [Python] Getting reference not found with OCR enabled pyarrow

2021-01-22 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270414#comment-17270414
 ] 

Wes McKinney commented on ARROW-11075:
--

ORC is supposed to be statically linked, so this would be unusual.

[~kandarpamalipeddi] can you show what ORC symbols are in your shared library?

{code}
nm -D /path/to/libarrow.so | c++filt | grep orc
{code}

Also check which libarrow.so the pyarrow libraries are linking against, if you 
can (with {{ldd}}).

> [Python] Getting reference not found with OCR enabled pyarrow
> -
>
> Key: ARROW-11075
> URL: https://issues.apache.org/jira/browse/ARROW-11075
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: PPC64LE
>Reporter: Kandarpa
>Priority: Major
> Attachments: arrow_cpp_build.log, arrow_python_build.log, 
> conda_list.txt
>
>





[jira] [Commented] (ARROW-11075) [Python] Getting reference not found with OCR enabled pyarrow

2021-01-22 Thread Uwe Korn (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270411#comment-17270411
 ] 

Uwe Korn commented on ARROW-11075:
--

I would guess that the issue is related to {{-DORC_SOURCE=BUNDLED}} and having 
{{orc}} installed as a conda package at the same time. Can you remove the 
{{-DORC_SOURCE=BUNDLED}} flag and do a clean build? Do you know why you have 
set that?

> [Python] Getting reference not found with OCR enabled pyarrow
> -
>
> Key: ARROW-11075
> URL: https://issues.apache.org/jira/browse/ARROW-11075
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.1
> Environment: PPC64LE
>Reporter: Kandarpa
>Priority: Major
> Attachments: arrow_cpp_build.log, arrow_python_build.log, 
> conda_list.txt
>
>





[jira] [Resolved] (ARROW-11299) [Python] build warning in python

2021-01-22 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-11299.
--
Fix Version/s: (was: 4.0.0)
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 9274
[https://github.com/apache/arrow/pull/9274]

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Many warnings about compute kernel options appear when building Arrow Python.
> Removing the line below suppresses the warnings:
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the structure 
> non-standard-layout, so the offsetof macro cannot be used safely.
> As the function options are straightforward, it looks like the destructor is 
> not necessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
>  In file included from 
> /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
>  from /usr/include/signal.h:303,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:5,
>  from 
> /home/cyb/arrow/cpp/release/_install/include/arrow/python/numpy_interop.h:41,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/helpers.h:27,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/api.h:24,
>  from /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:696:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp: In function 
> ‘int __Pyx_modinit_type_init_code()’:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26066:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__FilterOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__FilterOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__FilterOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26082:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__TakeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__TakeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__TakeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26130:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__MinMaxOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__MinMaxOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__MinMaxOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26146:148: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CountOptions’ is undefined [-Winvalid-offsetof]
>  _type_7pyarrow_8_compute__CountOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CountOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^ 
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26162:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__ModeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__ModeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__ModeOptions, __pyx_base.__pyx_ba

[jira] [Updated] (ARROW-8919) [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke

2021-01-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8919:
--
Labels: compute pull-request-available  (was: compute)

> [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that 
> may require implicit casts to invoke
> --
>
> Key: ARROW-8919
> URL: https://issues.apache.org/jira/browse/ARROW-8919
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Wes McKinney
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently we have "DispatchExact" which requires an exact match of input 
> types. "DispatchBest" would permit kernel selection with implicit casts 
> required. Since multiple kernels may be valid when allowing implicit casts, 
> we will need to break ties by estimating the "cost" of the implicit casts. 
> For example, casting int8 to int32 is "less expensive" than implicitly 
> casting to int64
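The "cost" idea above can be sketched as a toy model (made-up type widths and kernel signatures in Python, not the actual C++ dispatch code):

```python
# Sketch of cost-based kernel dispatch with implicit casts.
# The widths and kernel signatures below are illustrative, not Arrow's real tables.
WIDTH = {"int8": 8, "int16": 16, "int32": 32, "int64": 64}

def cast_cost(from_type, to_type):
    """Implicit widening cast cost; None means the cast is not allowed."""
    if from_type == to_type:
        return 0
    if WIDTH.get(to_type, 0) > WIDTH.get(from_type, 0):
        # Widening is allowed; wider targets cost more (int8->int32 < int8->int64).
        return WIDTH[to_type] - WIDTH[from_type]
    return None  # narrowing casts are not implicit

def dispatch_best(kernels, arg_types):
    """Pick the kernel signature reachable by the cheapest implicit casts."""
    best, best_cost = None, None
    for sig in kernels:
        costs = [cast_cost(a, p) for a, p in zip(arg_types, sig)]
        if any(c is None for c in costs):
            continue  # this kernel cannot be reached by implicit casts
        total = sum(costs)
        if best_cost is None or total < best_cost:
            best, best_cost = sig, total
    return best

kernels = [("int32", "int32"), ("int64", "int64")]
print(dispatch_best(kernels, ("int8", "int32")))  # ('int32', 'int32')
```

With both an int32 and an int64 kernel available, the (int8, int32) arguments resolve to the int32 kernel because widening int8 to int32 is cheaper than widening both arguments to int64.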



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11349) [Rust] Add from_iter_values to create arrays from T instead of Option

2021-01-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11349:
---
Labels: pull-request-available  (was: )

> [Rust] Add from_iter_values to create arrays from T instead of Option
> 
>
> Key: ARROW-11349
> URL: https://issues.apache.org/jira/browse/ARROW-11349
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In that case we don't have to allocate a null buffer / set bits, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11349) [Rust] Add from_iter_values to create arrays from T instead of Option

2021-01-22 Thread Jira
Daniël Heres created ARROW-11349:


 Summary: [Rust] Add from_iter_values to create arrays from T 
instead of Option
 Key: ARROW-11349
 URL: https://issues.apache.org/jira/browse/ARROW-11349
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Daniël Heres
Assignee: Daniël Heres


In that case we don't have to allocate a null buffer / set bits, etc.
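The saving can be illustrated with a toy builder (plain Python standing in for the Rust buffers; the dict layout and function names are illustrative, not the arrow crate's actual types):

```python
# Toy model of the two construction paths. Real arrow-rs buffers are byte
# buffers; plain Python lists and an int bitmap stand in here.

def from_iter(options):
    """Build from Option<T>-like values: must allocate and fill a validity bitmap."""
    values, bitmap, null_count = [], 0, 0
    for i, v in enumerate(options):
        if v is None:
            values.append(0)          # slot is undefined but must exist
            null_count += 1
        else:
            values.append(v)
            bitmap |= 1 << i          # mark slot i as valid
    return {"values": values, "validity": bitmap, "null_count": null_count}

def from_iter_values(values):
    """Build from plain T values: no validity bitmap to allocate or set."""
    return {"values": list(values), "validity": None, "null_count": 0}

print(from_iter([1, None, 3]))       # validity bitmap needed
print(from_iter_values([1, 2, 3]))   # no bitmap work at all
```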



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11348) [C++] Add pretty printing support for gdb

2021-01-22 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-11348:

Description: Parsing the GDB output is error prone and can take 
considerable time.  Also, some information is difficult or non-intuitive to get 
to (e.g. the name of a data type).  We should add [GDB pretty 
printers|https://sourceware.org/gdb/onlinedocs/gdb/Pretty-Printing-API.html#Pretty-Printing-API]
 to improve the debug workflow for developers.  This could assist not just 
Arrow developers but also developers using the Arrow C++ libs.  (was: Parsing 
the GDB output is error prone and can take considerable time.  Also, some 
information is difficult or non-intuitive to get to (e.g. the name of a data 
type).  We should add GDB pretty printers[1] to improve the debug workflow for 
developers.  This could assist not just Arrow developers but also developers 
using the Arrow C++ libs.)

> [C++] Add pretty printing support for gdb
> -
>
> Key: ARROW-11348
> URL: https://issues.apache.org/jira/browse/ARROW-11348
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Weston Pace
>Priority: Major
>
> Parsing the GDB output is error prone and can take considerable time.  Also, 
> some information is difficult or non-intuitive to get to (e.g. the name of a 
> data type).  We should add [GDB pretty 
> printers|https://sourceware.org/gdb/onlinedocs/gdb/Pretty-Printing-API.html#Pretty-Printing-API]
>  to improve the debug workflow for developers.  This could assist not just 
> Arrow developers but also developers using the Arrow C++ libs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11348) [C++] Add pretty printing support for gdb

2021-01-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270269#comment-17270269
 ] 

Weston Pace edited comment on ARROW-11348 at 1/22/21, 4:52 PM:
---

I've made a first pass at this which improves things considerably.  I will keep 
improving upon this and adding new information / features as I debug and 
hopefully these scripts will be robust enough to merge at some point.  If 
anyone is interested in helping develop with these they are located here: 
[https://github.com/westonpace/arrow/tree/feature/gdb-pretty-printers]

To use the pretty printers you will need something like this in your .gdbinit:

 
{code:java}
python
from pathlib import Path

def load_file(gdb_dir, filename):
  fullpath = str(gdb_dir / filename)
  print(f'Activating pretty printer {fullpath}')
  gdb.execute(f'source {fullpath}')

dir_ = Path('.').absolute()
while True:
  gdb_dir = dir_ / 'dev' / 'gdb'
  if gdb_dir.exists():
    print(f'Activating pretty printers found at {gdb_dir}')
    load_file(gdb_dir, 'find_stl.py')
    load_file(gdb_dir, 'pretty_printers.py')
    load_file(gdb_dir, 'commands.py')
    break
  if dir_ == Path('/'):
    print(f'Could not locate pretty printers')
    break
  dir_ = dir_.parent
end

{code}
This script will find the printers as long as you are in the arrow directory or 
a subdirectory when you run gdb.

 

 

There is also a utility that tries to find the STL pretty printers.  These are 
located via conda, so you will need to be in a conda environment with the 
gxx_linux-64 package installed for them to be found.

 

There is also a utility command `parr` which takes an "expression" and will 
attempt to use one of the arrow pretty print utilities to print the result of 
the expression.
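For anyone curious about the shape of such a printer: gdb pretty printers are plain Python classes with a `to_string` method. A minimal sketch (the member names `size_` and `capacity_` are assumptions for illustration, not necessarily what the linked branch uses):

```python
class BufferPrinter:
    """Minimal gdb pretty-printer shape for arrow::Buffer.

    Inside gdb, __init__ receives a gdb.Value and fields are read by
    subscripting; any mapping with the same keys works, so the class can
    be exercised outside gdb too.
    """
    def __init__(self, val):
        self.val = val

    def to_string(self):
        return "Buffer (size={} capacity={})".format(
            self.val["size_"], self.val["capacity_"])

# Registration inside gdb would look roughly like this (needs the gdb module):
# def lookup(val):
#     if str(val.type) == "arrow::Buffer":
#         return BufferPrinter(val)
# gdb.pretty_printers.append(lookup)

print(BufferPrinter({"size_": 36, "capacity_": 64}).to_string())
# Buffer (size=36 capacity=64)
```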

 

Example commands:

 
{code:java}
p *by.data_
p (*(by.data())).child_data
p *((*(by.data())).child_data[0])
p (*((*(by.data())).child_data[0])).buffers
p *((*((*(by.data())).child_data[0])).buffers[1])
p *((*((*(by.data())).child_data[0])).buffers[2])
parr by
{code}
Output with pretty printers:

 

 
{code:java}
(gdb) $1 = ArrayData (type=DT("struct") length=8 offset=0 
buffers=0x55715f68 child_data=0x55715f80)
(gdb) $2 = std::vector of length 2, capacity 2 = 
{std::shared_ptr (use count 2, weak count 0) = {get() = 
0x55713ff0}, std::shared_ptr (use count 2, weak count 0) 
= {
get() = 0x55714070}}
(gdb) $3 = ArrayData (type=DT("string") length=8 offset=0 
buffers=0x55714018 child_data=0x55714030)
(gdb) $4 = std::vector of length 3, capacity 3 = 
{std::shared_ptr (empty) = {get() = 0x0}, 
std::shared_ptr (use count 1, weak count 0) = {get() = 
0x556a5b00}, 
  std::shared_ptr (use count 1, weak count 0) = {get() = 
0x556eee30}}
(gdb) $5 = Buffer (size=36 capacity=64 data_addr=0x74209400 "") = {\x00, 
\x00, \x00, \x00, \x02, \x00, \x00, \x00, \x04, \x00, \x00, \x00, \x07, \x00, \x00, \x00, \x09, 
\x00, \x00, \x00, \x0c, \x00, \x00, \x00, \x0e, \x00, \x00, \x00, \x10, 
  \x00, \x00, \x00, \x13, \x00, \x00, \x00}
(gdb) $6 = Buffer (size=19 capacity=64 data_addr=0x74209080 
"exexwhyexwhyexexwhy") = {\x65, \x78, \x65, \x78, \x77, \x68, \x79, \x65, \x78, \x77, 
\x68, \x79, \x65, \x78, \x65, \x78, \x77, \x68, \x79}
(gdb)   -- is_valid: all not null
  -- child 0 type: string
[
  "ex",
  "ex",
  "why",
  "ex",
  "why",
  "ex",
  "ex",
  "why"
]
  -- child 1 type: int32
[
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  1
]
{code}
Output without pretty printers:

 

 
{code:java}
(gdb) $1 = (std::__shared_ptr_access::element_type &) @0x55715f10: {
  type = {> = 
{> = {}, _M_ptr = 
0x556eee70, _M_refcount = {_M_pi = 0x556eee60}}, }, 
length = 8, null_count = {> = {static _S_alignment = 
8, _M_i = 0}, }, 
  offset = 0, buffers = {, 
std::allocator > >> = {
  _M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, , 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x557150b0, _M_finish = 0x557150c0, 
  _M_end_of_storage = 0x557150c0}, }}, }, 
  child_data = {, 
std::allocator > >> = {
  _M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, 
, 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x55714580, _M_finish = 0x557145a0, 
  _M_end_of_storage = 0x557145a0}, }}, }, 
  dictionary = {> = {> = {}, 
  _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, }}
(gdb) $2 = {, 
std::allocator > >> = {
_M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, 
, 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x55714580, _M_finish = 0x557145a0, 
_M_end_of_storage = 0x557145a0}, }}, }
(gdb) $3 = (std::__shared_ptr_access::element_type &) @0x55713fc0: {
  type = {> = 
{> = {}, 
  _M_ptr = 0x556a20f0, _M_refcount = {_M_pi = 0x556a20e0}}, }, length = 8, null_count = {> = {static 
_S_alignment = 8, _M_i = 0}, }, 
  offset = 0, buffers = {

[jira] [Commented] (ARROW-11348) [C++] Add pretty printing support for gdb

2021-01-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270269#comment-17270269
 ] 

Weston Pace commented on ARROW-11348:
-

I've made a first pass at this which improves things considerably.  
I will keep improving upon this and adding new information / features as I 
debug and hopefully these scripts will be robust enough to merge at some point. 
 If anyone is interested in helping develop with these they are located here: 
[https://github.com/westonpace/arrow/tree/feature/gdb-pretty-printers]

To use the pretty printers you will need something like this in your .gdbinit:

 
{code:java}
python
from pathlib import Path

def load_file(gdb_dir, filename):
  fullpath = str(gdb_dir / filename)
  print(f'Activating pretty printer {fullpath}')
  gdb.execute(f'source {fullpath}')

dir_ = Path('.').absolute()
while True:
  gdb_dir = dir_ / 'dev' / 'gdb'
  if gdb_dir.exists():
    print(f'Activating pretty printers found at {gdb_dir}')
    load_file(gdb_dir, 'find_stl.py')
    load_file(gdb_dir, 'pretty_printers.py')
    load_file(gdb_dir, 'commands.py')
    break
  if dir_ == Path('/'):
    print(f'Could not locate pretty printers')
    break
  dir_ = dir_.parent
end

{code}
This script will find the printers as long as you are in the arrow directory or 
a subdirectory when you run gdb.

 

 

There is also a utility that tries to find the STL pretty printers.  These are 
located via conda, so you will need to be in a conda environment with the 
gxx_linux-64 package installed for them to be found.

 

There is also a utility command `parr` which takes an "expression" and will 
attempt to use one of the arrow pretty print utilities to print the result of 
the expression.

 

Example commands:

 
{code:java}
p *by.data_
p (*(by.data())).child_data
p *((*(by.data())).child_data[0])
p (*((*(by.data())).child_data[0])).buffers
p *((*((*(by.data())).child_data[0])).buffers[1])
p *((*((*(by.data())).child_data[0])).buffers[2])
parr by
{code}
Output with pretty printers:

 

 
{code:java}
(gdb) $1 = ArrayData (type=DT("struct") length=8 offset=0 
buffers=0x55715f68 child_data=0x55715f80)
(gdb) $2 = std::vector of length 2, capacity 2 = 
{std::shared_ptr (use count 2, weak count 0) = {get() = 
0x55713ff0}, std::shared_ptr (use count 2, weak count 0) 
= {
get() = 0x55714070}}
(gdb) $3 = ArrayData (type=DT("string") length=8 offset=0 
buffers=0x55714018 child_data=0x55714030)
(gdb) $4 = std::vector of length 3, capacity 3 = 
{std::shared_ptr (empty) = {get() = 0x0}, 
std::shared_ptr (use count 1, weak count 0) = {get() = 
0x556a5b00}, 
  std::shared_ptr (use count 1, weak count 0) = {get() = 
0x556eee30}}
(gdb) $5 = Buffer (size=36 capacity=64 data_addr=0x74209400 "") = {\x00, 
\x00, \x00, \x00, \x02, \x00, \x00, \x00, \x04, \x00, \x00, \x00, \x07, \x00, \x00, \x00, \x09, 
\x00, \x00, \x00, \x0c, \x00, \x00, \x00, \x0e, \x00, \x00, \x00, \x10, 
  \x00, \x00, \x00, \x13, \x00, \x00, \x00}
(gdb) $6 = Buffer (size=19 capacity=64 data_addr=0x74209080 
"exexwhyexwhyexexwhy") = {\x65, \x78, \x65, \x78, \x77, \x68, \x79, \x65, \x78, \x77, 
\x68, \x79, \x65, \x78, \x65, \x78, \x77, \x68, \x79}
(gdb)   -- is_valid: all not null
  -- child 0 type: string
[
  "ex",
  "ex",
  "why",
  "ex",
  "why",
  "ex",
  "ex",
  "why"
]
  -- child 1 type: int32
[
  0,
  0,
  0,
  1,
  0,
  1,
  0,
  1
]
{code}
Output without pretty printers:

 

 
{code:java}
(gdb) $1 = (std::__shared_ptr_access::element_type &) @0x55715f10: {
  type = {> = 
{> = {}, _M_ptr = 
0x556eee70, _M_refcount = {_M_pi = 0x556eee60}}, }, 
length = 8, null_count = {> = {static _S_alignment = 
8, _M_i = 0}, }, 
  offset = 0, buffers = {, 
std::allocator > >> = {
  _M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, , 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x557150b0, _M_finish = 0x557150c0, 
  _M_end_of_storage = 0x557150c0}, }}, }, 
  child_data = {, 
std::allocator > >> = {
  _M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, 
, 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x55714580, _M_finish = 0x557145a0, 
  _M_end_of_storage = 0x557145a0}, }}, }, 
  dictionary = {> = {> = {}, 
  _M_ptr = 0x0, _M_refcount = {_M_pi = 0x0}}, }}
(gdb) $2 = {, 
std::allocator > >> = {
_M_impl = { >> = 
{<__gnu_cxx::new_allocator >> = {}, }, 
, 
std::allocator > >::_Vector_impl_data> = 
{_M_start = 0x55714580, _M_finish = 0x557145a0, 
_M_end_of_storage = 0x557145a0}, }}, }
(gdb) $3 = (std::__shared_ptr_access::element_type &) @0x55713fc0: {
  type = {> = 
{> = {}, 
  _M_ptr = 0x556a20f0, _M_refcount = {_M_pi = 0x556a20e0}}, }, length = 8, null_count = {> = {static 
_S_alignment = 8, _M_i = 0}, }, 
  offset = 0, buffers = {, 
std::allocator > >> = {
  _M_imp

[jira] [Created] (ARROW-11348) [C++] Add pretty printing support for gdb

2021-01-22 Thread Weston Pace (Jira)
Weston Pace created ARROW-11348:
---

 Summary: [C++] Add pretty printing support for gdb
 Key: ARROW-11348
 URL: https://issues.apache.org/jira/browse/ARROW-11348
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Weston Pace


Parsing the GDB output is error prone and can take considerable time.  Also, 
some information is difficult or non-intuitive to get to (e.g. the name of a 
data type).  We should add GDB pretty printers[1] to improve the debug workflow 
for developers.  This could assist not just Arrow developers but also 
developers using the Arrow C++ libs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Brian Hulette (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270257#comment-17270257
 ] 

Brian Hulette commented on ARROW-11347:
---

Can you clarify where it is that we use Maps that you think should change? Is 
it when accessing an element of a Map-typed array?

I'd be open to changing it but we'd need to consider that this would be a 
breaking API change. I suppose this is technically OK since all releases are 
major but it may be inconvenient for users.

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11344) [Python] Data of struct fields are out-of-order in parquet files created by the write_table() method

2021-01-22 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270222#comment-17270222
 ] 

Weston Pace commented on ARROW-11344:
-

Thank you for creating such a detailed test case.  I have run your test against 
pyarrow 2.0.0 and I can confirm I get the same results that you do.  Luckily, 
when I ran your test against the latest code I did not see this error and I 
confirmed that the full_name.name column aligned with the fruit_name column.  
We have recently fixed issues related to structs such as ARROW-10493 and my 
assumption is that you encountered one of those.

We are on the verge of releasing 3.0.0.  There is an RC available at 
https://bintray.com/apache/arrow/python-rc/3.0.0-rc2#files/python-rc/3.0.0-rc2 
if you would like to test this behavior out yourself sooner.

 

> [Python] Data of struct fields are out-of-order in parquet files created by 
> the write_table() method
> 
>
> Key: ARROW-11344
> URL: https://issues.apache.org/jira/browse/ARROW-11344
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Chen Ming
>Priority: Major
> Attachments: test_struct.csv, test_struct_200.parquet, 
> test_struct_200.py, test_struct_200_flat.parquet, test_struct_200_flat.py
>
>
> Hi,
> We found an out-of-order issue with the 'struct' data type recently and would 
> like to know if you can help root-cause it.
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": 
> x["file_name"]}, axis=1)
> my_df = df.drop(['file_package', 'file_name'], axis=1)
> file_fields = [('package', pa.string()), ('name', pa.string()),]
> my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
>pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(my_df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200.parquet')
> {code}
> The above code (attached as test_struct_200.py) runs with the following 
> python packages:
> {code:java}
> Pandas Version = 1.1.3
> PyArrow Version = 2.0.0
> {code}
> Then I use parquet-tools (1.11.1) to read the file, but get the following 
> output:
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = apple.csv
> fruit_name = strawberry
> {code}
> (BTW, you can also view the parquet file with 
> [http://parquet-viewer-online.com/])
> The output is supposed to be (refer to test_struct.csv) :
> {code:java}
> $ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
> ...
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> full_name:
> .package = fruit.zip
> .name = strawberry.csv
> fruit_name = strawberry
> {code}
> As a comparison, the following code (attached as test_struct_200_flat.py) 
> would generate a parquet file with the same data of test_struct.csv:
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.read_csv('./test_struct.csv')
> print(df.dtypes)
> my_schema = pa.schema([pa.field('file_package', pa.string()),
>pa.field('file_name', pa.string()),
>pa.field('fruit_name', pa.string())])
> my_table = pa.Table.from_pandas(df, schema = my_schema)
> print('Table schema:')
> print(my_table.schema)
> pq.write_table(my_table, './test_struct_200_flat.parquet')
> {code}
> I also attached the two parquet files for your references.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11332) [Rust] Use MutableBuffer in take_string instead of Vec

2021-01-22 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11332.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9279
[https://github.com/apache/arrow/pull/9279]

> [Rust] Use MutableBuffer in take_string instead of Vec
> --
>
> Key: ARROW-11332
> URL: https://issues.apache.org/jira/browse/ARROW-11332
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Dominik Moritz (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270206#comment-17270206
 ] 

Dominik Moritz commented on ARROW-11347:


I wonder what [~bhulette] and [~paultaylor] say about this since they 
originally decided to go with Map. 

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270182#comment-17270182
 ] 

Neville Dipale edited comment on ARROW-11347 at 1/22/21, 2:56 PM:
--

Hi [~domoritz]

The performance difference looks solid.

I've tried that notebook on Chrome vs Safari (Macbook Air M1).
Object: ~700ms vs ~2'600ms
Map: ~5'300ms vs ~4'800ms

On Chrome vs Firefox (Ryzen desktop)
Object: ~700ms vs ~600ms
Map: ~3'800ms vs ~ 11'600ms

Do you think that there'd be a downside to using Object, in the ergonomics of 
the APIs?

I haven't used the JS implementation enough to have an opinion, hence I'm 
asking.

If you can open a PR with the change, we can review it and get it merged.

Thanks


was (Author: nevi_me):
Hi [~domoritz]

The performance difference looks solid. Do you think that there'd be a downside 
to using Object, in the ergonomics of the APIs?

I haven't used the JS implementation enough to have an opinion, hence I'm 
asking.

If you can open a PR with the change, we can review it and get it merged.

Thanks

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Neville Dipale (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270182#comment-17270182
 ] 

Neville Dipale commented on ARROW-11347:


Hi [~domoritz]

The performance difference looks solid. Do you think that there'd be a downside 
to using Object, in the ergonomics of the APIs?

I haven't used the JS implementation enough to have an opinion, hence I'm 
asking.

If you can open a PR with the change, we can review it and get it merged.

Thanks

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11347) [JavaScript] Consider Objects instead of Maps

2021-01-22 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale updated ARROW-11347:
---
Summary: [JavaScript] Consider Objects instead of Maps  (was: Consider 
Objects instead of Maps)

> [JavaScript] Consider Objects instead of Maps
> -
>
> Key: ARROW-11347
> URL: https://issues.apache.org/jira/browse/ARROW-11347
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Priority: Major
>  Labels: performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A quick experiment 
> (https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
> show that object accesses are a lot faster than map accesses. Would it make 
> sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11299) [Python] build warning in python

2021-01-22 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-11299:
-
Affects Version/s: 2.0.0

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Many warnings about compute kernel options when building Arrow python.
> Removing below line suppresses the warnings.
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the struct 
> non-standard-layout, so the offsetof macro cannot be used on it safely.
> As the function options are straightforward, the destructor looks unnecessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
>  In file included from 
> /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
>  from /usr/include/signal.h:303,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
>  from 

[jira] [Assigned] (ARROW-11299) [Python] build warning in python

2021-01-22 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-11299:


Assignee: Yibo Cai

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Many warnings about compute kernel options appear when building Arrow Python.
> Removing the line below suppresses the warnings.
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the structure 
> non-standard-layout (not C compatible), so the offsetof macro cannot be used 
> safely. As function options are straightforward, it looks like the destructor 
> is not necessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]
>  In file included from 
> /usr/include/x86_64-linux-gnu/bits/types/stack_t.h:23:0, 
>  from /usr/include/signal.h:303,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/npy_interrupt.h:84,
>  from 
> /home/cyb/archery/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:5,
>  from 
> /home/cyb/arrow/cpp/release/_install/include/arrow/python/numpy_interop.h:41,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/helpers.h:27,
>  from /home/cyb/arrow/cpp/release/_install/include/arrow/python/api.h:24,
>  from /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:696:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp: In function 
> ‘int __Pyx_modinit_type_init_code()’:
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26034:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CastOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__CastOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CastOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26066:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__FilterOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__FilterOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__FilterOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26082:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__TakeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__TakeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__TakeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26130:150: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__MinMaxOptions’ is undefined 
> [-Winvalid-offsetof]
>  type_7pyarrow_8_compute__MinMaxOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__MinMaxOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26146:148: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__CountOptions’ is undefined [-Winvalid-offsetof]
>  _type_7pyarrow_8_compute__CountOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__CountOptions, 
> __pyx_base.__pyx_base.__weakref__);
>  ^ 
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26162:146: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__ModeOptions’ is undefined [-Winvalid-offsetof]
>  x_type_7pyarrow_8_compute__ModeOptions.tp_weaklistoffset = offsetof(struct 
> __pyx_obj_7pyarrow_8_compute__ModeOptions, __pyx_base.__pyx_base.__weakref__);
>  ^
>  /home/cyb/arrow/python/build/temp.linux-x86_64-3.6/_compute.cpp:26210:154: 
> warning: offsetof within non-standard-layout type 
> ‘__pyx_obj_7pyarrow_8_compute__VarianceOptions’

[jira] [Updated] (ARROW-11299) [Python] build warning in python

2021-01-22 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-11299:
-
Fix Version/s: 4.0.0

> [Python] build warning in python
> 
>
> Key: ARROW-11299
> URL: https://issues.apache.org/jira/browse/ARROW-11299
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Many warnings about compute kernel options appear when building Arrow Python.
> Removing the line below suppresses the warnings.
> https://github.com/apache/arrow/blob/140135908c5d131ceac31a0e529f9b9b763b1106/cpp/src/arrow/compute/function.h#L45
> I think the reason is that the virtual destructor makes the structure 
> non-standard-layout (not C compatible), so the offsetof macro cannot be used 
> safely. As function options are straightforward, it looks like the destructor 
> is not necessary.
> [~bkietz]
> *Steps to reproduce*
> build arrow cpp
> {code:bash}
>  ~/arrow/cpp/release $ cmake -GNinja -DCMAKE_BUILD_TYPE=Release 
> -DARROW_COMPUTE=ON -DARROW_BUILD_TESTS=ON 
> -DCMAKE_INSTALL_PREFIX=$(pwd)/_install -DCMAKE_INSTALL_LIBDIR=lib 
> -DARROW_PYTHON=ON -DCMAKE_CXX_COMPILER=/usr/bin/clang++-9 
> -DCMAKE_C_COMPILER=/usr/bin/clang-9 ..
> ~/arrow/cpp/release $ ninja install
> {code}
> build arrow python
> {code:bash}
>  ~/arrow/python $ python --version
>  Python 3.6.9
> ~/arrow/python $ python setup.py build_ext --inplace
>  ..
>  [ 93%] Building CXX object CMakeFiles/_compute.dir/_compute.cpp.o [27/1691]

[jira] [Comment Edited] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8

2021-01-22 Thread Maximilian Speicher (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270118#comment-17270118
 ] 

Maximilian Speicher edited comment on ARROW-9745 at 1/22/21, 1:13 PM:
--

For me the same error persists even after doing a clean reinstall of Python and 
recreating the venv. It somehow seems to be related to snappy compression, as 
it works fine when using gzip as the compression.

*Update:* Running the same code on the same machine inside of WSL works just 
fine.


was (Author: mspeicher):
For me the same error persists even after doing a clean reinstall of Python and 
recreating the venv. It somehow seems to be related to snappy compression, as 
it works fine when using gzip as the compression.

> [Python] Reading Parquet file crashes on windows - python3.8
> 
>
> Key: ARROW-9745
> URL: https://issues.apache.org/jira/browse/ARROW-9745
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Installation done with pip:
> pip install pyarrow pandas
> for python3.8 on a Windows machine running Windows 10 Enterprise (v1809). The 
> resulting wheel is:
> pyarrow-1.0.0-cp38-cp38-win_amd64.whl 
>Reporter: Dylan Modesitt
>Priority: Major
>  Labels: parquet
>
> {code:java}
> import pandas as pd
> import numpy as np
> df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), 
> columns=list("1234"))
> df.to_parquet("the.parquet")
> pd.read_parquet("the.parquet")  # fails here
> {code}
> fails with
> {code:java}
> Process finished with exit code -1073741795 (0xC000001D)
> {code}
> {code:java}
> pyarrow.parquet.read_pandas(pyarrow.BufferReader(...)).to_pandas()
> {code}
> also fails with the same exit message. Has this been seen before? Is there a 
> known solution? I experienced the same issue installing the pyarrow nightlies 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9745) [Python] Reading Parquet file crashes on windows - python3.8

2021-01-22 Thread Maximilian Speicher (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17270118#comment-17270118
 ] 

Maximilian Speicher commented on ARROW-9745:


For me the same error persists even after doing a clean reinstall of Python and 
recreating the venv. It somehow seems to be related to snappy compression, as 
it works fine when using gzip as the compression.

> [Python] Reading Parquet file crashes on windows - python3.8
> 
>
> Key: ARROW-9745
> URL: https://issues.apache.org/jira/browse/ARROW-9745
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Installation done with pip:
> pip install pyarrow pandas
> for python3.8 on a Windows machine running Windows 10 Enterprise (v1809). The 
> resulting wheel is:
> pyarrow-1.0.0-cp38-cp38-win_amd64.whl 
>Reporter: Dylan Modesitt
>Priority: Major
>  Labels: parquet
>
> {code:java}
> import pandas as pd
> import numpy as np
> df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), 
> columns=list("1234"))
> df.to_parquet("the.parquet")
> pd.read_parquet("the.parquet")  # fails here
> {code}
> fails with
> {code:java}
> Process finished with exit code -1073741795 (0xC000001D)
> {code}
> {code:java}
> pyarrow.parquet.read_pandas(pyarrow.BufferReader(...)).to_pandas()
> {code}
> also fails with the same exit message. Has this been seen before? Is there a 
> known solution? I experienced the same issue installing the pyarrow nightlies 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-10766) [Rust] Compute nested definition and repetition for list arrays

2021-01-22 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-10766.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9240
[https://github.com/apache/arrow/pull/9240]

> [Rust] Compute nested definition and repetition for list arrays
> ---
>
> Key: ARROW-10766
> URL: https://issues.apache.org/jira/browse/ARROW-10766
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> This extends on ARROW-9728 by only focusing on list array repetition and 
> definition levels



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11343) [DataFusion] Simplified example

2021-01-22 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11343.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9290
[https://github.com/apache/arrow/pull/9290]

> [DataFusion] Simplified example
> ---
>
> Key: ARROW-11343
> URL: https://issues.apache.org/jira/browse/ARROW-11343
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11346) [C++][Compute] Implement quantile kernel benchmark

2021-01-22 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11346:
---
Labels: pull-request-available  (was: )

> [C++][Compute] Implement quantile kernel benchmark
> --
>
> Key: ARROW-11346
> URL: https://issues.apache.org/jira/browse/ARROW-11346
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11347) Consider Objects instead of Maps

2021-01-22 Thread Dominik Moritz (Jira)
Dominik Moritz created ARROW-11347:
--

 Summary: Consider Objects instead of Maps
 Key: ARROW-11347
 URL: https://issues.apache.org/jira/browse/ARROW-11347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Reporter: Dominik Moritz


A quick experiment 
(https://observablehq.com/@domoritz/performance-of-maps-vs-objects) seems to 
show that object accesses are a lot faster than map accesses. Would it make 
sense to switch to objects in the row API to improve performance? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11346) [C++][Compute] Implement quantile kernel benchmark

2021-01-22 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-11346:


 Summary: [C++][Compute] Implement quantile kernel benchmark
 Key: ARROW-11346
 URL: https://issues.apache.org/jira/browse/ARROW-11346
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yibo Cai
Assignee: Yibo Cai






--
This message was sent by Atlassian Jira
(v8.3.4#803005)