[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-25 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938260#comment-16938260
 ] 

Suvayu Ali commented on ARROW-4930:
---

Hi [~kou], I'm a bit out of my depth here, but here's my attempt: 
https://github.com/apache/arrow/pull/5504

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: pull-request-available, setup.py
> Fix For: 2.0.0
>
> Attachments: FindArrow.cmake.patch, FindParquet.cmake.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries are 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-23 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936383#comment-16936383
 ] 

Suvayu Ali commented on ARROW-4930:
---

Hi [~apitrou], I have had limited success so far.

(I was working off master; {{git describe}} says 
{{apache-arrow-0.14.0-584-g176adf5a0}}.)

This is what I found:

1. {{setup.py}} assumes the library directory is {{$ARROW_HOME/lib}} when setting 
{{PKG_CONFIG_PATH}} in the environment (line 253). I believe this is a bit of a 
hack, which the author also mentions in ARROW-1090, the issue that tracked that 
change. The proper resolution should be somewhere in the cmake scripts.
 2. I successfully detected {{libarrow}} with the attached patch 
[^FindArrow.cmake.patch].
 3. However, I then failed to detect {{libparquet}}. On further investigation I 
found (AFAIU) that even though {{FindParquet.cmake}} sets {{ARROW_HOME}}, it is 
never actually used; it relies on {{PARQUET_HOME}} instead. Since my CMake-fu is 
a bit weak, I worked up a similar patch, [^FindParquet.cmake.patch], and set 
{{export PARQUET_HOME=$ARROW_HOME}} in the terminal (see the environment sketch 
below). This allowed the compilation to succeed.
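
As a rough sketch (not a verbatim copy of my shell history), the environment 
assumed by the commands below looks something like this; the install prefix is 
just an example:
{code}
# example install prefix; adjust to your own location
export ARROW_HOME=/opt/arrow
# needed because FindParquet.cmake uses PARQUET_HOME rather than ARROW_HOME (point 3 above)
export PARQUET_HOME=$ARROW_HOME
# commonly needed so the freshly built C++ libraries are found at build/run time
export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH
{code}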

The compilation commands I used for C++ and Python are:
{code:java}
$ cmake -G Ninja -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DARROW_FLIGHT=ON -DARROW_GANDIVA=ON -DARROW_ORC=ON \
  -DARROW_PARQUET=ON -DPYTHON_EXECUTABLE=/usr/bin/python3.7m \
  -DARROW_PYTHON=ON -DARROW_PLASMA=ON \
  -DARROW_BUILD_TESTS=ON -DLLVM_DIR=/usr/lib64/llvm7.0 ..
$ python3 setup.py build_ext --cmake-generator Ninja --inplace
{code}
I then tried to run the python tests with {{pytest-3 pyarrow}}. The summary was:
{quote}5 failed, 1411 passed, 59 skipped, 4 xfailed, 29 warnings in 28.30 
seconds
{quote}
The failures all look like setup-related issues: not being able to import a 
module, not being able to start plasma, and so on.

I'll investigate this further, but my take is that the cmake scripts don't have 
_one way_ of detecting the libraries, which makes it very difficult to 
configure things properly from setup.py.

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
> Fix For: 2.0.0
>
> Attachments: FindArrow.cmake.patch, FindParquet.cmake.patch
>
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries are 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-23 Thread Suvayu Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-4930:
--
Attachment: FindParquet.cmake.patch
FindArrow.cmake.patch

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
> Fix For: 2.0.0
>
> Attachments: FindArrow.cmake.patch, FindParquet.cmake.patch
>
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries are 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-4930) [Python] Remove LIBDIR assumptions in Python build

2019-09-18 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933039#comment-16933039
 ] 

Suvayu Ali commented on ARROW-4930:
---

I have some time this weekend, I'll have a go at it.

> [Python] Remove LIBDIR assumptions in Python build
> --
>
> Key: ARROW-4930
> URL: https://issues.apache.org/jira/browse/ARROW-4930
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.12.1
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: setup.py
> Fix For: 2.0.0
>
>
> This is in reference to (4) in 
> [this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
>  mailing list discussion.
> Certain sections of setup.py assume a specific location of the C++ libraries. 
> Removing this hard assumption will simplify PyArrow builds significantly. As 
> far as I could tell these assumptions are made in the 
> {{build_ext._run_cmake()}} method (wherever bundling of C++ libraries are 
> handled).
>  # The first occurrence is before invoking cmake (see line 237).
>  # The second occurrence is when the C++ libraries are moved from their build 
> directory to the Python tree (see line 347). The actual implementation is in 
> the function {{_move_shared_libs_unix(..)}} (see line 468).
> Hope this helps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931697#comment-16931697
 ] 

Suvayu Ali commented on ARROW-6577:
---

For completeness, I managed to upgrade {{conda}} to 4.7.11, and now the problem 
no longer occurs.
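
(For anyone hitting the same thing, the upgrade itself is the standard 
one-liner; a sketch, assuming a writable base environment:)
{code}
$ conda update -n base -c defaults conda
{code}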

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Assignee: Uwe L. Korn
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931383#comment-16931383
 ] 

Suvayu Ali commented on ARROW-6577:
---

[~Igor Yastrebov] Thanks a lot, I'll see if I can upgrade {{conda}}. My issues 
were also mostly with boost.

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Assignee: Uwe L. Korn
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931357#comment-16931357
 ] 

Suvayu Ali commented on ARROW-6577:
---

[~Igor Yastrebov] Yes

[~xhochy] Hmm, it's not easy for me to upgrade conda itself. Thanks for 
investigating.  I'll see what I can do. 

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931333#comment-16931333
 ] 

Suvayu Ali edited comment on ARROW-6577 at 9/17/19 11:26 AM:
-

Hi Uwe, I mentioned the conda version in the Environment field above (4.6.13), 
and my condarc looks like this:

{code}
channels:
  - conda-forge
  - defaults
channel_priority: strict
auto_activate_base: true
pip_interop_enabled: true
{code}

I have also seen this on my colleague's Mac (I don't know the environment details).


was (Author: suvayu):
Hi Uwe, I mentioned the conda version in the Environment field above (4.6.13), 
and my condarc looks like this:

{code}
 channels:
  - conda-forge
  - defaults
channel_priority: strict
auto_activate_base: true
pip_interop_enabled: true
{code}

I have also seen this on my colleagues Mac (don't know the environment details).

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931333#comment-16931333
 ] 

Suvayu Ali commented on ARROW-6577:
---

Hi Uwe, I mentioned the conda version in the Environment field above (4.6.13), 
and my condarc looks like this:

{code}
 channels:
  - conda-forge
  - defaults
channel_priority: strict
auto_activate_base: true
pip_interop_enabled: true
{code}

I have also seen this on my colleague's Mac (I don't know the environment details).

> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-6577:
--
Description: 
When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
0.12.1. I think a common dependency is causing the downgrade, my guess is boost 
or protobuf. This is based on several instances of this issue I encountered 
over the last few weeks. It took me a while to find a somewhat reproducible 
recipe.
{code:java}
$ conda create -n test pyarrow pandas numpy
...
Proceed ([y]/n)? y
...
$ conda install -n test ipython
...
Proceed ([y]/n)? n
CondaSystemExit: Exiting.
{code}
I have attached a mildly edited (to remove progress bars, and control 
characters) transcript of this session. Here {{ipython}} triggers the problem, 
and downgrades {{pyarrow}} to 0.12.1, but I think there are other common 
packages who also conflict in this way. Please let me know if I can provide 
more info.

  was:
When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
picked up.  But installing certain packages downgrades pyarrow to 0.13.0 or 
0.12.1.  I think a common dependency is causing the downgrade, my guess is 
boost.  This is based on several instances of this issue I encountered over the 
last few weeks.  It took me a while to find a somewhat reproducible recipe.

{code}
$ conda create -n test pyarrow pandas numpy
...
Proceed ([y]/n)? y
...
$ conda install -n test ipython
...
Proceed ([y]/n)? n
CondaSystemExit: Exiting.
{code}

I have attached a mildly edited (to remove progress bars, and control 
characters) transcript of this session.  Here {{ipython}} triggers the problem, 
and downgrades {{pyarrow}} to 0.12.1, but I think there are other common 
packages who also conflict in this way.  Please let me know if I can provide 
more info.


> Dependency conflict in conda packages
> -
>
> Key: ARROW-6577
> URL: https://issues.apache.org/jira/browse/ARROW-6577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.1
> Environment: kernel: 5.2.11-200.fc30.x86_64
> conda 4.6.13
> Python 3.7.3
>Reporter: Suvayu Ali
>Priority: Major
> Attachments: pa-conda.txt
>
>
> When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
> picked up. But installing certain packages downgrades pyarrow to 0.13.0 or 
> 0.12.1. I think a common dependency is causing the downgrade, my guess is 
> boost or protobuf. This is based on several instances of this issue I 
> encountered over the last few weeks. It took me a while to find a somewhat 
> reproducible recipe.
> {code:java}
> $ conda create -n test pyarrow pandas numpy
> ...
> Proceed ([y]/n)? y
> ...
> $ conda install -n test ipython
> ...
> Proceed ([y]/n)? n
> CondaSystemExit: Exiting.
> {code}
> I have attached a mildly edited (to remove progress bars, and control 
> characters) transcript of this session. Here {{ipython}} triggers the 
> problem, and downgrades {{pyarrow}} to 0.12.1, but I think there are other 
> common packages who also conflict in this way. Please let me know if I can 
> provide more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6577) Dependency conflict in conda packages

2019-09-17 Thread Suvayu Ali (Jira)
Suvayu Ali created ARROW-6577:
-

 Summary: Dependency conflict in conda packages
 Key: ARROW-6577
 URL: https://issues.apache.org/jira/browse/ARROW-6577
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Affects Versions: 0.14.1
 Environment: kernel: 5.2.11-200.fc30.x86_64
conda 4.6.13
Python 3.7.3
Reporter: Suvayu Ali
 Attachments: pa-conda.txt

When I install pyarrow on a fresh environment, the latest version (0.14.1) is 
picked up.  But installing certain packages downgrades pyarrow to 0.13.0 or 
0.12.1.  I think a common dependency is causing the downgrade, my guess is 
boost.  This is based on several instances of this issue I encountered over the 
last few weeks.  It took me a while to find a somewhat reproducible recipe.

{code}
$ conda create -n test pyarrow pandas numpy
...
Proceed ([y]/n)? y
...
$ conda install -n test ipython
...
Proceed ([y]/n)? n
CondaSystemExit: Exiting.
{code}

I have attached a mildly edited (to remove progress bars, and control 
characters) transcript of this session.  Here {{ipython}} triggers the problem, 
and downgrades {{pyarrow}} to 0.12.1, but I think there are other common 
packages who also conflict in this way.  Please let me know if I can provide 
more info.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-5871) [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-14 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884841#comment-16884841
 ] 

Suvayu Ali commented on ARROW-5871:
---

Hi [~wesmckinn], I was able to build arrow-cpp and pyarrow from source from the 
maint-0.14.x branch. Although I have not done any further testing, such as 
installing the wheel on different platforms, the above crash does not happen 
when I do a simple import.
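
In case it helps others, a quick way to check which libcrypt a given pyarrow 
build links against is {{ldd}} (a sketch; the site-packages path is an example, 
and in the failing case it reports {{libcrypt.so.1 => not found}}):
{code}
$ ldd /path/to/site-packages/pyarrow/lib*.so | grep libcrypt
{code}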

> [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt
> 
>
> Key: ARROW-5871
> URL: https://issues.apache.org/jira/browse/ARROW-5871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.0
> Environment: 5.1.16-300.fc30.x86_64
> Python 3.7.3
> libxcrypt-4.4.6-2.fc30.x86_64
>Reporter: Suvayu Ali
>Priority: Major
> Fix For: 1.0.0
>
>
> In a freshly created virtual environment, after I install pyarrow 0.14.0 
> (using pip), importing pyarrow from the python prompt leads to crash:
> {code:java}
> $ mktmpenv
> [..]
> This is a temporary environment. It will be deleted when you run 'deactivate'.
> $ pip install pyarrow
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> Collecting numpy>=1.14 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
> Collecting six>=1.0.0 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
> Installing collected packages: numpy, six, pyarrow
> Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
> $ python --version
> Python 3.7.3
> $ python -m pyarrow
> Traceback (most recent call last):
> File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
> return _get_module_details(pkg_main_name, error)
> File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
> File 
> "/home/user/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
> directory{code}
> This is surprising because I have older versions of pyarrow (up to 0.13.0) 
> working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-5871) [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-09 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881214#comment-16881214
 ] 

Suvayu Ali commented on ARROW-5871:
---

I think those are the instructions I followed the last time I tried (around 
March); that attempt even led to a patch or two. I'll give it another go this 
weekend.

> [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt
> 
>
> Key: ARROW-5871
> URL: https://issues.apache.org/jira/browse/ARROW-5871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.0
> Environment: 5.1.16-300.fc30.x86_64
> Python 3.7.3
> libxcrypt-4.4.6-2.fc30.x86_64
>Reporter: Suvayu Ali
>Priority: Major
> Fix For: 1.0.0
>
>
> In a freshly created virtual environment, after I install pyarrow 0.14.0 
> (using pip), importing pyarrow from the python prompt leads to crash:
> {code:java}
> $ mktmpenv
> [..]
> This is a temporary environment. It will be deleted when you run 'deactivate'.
> $ pip install pyarrow
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> Collecting numpy>=1.14 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
> Collecting six>=1.0.0 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
> Installing collected packages: numpy, six, pyarrow
> Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
> $ python --version
> Python 3.7.3
> $ python -m pyarrow
> Traceback (most recent call last):
> File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
> return _get_module_details(pkg_main_name, error)
> File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
> File 
> "/home/user/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
> directory{code}
> This is surprising because I have older versions of pyarrow (up to 0.13.0) 
> working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5871) [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-08 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880948#comment-16880948
 ] 

Suvayu Ali commented on ARROW-5871:
---

Hi [~wesmckinn], I read that issue. Unfortunately my experience with conda has 
been rather frustrating. I think for production use I'll stick to 0.13.0 for 
now, and try to compile from source for experimental use. However, I have never 
successfully managed to compile pyarrow before (no issues with the C++ library, 
though).

Thanks a lot

> [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt
> 
>
> Key: ARROW-5871
> URL: https://issues.apache.org/jira/browse/ARROW-5871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.0
> Environment: 5.1.16-300.fc30.x86_64
> Python 3.7.3
> libxcrypt-4.4.6-2.fc30.x86_64
>Reporter: Suvayu Ali
>Priority: Major
> Fix For: 1.0.0
>
>
> In a freshly created virtual environment, after I install pyarrow 0.14.0 
> (using pip), importing pyarrow from the python prompt leads to crash:
> {code:java}
> $ mktmpenv
> [..]
> This is a temporary environment. It will be deleted when you run 'deactivate'.
> $ pip install pyarrow
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> Collecting numpy>=1.14 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
> Collecting six>=1.0.0 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
> Installing collected packages: numpy, six, pyarrow
> Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
> $ python --version
> Python 3.7.3
> $ python -m pyarrow
> Traceback (most recent call last):
> File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
> return _get_module_details(pkg_main_name, error)
> File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
> File 
> "/home/user/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
> directory{code}
> This is surprising because I have older versions of pyarrow (up to 0.13.0) 
> working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5871) [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-08 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880548#comment-16880548
 ] 

Suvayu Ali commented on ARROW-5871:
---

Hi [~wesmckinn], I see the same issue with the manylinux1 wheel.

> [Python] Can't import pyarrow 0.14.0 due to mismatching libcrypt
> 
>
> Key: ARROW-5871
> URL: https://issues.apache.org/jira/browse/ARROW-5871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.0
> Environment: 5.1.16-300.fc30.x86_64
> Python 3.7.3
> libxcrypt-4.4.6-2.fc30.x86_64
>Reporter: Suvayu Ali
>Priority: Major
>
> In a freshly created virtual environment, after I install pyarrow 0.14.0 
> (using pip), importing pyarrow from the python prompt leads to crash:
> {code:java}
> $ mktmpenv
> [..]
> This is a temporary environment. It will be deleted when you run 'deactivate'.
> $ pip install pyarrow
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> Collecting numpy>=1.14 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
> Collecting six>=1.0.0 (from pyarrow)
> Using cached 
> https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
> Installing collected packages: numpy, six, pyarrow
> Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
> $ python --version
> Python 3.7.3
> $ python -m pyarrow
> Traceback (most recent call last):
> File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
> mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
> File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
> return _get_module_details(pkg_main_name, error)
> File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
> __import__(pkg_name)
> File 
> "/home/user/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
>  line 49, in <module>
> from pyarrow.lib import cpu_count, set_cpu_count
> ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
> directory{code}
> This is surprising because I have older versions of pyarrow (up to 0.13.0) 
> working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5871) Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-07 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-5871:
--
Description: 
In a freshly created virtual environment, after I install pyarrow 0.14.0 (using 
pip), importing pyarrow from the python prompt leads to crash:
{code:java}
$ mktmpenv
[..]
This is a temporary environment. It will be deleted when you run 'deactivate'.
$ pip install pyarrow
Collecting pyarrow
Using cached 
https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
Collecting numpy>=1.14 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
Collecting six>=1.0.0 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, six, pyarrow
Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
$ python --version
Python 3.7.3
$ python -m pyarrow
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File 
"/home/user/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
 line 49, in <module>
from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
directory{code}
This is surprising because I have older versions of pyarrow (up to 0.13.0) 
working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!

  was:
In a freshly created virtual environment, after I install pyarrow 0.14.0 (using 
pip), importing pyarrow from the python prompt leads to crash:
{code:java}
$ mktmpenv
[..]
This is a temporary environment. It will be deleted when you run 'deactivate'.
$ pip install pyarrow
Collecting pyarrow
Using cached 
https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
Collecting numpy>=1.14 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
Collecting six>=1.0.0 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, six, pyarrow
Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
$ python --version
Python 3.7.3
$ python -m pyarrow
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File 
"/home/jallad/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
 line 49, in <module>
from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
directory{code}
This is surprising because I have older versions of pyarrow (up to 0.13.0) 
working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!


> Can't import pyarrow 0.14.0 due to mismatching libcrypt
> ---
>
> Key: ARROW-5871
> URL: https://issues.apache.org/jira/browse/ARROW-5871
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging
>Affects Versions: 0.14.0
> Environment: 5.1.16-300.fc30.x86_64
> Python 3.7.3
> libxcrypt-4.4.6-2.fc30.x86_64
>Reporter: Suvayu Ali
>Priority: Major
>
> In a freshly created virtual environment, after I install pyarrow 0.14.0 
> (using pip), importing pyarrow from the python prompt leads to crash:
> {code:java}
> $ mktmpenv
> [..]
> This is a temporary environment. It will be deleted when you run 'deactivate'.
> $ pip install pyarrow
> Collecting pyarrow
> Using cached 
> https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
> Collecting numpy>=1.14 (from pyarrow)
> Using cached 
> 

[jira] [Created] (ARROW-5871) Can't import pyarrow 0.14.0 due to mismatching libcrypt

2019-07-07 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-5871:
-

 Summary: Can't import pyarrow 0.14.0 due to mismatching libcrypt
 Key: ARROW-5871
 URL: https://issues.apache.org/jira/browse/ARROW-5871
 Project: Apache Arrow
  Issue Type: Bug
  Components: Packaging
Affects Versions: 0.14.0
 Environment: 5.1.16-300.fc30.x86_64
Python 3.7.3
libxcrypt-4.4.6-2.fc30.x86_64
Reporter: Suvayu Ali


In a freshly created virtual environment, after I install pyarrow 0.14.0 (using 
pip), importing pyarrow from the python prompt leads to crash:
{code:java}
$ mktmpenv
[..]
This is a temporary environment. It will be deleted when you run 'deactivate'.
$ pip install pyarrow
Collecting pyarrow
Using cached 
https://files.pythonhosted.org/packages/8f/fa/407667d763c25c3d9977e1d19038df3b4a693f37789c4fe1fe5c74a6bc55/pyarrow-0.14.0-cp37-cp37m-manylinux2010_x86_64.whl
Collecting numpy>=1.14 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/fc/d1/45be1144b03b6b1e24f9a924f23f66b4ad030d834ad31fb9e5581bd328af/numpy-1.16.4-cp37-cp37m-manylinux1_x86_64.whl
Collecting six>=1.0.0 (from pyarrow)
Using cached 
https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Installing collected packages: numpy, six, pyarrow
Successfully installed numpy-1.16.4 pyarrow-0.14.0 six-1.12.0
$ python --version
Python 3.7.3
$ python -m pyarrow
Traceback (most recent call last):
File "/usr/lib64/python3.7/runpy.py", line 183, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/usr/lib64/python3.7/runpy.py", line 142, in _get_module_details
return _get_module_details(pkg_main_name, error)
File "/usr/lib64/python3.7/runpy.py", line 109, in _get_module_details
__import__(pkg_name)
File 
"/home/jallad/.virtualenvs/tmp-8a4d52e7bb62853/lib/python3.7/site-packages/pyarrow/__init__.py",
 line 49, in <module>
from pyarrow.lib import cpu_count, set_cpu_count
ImportError: libcrypt.so.1: cannot open shared object file: No such file or 
directory{code}
This is surprising because I have older versions of pyarrow (up to 0.13.0) 
working, and libcrypt on my system (Fedora 30, Python 3.7) is libcrypt.so.2!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4930) Remove LIBDIR assumptions in Python build

2019-03-17 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-4930:
-

 Summary: Remove LIBDIR assumptions in Python build
 Key: ARROW-4930
 URL: https://issues.apache.org/jira/browse/ARROW-4930
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.12.1
Reporter: Suvayu Ali


This is in reference to (4) in 
[this|http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3C0AF328A1-ED2A-457F-B72D-3B49C8614850%40xhochy.com%3E]
 mailing list discussion.

Certain sections of setup.py assume a specific location of the C++ libraries. 
Removing this hard assumption will simplify PyArrow builds significantly. As 
far as I could tell these assumptions are made in the 
{{build_ext._run_cmake()}} method (wherever bundling of the C++ libraries is 
handled).
 # The first occurrence is before invoking cmake (see line 237).
 # The second occurrence is when the C++ libraries are moved from their build 
directory to the Python tree (see line 347). The actual implementation is in 
the function {{_move_shared_libs_unix(..)}} (see line 468).

Hope this helps.
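
One possible direction, purely as a sketch rather than a concrete proposal: 
since the C++ build installs pkg-config files, setup.py could ask pkg-config 
for the real library directory instead of assuming {{$ARROW_HOME/lib}}:
{code}
# wherever arrow.pc ended up after installing the C++ libraries
$ export PKG_CONFIG_PATH=$ARROW_HOME/lib64/pkgconfig
$ pkg-config --variable=libdir arrow
{code}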



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4814) [Python] Exception when writing nested columns that are tuples to parquet

2019-03-10 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-4814:
-

 Summary: [Python] Exception when writing nested columns that are 
tuples to parquet
 Key: ARROW-4814
 URL: https://issues.apache.org/jira/browse/ARROW-4814
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.1
 Environment: 4.20.8-100.fc28.x86_64
Reporter: Suvayu Ali
 Attachments: df_to_parquet_fail.py, test.csv

I get an exception when I try to write a {{pandas.DataFrame}} to a parquet file 
where one of the columns contains tuples. I use tuples here because they allow 
for easier querying in pandas (see ARROW-3806 for a more detailed description).

{code}
Traceback (most recent call last):
  File "df_to_parquet_fail.py", line 5, in <module>
    df.to_parquet("test.parquet")  # crashes
  File "/home/user/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2203, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 252, in to_parquet
    partition_cols=partition_cols, **kwargs)
  File "/home/user/.local/lib/python3.6/site-packages/pandas/io/parquet.py", line 113, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1141, in pyarrow.lib.Table.from_pandas
  File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 431, in dataframe_to_arrays
    convert_types)]
  File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 430, in <listcomp>
    for c, t in zip(columns_to_convert,
  File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in convert_column
    raise e
  File "/home/user/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 420, in convert_column
    return pa.array(col, type=ty, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 176, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 85, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert ('G',) with type tuple: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column ALTS with type object')
{code}

The issue may be replicated with the attached script and CSV file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708321#comment-16708321
 ] 

Suvayu Ali commented on ARROW-3874:
---

Since I'm using {{java-1.8.0-openjdk}}, I had to install 
{{java-1.8.0-openjdk-devel}} to get {{jni.h}}. For other Java versions on F29, 
it should be {{java-<version>-openjdk-devel}}.
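
(For reference, the corresponding commands would be something like the sketch 
below; the grep just confirms {{jni.h}} is now present.)
{code}
$ sudo dnf install java-1.8.0-openjdk-devel
$ rpm -ql java-1.8.0-openjdk-devel | grep 'include/jni.h'
{code}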

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Assignee: Suvayu Ali
>Priority: Major
>  Labels: cmake, pull-request-available
> Fix For: 0.12.0
>
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-12-03 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707086#comment-16707086
 ] 

Suvayu Ali commented on ARROW-3874:
---

Done: [https://github.com/apache/arrow/pull/3072]

Your question about {{jni.h}} gave me enough hints to find the correct missing 
package :), and now the build progresses until it fails with:

{code}
Scanning dependencies of target csv-chunker-test
CMakeFiles/json-integration-test.dir/json-integration-test.cc.o:json-integration-test.cc:function
 boost::system::error_category::std_category::equivalent(std::error_code 
const&, int) const:
error: undefined reference to 'boost::system::detail::generic_category_ncx()'
{code}

This is strange because I have {{boost-system-1.66.0-14.fc29.x86_64}} installed 
on my system. But I guess that is only a test binary, and the libraries 
themselves were built successfully.

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-28 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702776#comment-16702776
 ] 

Suvayu Ali edited comment on ARROW-3874 at 11/29/18 6:42 AM:
-

Okay, to summarise: my initial build issue on F28 was resolved by installing 
the llvm-static libraries.

On F29, cmake cannot find the correct version of LLVM.
{code}
$ export ARROW_HOME=~/opt 
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

/usr/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0
/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:25 (find_package)
{code}


Fedora provides alternate llvm versions installed in subdirectories, so I tried 
specifying {{LLVM_DIR}} when invoking cmake.
{code}
$ ls /usr/lib64/llvm6.0/
bin  include  lib
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON \
-DLLVM_DIR=/usr/lib64/llvm6.0 ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

/usr/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0
/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:25 (find_package)
{code}


So I patched {{find_library}} (see [^arrow-cmake-findllvm.patch]), which fixes 
the LLVM issue, but then I encounter the following Java issue:
{code}
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON \
-DLLVM_DIR=/usr/lib64/llvm6.0 ../
[...]
-- Found LLVM 6.0.1
-- Using LLVMConfig.cmake in: /usr/lib64/llvm6.0/lib/cmake/llvm
-- Found clang /usr/lib64/ccache/clang
-- Found llvm-link /usr/lib64/llvm6.0/bin/llvm-link
CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137 
(message):
  Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_INCLUDE_PATH
  JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
Call Stack (most recent call first):
  /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 
(_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake/Modules/FindJNI.cmake:356 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  src/gandiva/jni/CMakeLists.txt:21 (find_package)
{code}


My Java setup
{code}
$ echo $JAVA_HOME
/etc/alternatives/jre_openjdk
$  $JAVA_HOME/bin/java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
$ rpm -qa \*jni\* | sort 
hawtjni-1.16-3.fc29.noarch
hawtjni-runtime-1.16-3.fc29.noarch
$ rpm -qa \*java\* | sort 
java-11-openjdk-headless-11.0.1.13-4.fc29.x86_64
java-1.8.0-openjdk-headless-1.8.0.191.b12-8.fc29.x86_64
java-openjdk-headless-10.0.2.13-7.fc29.x86_64
javapackages-filesystem-5.3.0-1.fc29.noarch
javapackages-tools-5.3.0-1.fc29.noarch
tzdata-java-2018g-1.fc29.noarch
{code}

Unfortunately, I cannot easily compare F28 and F29 as I never have access to 
them simultaneously.


was (Author: suvayu):
Okay, to summarise: my initial build issue on F28 was resolved by installing 
the llvm-static libraries.

On F29, cmake cannot find the correct version of LLVM.
{code}
$ export ARROW_HOME=~/opt 
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

/usr/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0
/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:25 (find_package)
{code}
Fedora provides alternate llvm versions installed in subdirectories, so I tried 
specifying {{LLVM_DIR}} when invoking cmake.
{code}
$ ls /usr/lib64/llvm6.0/
bin  include  lib
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON \
-DLLVM_DIR=/usr/lib64/llvm6.0 ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-28 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702776#comment-16702776
 ] 

Suvayu Ali commented on ARROW-3874:
---

Okay, to summarise: my initial build issue on F28 was resolved by installing 
the llvm-static libraries.

On F29, cmake cannot find the correct version of LLVM.
{code}
$ export ARROW_HOME=~/opt 
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

/usr/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0
/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:25 (find_package)
{code}
Fedora provides alternate llvm versions installed in subdirectories, so I tried 
specifying {{LLVM_DIR}} when invoking cmake.
{code}
$ ls /usr/lib64/llvm6.0/
bin  include  lib
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON \
-DLLVM_DIR=/usr/lib64/llvm6.0 ../
[...]
CMake Error at cmake_modules/FindLLVM.cmake:24 (find_package):
  Could not find a configuration file for package "LLVM" that is compatible
  with requested version "6.0".

  The following configuration files were considered but not accepted:

/usr/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0
/lib64/cmake/llvm/LLVMConfig.cmake, version: 7.0.0

Call Stack (most recent call first):
  src/gandiva/CMakeLists.txt:25 (find_package)
{code}
So I patched the {{find_package}} call in {{FindLLVM.cmake}} (see 
[^arrow-cmake-findllvm.patch]); that fixes the LLVM issue, but then I encounter 
the following Java issue:
{code}
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PARQUET=on -DARROW_ORC=ON -DARROW_PLASMA=on -DARROW_GANDIVA=ON \
-DLLVM_DIR=/usr/lib64/llvm6.0 ../
[...]
-- Found LLVM 6.0.1
-- Using LLVMConfig.cmake in: /usr/lib64/llvm6.0/lib/cmake/llvm
-- Found clang /usr/lib64/ccache/clang
-- Found llvm-link /usr/lib64/llvm6.0/bin/llvm-link
CMake Error at /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:137 
(message):
  Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_INCLUDE_PATH
  JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
Call Stack (most recent call first):
  /usr/share/cmake/Modules/FindPackageHandleStandardArgs.cmake:378 
(_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake/Modules/FindJNI.cmake:356 (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  src/gandiva/jni/CMakeLists.txt:21 (find_package)
{code}
My Java setup:
{code}
$ echo $JAVA_HOME
/etc/alternatives/jre_openjdk
$  $JAVA_HOME/bin/java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
$ rpm -qa \*jni\* | sort 
hawtjni-1.16-3.fc29.noarch
hawtjni-runtime-1.16-3.fc29.noarch
$ rpm -qa \*java\* | sort 
java-11-openjdk-headless-11.0.1.13-4.fc29.x86_64
java-1.8.0-openjdk-headless-1.8.0.191.b12-8.fc29.x86_64
java-openjdk-headless-10.0.2.13-7.fc29.x86_64
javapackages-filesystem-5.3.0-1.fc29.noarch
javapackages-tools-5.3.0-1.fc29.noarch
tzdata-java-2018g-1.fc29.noarch
{code}

Unfortunately, I cannot easily compare F28 and F29 as I never have access to 
them simultaneously.

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring 

[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-28 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3874:
--
Attachment: arrow-cmake-findllvm.patch

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log, 
> arrow-cmake-findllvm.patch
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-28 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3874:
--
Environment: 
Fedora 29, master (1013a1dc)
gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
cmake version 3.12.1

  was:
Fedora 29, master (1013a1dc)
gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
llvm (7.0.0 and 6.0.1)


> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 29, master (1013a1dc)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 7.0.0 (default) and 6.0.1 (parallel installed package from Fedora repos)
> cmake version 3.12.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-26 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699406#comment-16699406
 ] 

Suvayu Ali commented on ARROW-3874:
---

Thanks for the link.

In the meantime, I tried to build with *Gandiva* on Fedora 29, and it failed to 
detect LLVM (my original attempt was on F28, which was resolved by installing 
the static libraries).

On F29 the default version is 7, while other versions like 6.0 are installed in 
subdirectories (e.g. {{/usr/lib64/llvm6.0}}). Setting {{-DLLVM_DIR=/path}} 
doesn't help; I had to add {{LLVM_DIR}} to the {{find_package}} call in 
{{FindLLVM.cmake}}.

While the edit resolved the LLVM issue, cmake failed again, this time unable to 
find {{JAVA_AWT_JNI}} (I don't remember the exact name; I'm not on F29 right 
now). I couldn't figure out whether something was actually missing or whether 
cmake again failed to detect it.  I'm also unsure how to report this: do I 
update this bug report and change the platform from F28 to F29, or do I close 
this and open a fresh one?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698439#comment-16698439
 ] 

Suvayu Ali edited comment on ARROW-3874 at 11/26/18 3:13 AM:
-

I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}

I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?


was (Author: suvayu):
I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}
I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698439#comment-16698439
 ] 

Suvayu Ali commented on ARROW-3874:
---

I had installed {{llvm-devel}} using dnf.  cmake worked fine after installing 
{{llvm-static}}. Thanks!

But during the build I also noticed that many already-installed libraries are 
being downloaded:
{code:java}
[  2%] Performing download step (download, verify and extract) for 'protobuf_ep'
[  2%] Performing download step (download, verify and extract) for 'thrift_ep'
{code}
I have these installed:
{code:java}
$ rpm -qa thrift\* protobuf\* 
protobuf-3.5.0-4.fc28.x86_64
protobuf-compiler-3.5.0-4.fc28.x86_64
protobuf-java-3.5.0-4.fc28.noarch
protobuf-c-1.3.0-4.fc28.x86_64
protobuf-devel-3.5.0-4.fc28.x86_64
protobuf-lite-3.5.0-4.fc28.x86_64
thrift-devel-0.10.0-9.fc28.x86_64
thrift-0.10.0-9.fc28.x86_64
{code}

Am I missing some libraries there as well?

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected

2018-11-25 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-3874:
-

 Summary: [Gandiva] Cannot build: LLVM not detected
 Key: ARROW-3874
 URL: https://issues.apache.org/jira/browse/ARROW-3874
 Project: Apache Arrow
  Issue Type: Bug
  Components: Gandiva
Affects Versions: 0.12.0
 Environment: Fedora 28, master (8d5bfc65)
gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
llvm 6.0.1
Reporter: Suvayu Ali
 Attachments: CMakeError.log, CMakeOutput.log

I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
detecting LLVM on the system.
{code}
$ cd build/data-an/arrow/arrow/cpp/
$ export ARROW_HOME=/opt/data-an
$ mkdir release
$ cd release/
$ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
-DARROW_GANDIVA=ON ../
[...]
-- Found LLVM 6.0.1
-- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
  Target X86 is not in the set of libraries.
Call Stack (most recent call first):
  cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
  src/gandiva/CMakeLists.txt:25 (find_package)


-- Configuring incomplete, errors occurred!
{code}
The cmake log files are attached.

When I invoke cmake with options other than *Gandiva*, it finishes successfully.

Here are the llvm libraries that are installed on my system:
{code}
$ rpm -qa llvm\* | sort
llvm3.9-libs-3.9.1-13.fc28.x86_64
llvm4.0-libs-4.0.1-5.fc28.x86_64
llvm-6.0.1-8.fc28.x86_64
llvm-devel-6.0.1-8.fc28.x86_64
llvm-libs-6.0.1-8.fc28.i686
llvm-libs-6.0.1-8.fc28.x86_64
$ ls /usr/lib64/libLLVM* /usr/include/llvm
/usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so

/usr/include/llvm:
ADT  FuzzMutate  Object Support
Analysis InitializePasses.h  ObjectYAML TableGen
AsmParserIR  Option Target
BinaryFormat IRReaderPassAnalysisSupport.h  Testing
Bitcode  LineEditor  Passes ToolDrivers
CodeGen  LinkAllIR.h Pass.h Transforms
Config   LinkAllPasses.h PassInfo.h WindowsManifest
DebugInfoLinker  PassRegistry.h WindowsResource
Demangle LTO PassSupport.h  XRay
ExecutionEngine  MC  ProfileData
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3874) [Gandiva] Cannot build: LLVM not detected correctly

2018-11-25 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3874:
--
Summary: [Gandiva] Cannot build: LLVM not detected correctly  (was: 
[Gandiva] Cannot build: LLVM not detected)

> [Gandiva] Cannot build: LLVM not detected correctly
> ---
>
> Key: ARROW-3874
> URL: https://issues.apache.org/jira/browse/ARROW-3874
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Gandiva
>Affects Versions: 0.12.0
> Environment: Fedora 28, master (8d5bfc65)
> gcc (GCC) 8.2.1 20181105 (Red Hat 8.2.1-5)
> llvm 6.0.1
>Reporter: Suvayu Ali
>Priority: Major
>  Labels: cmake
> Attachments: CMakeError.log, CMakeOutput.log
>
>
> I cannot build Arrow with {{-DARROW_GANDIVA=ON}}. {{cmake}} fails while 
> detecting LLVM on the system.
> {code}
> $ cd build/data-an/arrow/arrow/cpp/
> $ export ARROW_HOME=/opt/data-an
> $ mkdir release
> $ cd release/
> $ cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$ARROW_HOME 
> -DARROW_GANDIVA=ON ../
> [...]
> -- Found LLVM 6.0.1
> -- Using LLVMConfig.cmake in: /usr/lib64/cmake/llvm
> CMake Error at /usr/lib64/cmake/llvm/LLVM-Config.cmake:175 (message):
>   Target X86 is not in the set of libraries.
> Call Stack (most recent call first):
>   cmake_modules/FindLLVM.cmake:31 (llvm_map_components_to_libnames)
>   src/gandiva/CMakeLists.txt:25 (find_package)
> -- Configuring incomplete, errors occurred!
> {code}
> The cmake log files are attached.
> When I invoke cmake with options other than *Gandiva*, it finishes 
> successfully.
> Here are the llvm libraries that are installed on my system:
> {code}
> $ rpm -qa llvm\* | sort
> llvm3.9-libs-3.9.1-13.fc28.x86_64
> llvm4.0-libs-4.0.1-5.fc28.x86_64
> llvm-6.0.1-8.fc28.x86_64
> llvm-devel-6.0.1-8.fc28.x86_64
> llvm-libs-6.0.1-8.fc28.i686
> llvm-libs-6.0.1-8.fc28.x86_64
> $ ls /usr/lib64/libLLVM* /usr/include/llvm
> /usr/lib64/libLLVM-6.0.1.so  /usr/lib64/libLLVM-6.0.so  /usr/lib64/libLLVM.so
> /usr/include/llvm:
> ADT  FuzzMutate  Object Support
> Analysis InitializePasses.h  ObjectYAML TableGen
> AsmParserIR  Option Target
> BinaryFormat IRReaderPassAnalysisSupport.h  Testing
> Bitcode  LineEditor  Passes ToolDrivers
> CodeGen  LinkAllIR.h Pass.h Transforms
> Config   LinkAllPasses.h PassInfo.h WindowsManifest
> DebugInfoLinker  PassRegistry.h WindowsResource
> Demangle LTO PassSupport.h  XRay
> ExecutionEngine  MC  ProfileData
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3806) [Python] When converting nested types to pandas, use tuples

2018-11-16 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-3806:
-

 Summary: [Python] When converting nested types to pandas, use 
tuples
 Key: ARROW-3806
 URL: https://issues.apache.org/jira/browse/ARROW-3806
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.11.1
 Environment: Fedora 29, pyarrow installed with conda
Reporter: Suvayu Ali


When converting to pandas, convert nested types (e.g. list) to tuples.  Columns 
with lists are difficult to query.  Here are a few unsuccessful attempts (a 
sketch of the tuple-based alternative follows these attempts):

{code}
>>> mini
    CHROM    POS           ID            REF    ALTS  QUAL
80     20  63521  rs191905748              G     [A]   100
81     20  63541  rs117322527              C     [A]   100
82     20  63548  rs541129280              G    [GT]   100
83     20  63553  rs536661806              T     [C]   100
84     20  63555  rs553463231              T     [C]   100
85     20  63559  rs138359120              C     [A]   100
86     20  63586  rs545178789              T     [G]   100
87     20  63636  rs374311122              G     [A]   100
88     20  63696  rs149160003              A     [G]   100
89     20  63698  rs544072005              A     [C]   100
90     20  63729  rs181483669              G     [A]   100
91     20  63733   rs75670495              C     [T]   100
92     20  63799    rs1418258              C     [T]   100
93     20  63808   rs76004960              G     [C]   100
94     20  63813  rs532151719              G     [A]   100
95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
96     20  63865  rs551938596              G     [A]   100
97     20  63902  rs571779099              A     [T]   100
98     20  63963  rs531152674              G     [A]   100
99     20  63967  rs116770801              A     [G]   100
100    20  63977  rs199703510              C     [G]   100
101    20  64016  rs143263863              G     [A]   100
102    20  64062  rs148297240              G     [A]   100
103    20  64139  rs186497980              G  [A, T]   100
104    20  64150    rs7274499              C     [A]   100
105    20  64151  rs190945171              C     [T]   100
106    20  64154  rs537656456              T     [G]   100
107    20  64175  rs116531220              A     [G]   100
108    20  64186  rs141793347              C     [G]   100
109    20  64210  rs182418654              G     [C]   100
110    20  64303  rs559929739              C     [A]   100
{code}

# I think this one fails because it tries to broadcast the comparison.
{code}
>>> mini[mini.ALTS == ["A", "T"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", 
line 1283, in wrapper
res = na_op(values, other)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", 
line 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", 
line 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 31 vs 2
{code}
# I think this fails due to a similar reason, but the broadcasting is happening 
at a different place.
{code}
>>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 
2682, in __getitem__
return self._getitem_array(key)
  File 
"/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 
2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
  File 
"/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", 
line 1314, in _convert_to_indexer
indexer = check = labels.get_indexer(objarr)
  File 
"/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py",
 line 3259, in get_indexer
indexer = self._engine.get_indexer(target._ndarray_values)
  File "pandas/_libs/index.pyx", line 301, in 
pandas._libs.index.IndexEngine.get_indexer
  File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in 
pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
>>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
80 [True, False]
81 [True, False]
82[False, False]
83[False, False]
84[False, False]
{code}
# Unfortunately this clever hack fails as well!
{code}
>>> c = np.empty(1, object)
>>> c[0] = ["A", "T"]
>>> mini[mini.ALTS.values == c]
Traceback (most recent call last):
  File 
"/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py",
 line 3078, in get_loc
return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in 
pandas._libs.index.IndexEngine.get_loc
  File 
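By contrast, here is a small sketch of what the tuple representation requested 
here would allow. The toy frame below is hypothetical (not the real data), but 
it shows that with hashable tuples both {{isin}} and a plain element-wise 
equality test give a usable boolean mask:
{code:python}
import pandas as pd

# hypothetical toy frame mirroring the structure of `mini` above
df = pd.DataFrame({'ID': ['rs1', 'rs2', 'rs3'],
                   'ALTS': [('A',), ('A', 'T'), ('C',)]})

# tuples are hashable, so membership tests work out of the box
print(df[df.ALTS.isin([('A', 'T')])])

# and an element-wise comparison returns a plain boolean mask
print(df[df.ALTS.apply(lambda alts: alts == ('A', 'T'))])
{code}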

[jira] [Updated] (ARROW-3792) [PARQUET] Segmentation fault when writing empty RecordBatches

2018-11-14 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3792:
--
Description: 
h2. Background

I am trying to convert a very sparse dataset to parquet (~3% of rows in a range 
are populated). The file I am working with spans up to ~63M rows. I decided to 
iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
{{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
incrementally. Something like this:

{code:python}
batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)
{code}
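For orientation, here is a minimal self-contained sketch of that incremental-write 
pattern; {{pqwriter}} is a {{pyarrow.parquet.ParquetWriter}}, and the schema and 
batch contents below are made up rather than the real VCF-derived data:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# made-up schema and batches, standing in for the real data
schema = pa.schema([('CHROM', pa.string()), ('POS', pa.int64())])
batches = [pa.RecordBatch.from_arrays([pa.array(['20', '20']),
                                       pa.array([63521, 63541])],
                                      names=['CHROM', 'POS'])
           for _ in range(4)]

pqwriter = pq.ParquetWriter('example.parquet', schema)  # one writer, reused per iteration
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
pqwriter.close()
{code}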

I was getting a segmentation fault at the final step; I narrowed it down to a 
specific iteration. I noticed that iteration had empty batches; specifically, 
[0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the whole 
dataset is below:

{code:python}
[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
{code}

On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
unfortunately I couldn't create a proper minimal example with synthetic data.
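A rough sketch of the exclusion workaround (reusing the {{batches}} and 
{{pqwriter}} names from the snippet above; this is an illustration, not the 
exact code I ran):

{code:python}
# workaround sketch: drop zero-row RecordBatches before building the table
non_empty = [b for b in batches if b.num_rows > 0]
if non_empty:
    tbl = pa.Table.from_batches(non_empty)
    pqwriter.write_table(tbl, row_group_size=15000)
{code}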

h2. Not quite minimal example

The data I am using is from the 1000 Genome project, which has been public for 
many years, so we can be reasonably sure the data is good. The following steps 
should help you replicate the issue.

# Download the data file (and index), about 330MB:
{code:bash}
$ wget 
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
{code}
# Install the Cython library {{pysam}}, a thin wrapper around the reference 
implementation of the VCF file spec. You will need {{zlib}} headers, but that's 
probably not a problem :)
{code:bash}
$ pip3 install --user pysam
{code}
# Now you can use the attached script to replicate the crash.

h2. Extra information

I have tried attaching gdb; the backtrace when the segfault occurs is shown 
below (maybe it helps, this is how I realised empty batches could be the 
reason).

{code}
(gdb) bt
#0  0x7f3e7676d670 in 
parquet::TypedColumnWriter 
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) 
()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::TypedWriteBatch,
 arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3  0x7f3e7673df09 in 
parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4  0x7f3e7673c74d in 
parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5  0x7f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
const&, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6  0x7f3e731e3a51 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
_object*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
{code}

  was:
h2. Background

I am trying to convert a very sparse dataset to parquet (~3% of rows in a range 
are populated). The file I am working with spans up to ~63M rows. I decided to 
iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
{{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
incrementally. Something like this:

{code:python}
batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)

[jira] [Updated] (ARROW-3792) [PARQUET] Segmentation fault when writing empty RecordBatches

2018-11-14 Thread Suvayu Ali (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suvayu Ali updated ARROW-3792:
--
Description: 
h2. Background

I am trying to convert a very sparse dataset to parquet (~3% of rows in a range 
are populated). The file I am working with spans up to ~63M rows. I decided to 
iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
{{RecordBatch}}. I create 4 batches at a time, and write to a parquet file 
incrementally. Something like this:

{code:python}
batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)
{code}

I was getting a segmentation fault at the final step; I narrowed it down to a 
specific iteration. I noticed that iteration had empty batches; specifically, 
[0, 0, 2876, 14423]. The number of rows for each {{RecordBatch}} for the whole 
dataset is below:

{code:python}
[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
{code}

On excluding the empty {{RecordBatch}}-es, the segfault goes away, but 
unfortunately I couldn't create a proper minimal example with synthetic data.

h2. Not quite minimal example

The data I am using is from the 1000 Genome project, which has been public for 
many years, so we can be reasonably sure the data is good. The following steps 
should help you replicate the issue.

# Download the data file (and index), about 330MB:
   {code:bash}
   $ wget 
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
   {code}
# Install the Cython library {{pysam}}, a thin wrapper around the reference 
implementation of the VCF file spec. You will need {{zlib}} headers, but that's 
probably not a problem :)
   {code:bash}
   $ pip3 install --user pysam
   {code}
# Now you can use the attached script to replicate the crash.

h2. Extra information

I have tried attaching gdb; the backtrace when the segfault occurs is shown 
below (maybe it helps, this is how I realised empty batches could be the 
reason).

{code}
(gdb) bt
#0  0x7f3e7676d670 in 
parquet::TypedColumnWriter 
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) 
()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::TypedWriteBatch,
 arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3  0x7f3e7673df09 in 
parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4  0x7f3e7673c74d in 
parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5  0x7f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
const&, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6  0x7f3e731e3a51 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
_object*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
{code}

  was:
h2. Background

I am trying to convert a very sparse dataset to parquet (~3% of rows in a range 
are populated). The file I am working with spans up to ~63M rows. I decided to 
iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
RecordBatch. I create 4 batches at a time, and write to a parquet file 
incrementally. Something like this:

{code:python}
batches = [..]  # 4 batches
tbl = 

[jira] [Created] (ARROW-3792) [PARQUET] Segmentation fault when writing empty RecordBatches

2018-11-14 Thread Suvayu Ali (JIRA)
Suvayu Ali created ARROW-3792:
-

 Summary: [PARQUET] Segmentation fault when writing empty 
RecordBatches
 Key: ARROW-3792
 URL: https://issues.apache.org/jira/browse/ARROW-3792
 Project: Apache Arrow
  Issue Type: Bug
  Components: Format
Affects Versions: 0.11.1
 Environment: Fedora 28, pyarrow installed with pip
Fedora 29, pyarrow installed from conda-forge
Reporter: Suvayu Ali
 Attachments: pq-bug.py

h2. Background

I am trying to convert a very sparse dataset to parquet (~3% of rows in a range 
are populated). The file I am working with spans up to ~63M rows. I decided to 
iterate in batches of 500k rows, 127 batches in total. Each row batch is a 
RecordBatch. I create 4 batches at a time, and write to a parquet file 
incrementally. Something like this:

{code:python}
batches = [..]  # 4 batches
tbl = pa.Table.from_batches(batches)
pqwriter.write_table(tbl, row_group_size=15000)
# same issue with pq.write_table(..)
{code}

I was getting a segmentation fault at the final step; I narrowed it down to a 
specific iteration. I noticed that iteration had empty batches; specifically, 
[0, 0, 2876, 14423]. The number of rows for each RecordBatch for the whole 
dataset is below:

{code:python}
[14050, 16398, 14080, 14920, 15527, 14288, 15040, 14733, 15345, 15799,
15728, 15942, 14734, 15241, 15721, 15255, 14167, 14009, 13753, 14800,
14554, 14287, 15393, 14766, 16600, 15675, 14072, 13263, 12906, 14167,
14455, 15428, 15129, 16141, 15478, 16257, 14639, 14887, 14919, 15535,
13973, 14334, 13286, 15038, 15951, 17252, 15883, 19903, 16967, 16878,
15845, 12205, 8761, 0, 0, 0, 0, 0, 2876, 14423, 13557, 12723, 14330,
15452, 13551, 12723, 12396, 13531, 13539, 11512, 13175, 13941, 14634,
15515, 14239, 13856, 13873, 14154, 14822, 13543, 14653, 15328, 16171,
15101, 15055, 15194, 14058, 13706, 14747, 14650, 14694, 15397, 15122,
16055, 16635, 14153, 14665, 14781, 15462, 15426, 16150, 14632, 14532,
15139, 15324, 15279, 16075, 16394, 16834, 15391, 16320, 16504, 17248,
15913, 15341, 14754, 16637, 15695, 16642, 18143, 19481, 19072, 15742,
18807, 18789, 14258, 0, 0]
{code}

On excluding the empty RecordBatch-es, the segfault goes away, but 
unfortunately I couldn't create a proper minimal example with synthetic data.

h2. Not quite minimal example

The data I am using is from the 1000 Genome project, which has been public for 
many years, so we can be reasonably sure the data is good. The following steps 
should help you replicate the issue.

# Download the data file (and index), about 330MB:
   {code:bash}
   $ wget 
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz{,.tbi}
   {code}
# Install the Cython library pysam, a thin wrapper around the reference 
implementation of the VCF file spec. You will need zlib headers, but that's 
probably not a problem :)
   {code:bash}
   $ pip3 install --user pysam
   {code}
# Now you can use the attached script to replicate the crash.

h2. Extra information

I have tried attaching gdb; the backtrace when the segfault occurs is shown 
below (maybe it helps, this is how I realised empty batches could be the 
reason).

{code}
(gdb) bt
#0  0x7f3e7676d670 in 
parquet::TypedColumnWriter 
>::WriteMiniBatch(long, short const*, short const*, parquet::ByteArray const*) 
()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#1  0x7f3e76733d1e in arrow::Status parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::TypedWriteBatch,
 arrow::BinaryType>(arrow::Array const&, long, short const*, short const*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#2  0x7f3e7673a3d4 in parquet::arrow::(anonymous 
namespace)::ArrowColumnWriter::Write(arrow::Array const&) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#3  0x7f3e7673df09 in 
parquet::arrow::FileWriter::Impl::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#4  0x7f3e7673c74d in 
parquet::arrow::FileWriter::WriteColumnChunk(std::shared_ptr
 const&, long, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#5  0x7f3e7673c8d2 in parquet::arrow::FileWriter::WriteTable(arrow::Table 
const&, long) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/../../../libparquet.so.11
#6  0x7f3e731e3a51 in 
__pyx_pw_7pyarrow_8_parquet_13ParquetWriter_5write_table(_object*, _object*, 
_object*) ()
   from 
/home/user/miniconda3/lib/python3.6/site-packages/pyarrow/_parquet.cpython-36m-x86_64-linux-gnu.so
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1956) Support reading specific partitions from a partitioned parquet dataset

2017-12-29 Thread Suvayu Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16306537#comment-16306537
 ] 

Suvayu Ali commented on ARROW-1956:
---

Hi Wes,

Inspired by the way PySpark does it, I propose the following.

* Writing partitioned datasets:
{code:none}
writer = PartitionedParquetWriter(basepath, partitions, schema, ...)
{code}
  The rest of the arguments could be identical to ParquetWriter.  For that
  matter, we can also have:
{code:python}
writer = ParquetWriter(where, ..., compression='snappy', partitions=[])
{code}
  For a single file, all constructor arguments are as they are currently,
  and `partitions` is ignored; however, when `where` is a directory,
  `partitions` must be a list of column names to partition on.

* Reading partitioned datasets:
{code:python}
dst = ParquetDataset(path_or_paths, validate_schema=True, basepath=None)
{code}
  When `basepath` is `None`, we have the current behaviour, whereas if
  `basepath` is a path, directory hierarchies are detected in
  `path_or_paths`, and each sub-directory is treated as a parquet
  partition in the usual fashion (see the usage sketch after this list).
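
To make the proposal concrete, here is a usage sketch.  This is pseudocode for
the *proposed* API above: neither the `partitions` nor the `basepath` keyword
exists in pyarrow today, and the paths and column names are made up.
{code:python}
# writing: fan rows out under a base directory, partitioned on two columns
writer = ParquetWriter('dataset/', schema, partitions=['CHROM', 'year'])
writer.write_table(tbl)   # would create dataset/CHROM=.../year=.../part-*.parquet

# reading: pick out only two partitions, relative to the same base path
dst = ParquetDataset(['dataset/CHROM=20/year=2017',
                      'dataset/CHROM=21/year=2017'],
                     basepath='dataset/')
subset = dst.read()
{code}
This mirrors the PySpark calls quoted in the issue description below.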

What do you think?

If someone can provide guidance, I can also work on the implementation; I have 
lots of free time from the second week of January.

Thanks,

> Support reading specific partitions from a partitioned parquet dataset
> --
>
> Key: ARROW-1956
> URL: https://issues.apache.org/jira/browse/ARROW-1956
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Affects Versions: 0.8.0
> Environment: Kernel: 4.14.8-300.fc27.x86_64
> Python: 3.6.3
>Reporter: Suvayu Ali
>Priority: Minor
>  Labels: parquet
> Fix For: 0.9.0
>
> Attachments: so-example.py
>
>
> I want to read specific partitions from a partitioned parquet dataset.  This 
> is very useful in case of large datasets.  I have attached a small script 
> that creates a dataset and shows what is expected when reading (quoting 
> salient points below).
> # There is no way to read specific partitions in Pandas
> # In pyarrow I tried to achieve the goal by providing a list of 
> files/directories to ParquetDataset, but it didn't work: 
> # In PySpark it works if I simply do:
> {code:none}
> spark.read.options('basePath', 'datadir').parquet(*list_of_partitions)
> {code}
> I also couldn't find a way to easily write partitioned parquet files.  In the 
> end I did it by hand by creating the directory hierarchies, and writing the 
> individual files myself (similar to the implementation in the attached 
> script).  Again, in PySpark I can do 
> {code:none}
> df.write.partitionBy(*list_of_partitions).parquet(output)
> {code}
> to achieve that.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)