[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356639#comment-16356639 ]

ASF GitHub Bot commented on ARROW-2083:
---

pitrou commented on a change in pull request #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568#discussion_r166860298

## File path: .travis.yml
## @@ -61,94 +63,110 @@ matrix:
 - export CXX="clang++-4.0"
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
 - $TRAVIS_BUILD_DIR/ci/travis_lint.sh
-- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [OS X] C++ & Python w/ XCode 6.4
 - compiler: clang
 language: cpp
 osx_image: xcode6.4
 os: osx
 cache:
 addons:
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_PLASMA=1
 - export ARROW_TRAVIS_ORC=1
 - export ARROW_BUILD_WARNING_LEVEL=CHECKIN
-- travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [manylinux1] Python
 - language: cpp
 before_script:
-- docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh; fi
+
 # Java w/ OpenJDK 7
 - language: java
 os: linux
 jdk: openjdk7
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
+- if [ $ARROW_CI_SITE_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_site.sh; fi

Review comment:
Oh, I see. I thought `mvn site` rebuilt the whole Web site. Apparently I was mistaken.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Support skipping builds
> ---
>
> Key: ARROW-2083
> URL: https://issues.apache.org/jira/browse/ARROW-2083
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
>
> While appveyor supports a [skip appveyor], you cannot skip only travis. What
> is the feeling about adding e.g.
> [https://github.com/travis-ci/travis-ci/issues/5032#issuecomment-273626567]
> to our build? We could also do some simple kind of change detection so that we
> don't build the C++/Python parts, and only Java and the integration tests, if
> there was a change in the PR that only affects Java.
> I think it might be worthwhile to spend a bit of effort on that to take some
> load off the CI infrastructure.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
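The conditional pattern in the diff above relies on `ci/travis_detect_changes.py` printing `export ARROW_CI_*_AFFECTED=...` lines that each build job `eval`s. The following is only a minimal sketch of how such a detector might work; the component mapping, function names, and path prefixes are illustrative assumptions, not the actual contents of that script.

```python
# Hypothetical sketch of a change-detection script in the spirit of
# ci/travis_detect_changes.py. It maps changed file paths to affected
# components and prints shell "export" lines for the build to eval.

def affected_components(changed_files):
    """Return the set of components affected by the changed paths."""
    components = set()
    for path in changed_files:
        if path.startswith("cpp/"):
            # C++ changes also affect Python, which wraps the C++ library.
            components.update({"CPP", "PYTHON"})
        elif path.startswith("python/"):
            components.add("PYTHON")
        elif path.startswith("java/"):
            components.add("JAVA")
        elif path.startswith("site/"):
            components.add("SITE")
    return components


def export_lines(changed_files, all_components=("CPP", "PYTHON", "JAVA", "SITE")):
    """Emit one export line per component, 1 if affected else 0."""
    affected = affected_components(changed_files)
    return ["export ARROW_CI_%s_AFFECTED=%d" % (c, int(c in affected))
            for c in all_components]


if __name__ == "__main__":
    # Travis would run: eval `python travis_detect_changes.py`
    for line in export_lines(["java/pom.xml"]):
        print(line)
```

A Java-only PR would thus leave `ARROW_CI_CPP_AFFECTED=0`, and the `if [ $ARROW_CI_CPP_AFFECTED == "1" ]` guards in `.travis.yml` would skip the C++ build steps entirely.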
[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356642#comment-16356642 ]

ASF GitHub Bot commented on ARROW-2083:
---

pitrou commented on issue #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568#issuecomment-364039703

Ok, I addressed the review comment and rebased.
[jira] [Updated] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn updated ARROW-1632:
---
Fix Version/s: (was: 0.10.0) 0.9.0

> [Python] Permit categorical conversions in Table.to_pandas on a per-column
> basis
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> Currently this is all or nothing
[jira] [Assigned] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn reassigned ARROW-1632:
---
Assignee: Uwe L. Korn
[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356687#comment-16356687 ]

Uwe L. Korn commented on ARROW-1632:
---
Keeping this in 0.9 for now; I will have a look at whether I can get it done in time.
[jira] [Commented] (ARROW-1975) [C++] Add abi-compliance-checker to build process
[ https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356693#comment-16356693 ]

Uwe L. Korn commented on ARROW-1975:
---
[~wesmckinn] yes, checking this automatically would save me quite some follow-up work on releases.

> [C++] Add abi-compliance-checker to build process
> -
>
> Key: ARROW-1975
> URL: https://issues.apache.org/jira/browse/ARROW-1975
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> I would like to check our baseline modules with
> https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades
> are much smoother and that we don't break the ABI in patch releases.
> As we're still pre-1.0, I accept that there will be breakage, but I would like
> to keep it to a minimum. Currently the biggest pain with Arrow is that you need
> to always pin it in Python with {{==0.x.y}}, otherwise segfaults are
> inevitable.
[jira] [Created] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
Michal Danko created ARROW-2113:
---
Summary: [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
Key: ARROW-2113
URL: https://issues.apache.org/jira/browse/ARROW-2113
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
Reporter: Michal Danko

Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Ends with error:

loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
 File "", line 1, in
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed

-

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Works properly.

I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. segments. I would expect pyarrow to work with any definition of the path to a .jar.

Please note that the paths are not made up at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for oozie workflows.
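The report above hinges on a chain of relative symlinks. The sketch below (illustrative only; it exercises the filesystem, not libhdfs) rebuilds the reporter's layout in a temporary directory and shows that both CLASSPATH values resolve to the same real file, which suggests the failure lies in how the JVM/libhdfs classpath handling follows the links rather than in the links themselves.

```python
# Reproduce the reporter's symlink layout in a temp directory and show
# that both CLASSPATH values resolve to the same real file.
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "jars"))
os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))

# touch jars/test1.jar
open(os.path.join(root, "jars", "test1.jar"), "w").close()

# ln -s ../../jars/test1.jar test1.jar ; ln -s test1.jar test.jar
zk = os.path.join(root, "lib", "zookeeper")
os.symlink("../../jars/test1.jar", os.path.join(zk, "test1.jar"))
os.symlink("test1.jar", os.path.join(zk, "test.jar"))

# ln -s ../../../lib/zookeeper/test.jar test.jar
hl = os.path.join(root, "lib", "hadoop", "lib")
os.symlink("../../../lib/zookeeper/test.jar", os.path.join(hl, "test.jar"))

deep = os.path.realpath(os.path.join(hl, "test.jar"))     # the failing CLASSPATH
shallow = os.path.realpath(os.path.join(zk, "test.jar"))  # the working CLASSPATH

# Both chains of relative links resolve to the same real file.
assert deep == shallow == os.path.join(os.path.realpath(root), "jars", "test1.jar")
```

Since `os.path.realpath` resolves both paths identically, the filesystem itself is consistent; one possible workaround along these lines would be to resolve CLASSPATH entries to their real paths before handing them to the JVM.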
[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356750#comment-16356750 ]

ASF GitHub Bot commented on ARROW-2083:
---

xhochy closed pull request #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/.travis.yml b/.travis.yml
index 58d6786aa..d591a9922 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -46,11 +46,13 @@ matrix:
 allow_failures:
 - jdk: oraclejdk9
 include:
+ # C++ & Python w/ clang 4.0
 - compiler: gcc
 language: cpp
 os: linux
 group: deprecated
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_VALGRIND=1
 - export ARROW_TRAVIS_PLASMA=1
@@ -61,12 +63,13 @@ matrix:
 - export CXX="clang++-4.0"
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
 - $TRAVIS_BUILD_DIR/ci/travis_lint.sh
-- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [OS X] C++ & Python w/ XCode 6.4
 - compiler: clang
 language: cpp
 osx_image: xcode6.4
@@ -74,81 +77,96 @@ matrix:
 cache:
 addons:
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_PLASMA=1
 - export ARROW_TRAVIS_ORC=1
 - export ARROW_BUILD_WARNING_LEVEL=CHECKIN
-- travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [manylinux1] Python
 - language: cpp
 before_script:
-- docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh; fi
+
 # Java w/ OpenJDK 7
 - language: java
 os: linux
 jdk: openjdk7
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_javadoc.sh; fi
+
 # Java w/ Oracle JDK 9
 - language: java
 os: linux
-env: ARROW_TRAVIS_SKIP_SITE=yes
 jdk: oraclejdk9
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
 addons:
 apt:
 packages:
 - oracle-java9-installer
+
 # Integration w/ OpenJDK 8
 - language: java
 os: linux
 env: ARROW_TEST_GROUP=integration
 jdk: openjdk8
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - source $TRAVIS_BUILD_DIR/ci/t
[jira] [Resolved] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved ARROW-2083.
---
Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1568 [https://github.com/apache/arrow/pull/1568]
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356752#comment-16356752 ]

ASF GitHub Bot commented on ARROW-1425:
---

xhochy commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364062240

@icexelloss @wesm Keep it in Python for now. In the future, we should merge all documentation into a single Sphinx setup. As long as we have not done this, Python is a good default place, as it is already on Sphinx and is currently the most detailed documentation.

> [Python] Document semantic differences between Spark timestamps and Arrow
> timestamps
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Heimir Thor Sverrisson
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> The way that Spark treats non-timezone-aware timestamps as session local can
> be problematic when using pyarrow, which may view the data coming from
> toPandas() as time zone naive (but with fields as though it were UTC, not
> session local). We should document carefully how to properly handle the data
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356767#comment-16356767 ]

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364069670

It was puzzling to the author and me where to place the new piece of documentation – we looked for a top-level doc directory but there was none.
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356768#comment-16356768 ]

Antoine Pitrou commented on ARROW-1021:
---
What is the status of {{arrow/python/api.h}}? It looks more like an internal helper compared to {{arrow/python/pyarrow.h}}.

> [Python] Add documentation about using pyarrow from other Cython and C++
> projects
> -
>
> Key: ARROW-1021
> URL: https://issues.apache.org/jira/browse/ARROW-1021
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> Follow up work to ARROW-819, ARROW-714
[jira] [Updated] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1021:
---
Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356785#comment-16356785 ]

ASF GitHub Bot commented on ARROW-1021:
---

pitrou opened a new pull request #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API
URL: https://github.com/apache/arrow/pull/1576
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356809#comment-16356809 ]

Antoine Pitrou commented on ARROW-1021:
---
By the way, what's the intended use of {{pyarrow/public-api.pxi}}? The hyphen makes it non-cimportable:

{code}
Error compiling Cython file:
...
from pyarrow.public-api cimport *
                   ^
ttt.pyx:2:19: Expected 'import' or 'cimport'
{code}
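The error quoted above follows directly from Python's identifier rules, which Cython module paths inherit: a dotted import path is a sequence of identifiers, and identifiers may not contain hyphens. (A `.pxi` file is textually included rather than cimported, so the hyphenated name is only a problem when one tries to treat it as a module.) A quick check, independent of pyarrow itself:

```python
# Module names in an import/cimport statement must be valid Python
# identifiers; str.isidentifier() applies exactly that rule.
print("public-api".isidentifier())   # False: hyphens are not allowed
print("public_api".isidentifier())   # True: underscore variant is fine
```

This is why renaming such a file to an underscore form (or keeping it as an include-only `.pxi`) sidesteps the "Expected 'import' or 'cimport'" parse error.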
[jira] [Updated] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michal Danko updated ARROW-2113:
---
Description:

Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

(this part depends on your configuration; you need these values for pyarrow.hdfs to work:)
(path to libjvm:)
(export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
(path to libhdfs:)
(export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Ends with error:

loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
 File "", line 1, in
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed

-

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Works properly.

I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. segments. I would expect pyarrow to work with any definition of the path to a .jar.

Please note that the paths are not made up at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for oozie workflows.

> [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS
> connection failed"
> -
[jira] [Updated] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Danko updated ARROW-2113: Description: Steps to replicate the issue: mkdir /tmp/test cd /tmp/test mkdir jars cd jars touch test1.jar mkdir -p ../lib/zookeeper cd ../lib/zookeeper ln -s ../../jars/test1.jar ./test1.jar ln -s test1.jar test.jar mkdir -p ../hadoop/lib cd ../hadoop/lib ln -s ../../../lib/zookeeper/test.jar ./test.jar (this part depends on your configuration you need those values for pyarrow.hdfs to work: ) (path to libjvm: ) (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) (path to libhdfs: ) (export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" python import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs") Ends with error: loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError) hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError) Traceback (most recent call last): ( File "", line 1, in File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver) File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver) File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) pyarrow.lib.ArrowIOError: HDFS connection failed - export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" python import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs") 
Works properly. I can't find a reason why the first CLASSPATH fails while the second one works: both are paths to the same .jar, the first just goes through one extra symlink. To me it looks like pyarrow's classpath handling has a problem with symlink chains that resolve through many levels of ../ . I would expect pyarrow to work with any valid path to the .jar. Please note that these paths are not chosen at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for Oozie workflows.
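The surprising part of the report is that both CLASSPATH values point at the same real file, so they should behave identically. That claim can be verified with the standard library alone; this sketch rebuilds the symlink chain under a temporary directory instead of /tmp/test:

```python
import os
import tempfile

# Recreate the symlink chain from the report inside a temporary directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "jars"))
open(os.path.join(root, "jars", "test1.jar"), "w").close()

os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.symlink("../../jars/test1.jar",
           os.path.join(root, "lib", "zookeeper", "test1.jar"))
os.symlink("test1.jar",
           os.path.join(root, "lib", "zookeeper", "test.jar"))

os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))
os.symlink("../../../lib/zookeeper/test.jar",
           os.path.join(root, "lib", "hadoop", "lib", "test.jar"))

failing = os.path.join(root, "lib", "hadoop", "lib", "test.jar")   # first CLASSPATH
working = os.path.join(root, "lib", "zookeeper", "test.jar")       # second CLASSPATH

# Both chains terminate at jars/test1.jar, which is why the reporter
# expects the two CLASSPATH values to be interchangeable.
same = os.path.realpath(failing) == os.path.realpath(working)
```

Since the OS resolves both chains to the same file, the difference in behaviour has to come from how the JVM classpath loader (not the filesystem) treats the multi-level `../` links.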
[jira] [Commented] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type
[ https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356878#comment-16356878 ] ASF GitHub Bot commented on ARROW-2073: --- xhochy closed pull request #1572: ARROW-2073: [Python] Create struct array from sequence of tuples URL: https://github.com/apache/arrow/pull/1572 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 1e431c29f..f0e5449b6 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -771,18 +771,21 @@ class StructConverter : public TypedConverterVisitorAppend()); -if (!PyDict_Check(obj)) { - return Status::TypeError("dict value expected for struct type"); +// Note heterogenous sequences are not allowed +if (ARROW_PREDICT_FALSE(source_kind_ == UNKNOWN)) { + if (PyDict_Check(obj)) { +source_kind_ = DICTS; + } else if (PyTuple_Check(obj)) { +source_kind_ = TUPLES; + } } -// NOTE we're ignoring any extraneous dict items -for (int i = 0; i < num_fields_; i++) { - PyObject* nameobj = PyList_GET_ITEM(field_name_list_.obj(), i); - PyObject* valueobj = PyDict_GetItem(obj, nameobj); // borrowed - RETURN_IF_PYERROR(); - RETURN_NOT_OK(value_converters_[i]->AppendSingle(valueobj ? valueobj : Py_None)); +if (PyDict_Check(obj) && source_kind_ == DICTS) { + return AppendDictItem(obj); +} else if (PyTuple_Check(obj) && source_kind_ == TUPLES) { + return AppendTupleItem(obj); +} else { + return Status::TypeError("Expected sequence of dicts or tuples for struct type"); } - -return Status::OK(); } // Append a missing item @@ -797,9 +800,33 @@ class StructConverter : public TypedConverterVisitorAppendSingle(valueobj ? 
valueobj : Py_None)); +} +return Status::OK(); + } + + Status AppendTupleItem(PyObject* obj) { +if (PyTuple_GET_SIZE(obj) != num_fields_) { + return Status::Invalid("Tuple size must be equal to number of struct fields"); +} +for (int i = 0; i < num_fields_; i++) { + PyObject* valueobj = PyTuple_GET_ITEM(obj, i); + RETURN_NOT_OK(value_converters_[i]->AppendSingle(valueobj)); +} +return Status::OK(); + } + std::vector> value_converters_; OwnedRef field_name_list_; int num_fields_; + // Whether we're converting from a sequence of dicts or tuples + enum { UNKNOWN, DICTS, TUPLES } source_kind_ = UNKNOWN; }; class DecimalConverter diff --git a/python/benchmarks/convert_builtins.py b/python/benchmarks/convert_builtins.py index 92b2b850f..a4dc9f262 100644 --- a/python/benchmarks/convert_builtins.py +++ b/python/benchmarks/convert_builtins.py @@ -144,11 +144,21 @@ def generate_int_list_list(self, n, min_size, max_size, partial(self.generate_int_list, none_prob=none_prob), n, min_size, max_size, none_prob) +def generate_tuple_list(self, n, none_prob=DEFAULT_NONE_PROB): +""" +Generate a list of tuples with random values. +Each tuple has the form `(int value, float value, bool value)` +""" +dicts = self.generate_dict_list(n, none_prob=none_prob) +tuples = [(d.get('u'), d.get('v'), d.get('w')) + if d is not None else None + for d in dicts] +assert len(tuples) == n +return tuples def generate_dict_list(self, n, none_prob=DEFAULT_NONE_PROB): """ -Generate a list of dicts with a random size between *min_size* and -*max_size*. +Generate a list of dicts with random values. 
Each dict has the form `{'u': int value, 'v': float value, 'w': bool value}` """ ints = self.generate_int_list(n, none_prob=none_prob) @@ -179,12 +189,14 @@ def get_type_and_builtins(self, n, type_name): """ size = None -if type_name in ('bool', 'ascii', 'unicode', 'int64 list', 'struct'): +if type_name in ('bool', 'ascii', 'unicode', 'int64 list'): kind = type_name elif type_name.startswith(('int', 'uint')): kind = 'int' elif type_name.startswith('float'): kind = 'float' +elif type_name.startswith('struct'): +kind = 'struct' elif type_name == 'binary': kind = 'varying binary' elif type_name.startswith('binary'): @@ -226,6 +238,7 @@ def get_type_and_builtins(self, n, type_name): 'int64 list': partial(self.generate_int_list_list, min_size=0, max_size=20), 'struct': sel
[jira] [Resolved] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type
[ https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-2073. Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1572 [https://github.com/apache/arrow/pull/1572] > [Python] Create StructArray from sequence of tuples given a known data type > --- > > Key: ARROW-2073 > URL: https://issues.apache.org/jira/browse/ARROW-2073 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Following ARROW-1705, we should support calling {{pa.array}} with a sequence > of tuples, presuming a struct type is passed for the {{type}} parameter. > We also probably want to disallow mixed inputs, e.g. a sequence of both dicts > and tuples. The user should use only one idiom at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
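The converter in PR 1572 locks onto the kind of the first non-null element (dict or tuple) and rejects heterogeneous sequences. The dispatch rule can be sketched in plain Python; `check_struct_sequence` is a hypothetical helper mimicking the C++ logic, not part of the pyarrow API:

```python
def check_struct_sequence(seq):
    """Mimic StructConverter's source_kind_ dispatch: the first non-None
    element fixes the accepted kind, and mixed inputs raise TypeError."""
    source_kind = None  # corresponds to the UNKNOWN state in the C++ code
    for obj in seq:
        if obj is None:
            # Missing items are always allowed (AppendNull in the C++ code).
            continue
        if source_kind is None:
            if isinstance(obj, dict):
                source_kind = dict
            elif isinstance(obj, tuple):
                source_kind = tuple
        if source_kind is None or not isinstance(obj, source_kind):
            raise TypeError(
                "Expected sequence of dicts or tuples for struct type")
    return source_kind
```

So `check_struct_sequence([(1, 'a'), None, (2, 'b')])` settles on tuples, while a sequence mixing dicts and tuples is rejected, matching the "one idiom at a time" requirement in the issue description.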
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356889#comment-16356889 ] Uwe L. Korn commented on ARROW-1021: {{.pxi}} files are not meant to be used directly. They all render into {{pyarrow.lib}} (see the includes in {{pyarrow/lib.pyx}}) > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356896#comment-16356896 ] Antoine Pitrou commented on ARROW-1021: --- Thanks. So, IIUC, 3rd party Cython code is expected to use only the symbols defined as {{cdef public}} in {{lib.pxd}}? > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2114) [Python] Pull latest docker manylinux1 image
Uwe L. Korn created ARROW-2114: -- Summary: [Python] Pull latest docker manylinux1 image Key: ARROW-2114 URL: https://issues.apache.org/jira/browse/ARROW-2114 Project: Apache Arrow Issue Type: Task Reporter: Uwe L. Korn Assignee: Uwe L. Korn Fix For: 0.9.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356908#comment-16356908 ] Uwe L. Korn commented on ARROW-2114: [~wesmckinn] These changes are minimal and only an artifact of the docker maintenance. Are you ok when in future I don't make tickets for them? (They shouldn't show up in the changelog) > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356909#comment-16356909 ] ASF GitHub Bot commented on ARROW-2114: --- xhochy opened a new pull request #1577: ARROW-2114: [Python] Pull latest docker manylinux1 image [skip appveyor] URL: https://github.com/apache/arrow/pull/1577 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2114: -- Labels: pull-request-available (was: ) > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356922#comment-16356922 ] Antoine Pitrou commented on ARROW-1021: --- I've tried to add a test for the Cython API: [https://github.com/apache/arrow/pull/1576/files#diff-8dbd260ac34efe0c510155d2a86c1405] Does that reflect the intended idiom for calling into that API? > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356935#comment-16356935 ] Uwe L. Korn commented on ARROW-1021: {quote}So, IIUC, 3rd party Cython code is expected to use only the symbols defined as {{cdef public}} in {{lib.pxd}}? {quote} Yes. {quote}Does that reflect the intended idiom for calling into that API? {quote} Also yes but until now I have only used that API with {{boost::python}} and {{pybind11}}. I will add that afterwards to the documentation. > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356955#comment-16356955 ] Antoine Pitrou commented on ARROW-1021: --- Note it is currently required to also add the Numpy C include path: https://travis-ci.org/pitrou/arrow/jobs/338970086#L3616-L3623
{code}
In file included from pyarrow_cython_example.cpp:571:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/api.h:22:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/arrow_to_python.h:27:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/python_to_arrow.h:26:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/common.h:23:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/config.h:23:
/Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/numpy_interop.h:23:10: fatal error: 'numpy/numpyconfig.h' file not found
#include <numpy/numpyconfig.h>
{code}
> [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
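The missing-header failure above is the usual symptom of compiling against pyarrow's C++ headers without also passing NumPy's include directory. A minimal third-party setup.py sketch, assuming pyarrow and numpy are installed; `my_arrow_ext` and its source file are hypothetical names, and exact link flags vary by platform and pyarrow version:

```python
# setup.py sketch: build an extension against the pyarrow C++ API.
from setuptools import setup, Extension

import numpy as np
import pyarrow as pa

ext = Extension(
    "my_arrow_ext",                 # hypothetical extension module name
    sources=["my_arrow_ext.cpp"],   # hypothetical source file
    # pyarrow's headers pull in numpy/numpyconfig.h, so both include
    # directories are needed to avoid the 'file not found' error above.
    include_dirs=[pa.get_include(), np.get_include()],
    libraries=["arrow", "arrow_python"],
    # get_library_dirs() may not exist in older pyarrow releases.
    library_dirs=pa.get_library_dirs() if hasattr(pa, "get_library_dirs") else [],
    language="c++",
)

setup(name="my-arrow-ext", ext_modules=[ext])
```

Running `python setup.py build_ext --inplace` with this configuration should get past the include error; any remaining failures would be link-time settings, which differ per platform.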
[jira] [Commented] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357061#comment-16357061 ] Brian Hulette commented on ARROW-1501: -- [~wesmckinn] the integration tests still only test our ability to consume arrow data with JS, so we may want to keep this open until we have a JS writer we can use. I'll create some more issues to track that side of things > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Tracking JIRA for integration test-related issues -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2115) [JS] Test arrow data production in integration test
Brian Hulette created ARROW-2115: Summary: [JS] Test arrow data production in integration test Key: ARROW-2115 URL: https://issues.apache.org/jira/browse/ARROW-2115 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Currently the integration tests only treat the JS implementation as a consumer, and we also need to test its ability to produce arrow data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2115) [JS] Test arrow data production in integration test
[ https://issues.apache.org/jira/browse/ARROW-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2115: - Issue Type: Improvement (was: Bug) > [JS] Test arrow data production in integration test > --- > > Key: ARROW-2115 > URL: https://issues.apache.org/jira/browse/ARROW-2115 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > > Currently the integration tests only treat the JS implementation as a > consumer, and we also need to test its ability to produce arrow data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2116) [JS] Implement IPC writer
Brian Hulette created ARROW-2116: Summary: [JS] Implement IPC writer Key: ARROW-2116 URL: https://issues.apache.org/jira/browse/ARROW-2116 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2116) [JS] Implement IPC writer
[ https://issues.apache.org/jira/browse/ARROW-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2116: - Issue Type: Improvement (was: Bug) > [JS] Implement IPC writer > - > > Key: ARROW-2116 > URL: https://issues.apache.org/jira/browse/ARROW-2116 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2116) [JS] Implement IPC writer
[ https://issues.apache.org/jira/browse/ARROW-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357088#comment-16357088 ] Brian Hulette commented on ARROW-2116: -- [~paul.e.taylor] didn't you work on a JS writer? > [JS] Implement IPC writer > - > > Key: ARROW-2116 > URL: https://issues.apache.org/jira/browse/ARROW-2116 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357119#comment-16357119 ] ASF GitHub Bot commented on ARROW-1425: --- wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems URL: https://github.com/apache/arrow/pull/1575#issuecomment-364159244 We don't yet have a place (outside `format/`) for language-independent or cross-language documentation. This would be very helpful to get set up if we can agree as a community what tool to use to build this documentation This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
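The pitfall ARROW-1425 asks to document can be shown with the standard library: Spark treats a naive timestamp as session-local wall-clock time, while a consumer that reads the same digits as UTC shifts every instant by the session offset. In this sketch the +02:00 session zone is an assumption chosen for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical Spark session timezone (assumption: UTC+02:00).
session_tz = timezone(timedelta(hours=2))

# A timezone-naive timestamp as it might come out of toPandas().
naive = datetime(2018, 2, 8, 12, 0, 0)

spark_view = naive.replace(tzinfo=session_tz)   # Spark: session-local wall time
utc_view = naive.replace(tzinfo=timezone.utc)   # naive consumer: same digits, read as UTC

# Both views show "12:00", but they denote instants two hours apart.
delta = utc_view - spark_view
```

The stored value never changes; only its interpretation does, which is why the discrepancy is silent and worth documenting explicitly.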
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357121#comment-16357121 ] Wes McKinney commented on ARROW-2114: - Sounds good to me, no need to create JIRAs for Docker image maintenance > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357155#comment-16357155 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364164874 This required a bit more churn than I expected (especially to get the Cython example and test to work). I think this is ready for review now. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357208#comment-16357208 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364175414 Hmm, there's still an AppVeyor failure. Will try to fix :-/ This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357304#comment-16357304 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167015825 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List.to a numpy array + * + * 1. Create a 1D numpy that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which will leaves us with a + * refcount of 1, with nothing owning that 1 reference. 
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ + if (base == py_ref_) { +Py_INCREF(base); Review comment: Another way to handle this would be to put the INCREF in the branch without the capsule. Then if `PyArray_SetBaseObject` fails, we decref `base` unconditionally (which will either destroy the capsule or reset the `py_ref_` ref count to what it was originally) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357303#comment-16357303 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167014864 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List.to a numpy array + * + * 1. Create a 1D numpy that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which will leaves us with a + * refcount of 1, with nothing owning that 1 reference. 
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ Review comment: Can you use C++-style comment with `//`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
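The refcount walkthrough quoted in the diff comment above can be modeled with plain integers. This is an illustrative toy, not the real CPython API; `Obj` and its counter stand in for a Python object's reference count:

```python
class Obj:
    """Toy model of a refcounted object for the two scenarios above."""
    def __init__(self):
        self.refcount = 1  # a freshly created object starts with one reference


# Scenario 1: base is a fresh capsule; the numpy array will be its only owner.
capsule = Obj()              # refcount == 1
# PyArray_SetBaseObject "steals" this reference: the count stays 1 and the
# array becomes the sole owner. The old code then incref'd unconditionally:
capsule.refcount += 1        # the bug: refcount == 2
capsule.refcount -= 1        # the array is destroyed and decrefs its base
leaked = capsule.refcount    # one reference left that nobody owns: a leak

# Scenario 2: base is py_ref_, passed in by a caller that keeps its own
# reference. Here the extra incref is required to compensate for the steal.
py_ref = Obj()               # caller's reference, refcount == 1
py_ref.refcount += 1         # correct: account for the stolen reference
py_ref.refcount -= 1         # the array is destroyed and decrefs its base
caller_ok = py_ref.refcount == 1  # caller's own reference is still valid
```

This is why the fix makes the incref conditional (`if (base == py_ref_)`): the compensation is only needed when someone outside the array actually holds a reference to `base`.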
[jira] [Updated] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1973: -- Labels: pull-request-available (was: ) > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357320#comment-16357320 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167017997 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List to a numpy array + * + * 1. Create a 1D numpy array that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which leaves us with a + * refcount of 1, with nothing owning that 1 reference.
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ + if (base == py_ref_) { +Py_INCREF(base); Review comment: True, I can change. Most of the work here was understanding the accounting flow :) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357323#comment-16357323 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167018073 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List to a numpy array + * + * 1. Create a 1D numpy array that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which leaves us with a + * refcount of 1, with nothing owning that 1 reference.
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ Review comment: Yep This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357324#comment-16357324 ] ASF GitHub Bot commented on ARROW-1973: --- pitrou commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167018125 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -502,18 +502,20 @@ template inline Status ConvertListsLike(PandasOptions options, const std::shared_ptr& col, PyObject** out_values) { const ChunkedArray& data = *col->data().get(); - auto list_type = std::static_pointer_cast(col->type()); + const auto& list_type = static_cast(*col->type()); // Get column of underlying value arrays std::vector> value_arrays; for (int c = 0; c < data.num_chunks(); c++) { -auto arr = std::static_pointer_cast(data.chunk(c)); -value_arrays.emplace_back(arr->values()); +const auto& arr = static_cast(*data.chunk(c)); +value_arrays.emplace_back(arr.values()); } - auto flat_column = std::make_shared(list_type->value_field(), value_arrays); + auto flat_column = std::make_shared(list_type.value_field(), value_arrays); // TODO(ARROW-489): Currently we don't have a Python reference for single columns. //Storing a reference to the whole Array would be too expensive. - PyObject* numpy_array; + OwnedRef owned_numpy_array; Review comment: This one doesn't seem used. By passing `&numpy_array` below you're not changing the internal pointer. Perhaps use `ref()` instead? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes.
> -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
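pitrou's point — that passing the address of a separate local variable never updates the `OwnedRef`'s internal pointer, so the wrapper's cleanup runs on nothing — can be modeled with a short hypothetical sketch. The `OwnedRef` below is an illustrative stand-in, not Arrow's actual class:

```python
released = []

class OwnedRef:
    """Toy stand-in for an RAII reference holder: release() only acts on
    whatever is stored in the wrapper's own slot."""
    def __init__(self):
        self.obj = None

    def release(self):
        if self.obj is not None:
            released.append(self.obj)
            self.obj = None

def fill(setter):
    # Stand-in for the code that produces the numpy array and writes it
    # through the out-pointer it was handed.
    setter("numpy_array")

# Buggy pattern: the produced object lands in a separate local, so the
# wrapper's slot stays empty and release() is a no-op.
owned = OwnedRef()
result = {"numpy_array": None}
fill(lambda value: result.__setitem__("numpy_array", value))
owned.release()
assert released == []                    # the wrapper never owned anything

# Intended pattern (the "use ref()" suggestion): write into the wrapper
# itself, so its cleanup actually sees the object.
owned2 = OwnedRef()
fill(lambda value: setattr(owned2, "obj", value))
owned2.release()
assert released == ["numpy_array"]
```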
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357330#comment-16357330 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364197384 Ok, I'm afraid I don't know how to get the Windows Cython test to work. Here is the log: ``` pyarrow_cython_example.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __int64 __cdecl arrow::Array::length(void)const " (__imp_?length@Array@arrow@@QEBA_JXZ) C:\Users\appveyor\AppData\Local\Temp\1\pytest-of-appveyor\pytest-0\test_cython_api0\pyarrow_cython_example.cp35-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals ``` (from https://ci.appveyor.com/project/pitrou/arrow/build/1.0.60/job/aruj4pno67s4xpcf#L6242 ) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357337#comment-16357337 ] Wes McKinney commented on ARROW-2113: - [~michal.danko] as far as I understand the issue, this does not have to do with pyarrow in particular; it is a problem with the system configuration for using libhdfs, which is out of our control. We are loading {{libjvm}} and {{libhdfs}} at runtime and leaving it to {{libhdfs}} to initialize the JVM and load the relevant HDFS client JARs; it is evidently having some trouble with the {{CLASSPATH}}. You should be able to reproduce the issue from a standalone C program that uses libhdfs to connect to the cluster. Could you perhaps seek counsel from the Apache Hadoop community? > [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS > connection failed" > > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for
java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357341#comment-16357341 ] Wes McKinney commented on ARROW-2113: - I actually just remembered that we are setting that classpath from the output of {{hadoop --classpath}}, see: https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L116 So the reason that this is failing in the first instance is that {{hadoop}} is in the path, whereas in the second, it is setting the correct classpath. Either way the CLASSPATH you have set does not appear to have the requisite JAR files. It seems we should be more specific about detecting that Hadoop JARs are in the path. I will open a new bug report about this. > [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS > connection failed" > > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for
java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
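A stricter detection along the lines Wes suggests might look like the following sketch. The function name and the hadoop-common heuristic are illustrative assumptions, not pyarrow's actual implementation:

```python
import os

def classpath_has_hadoop_jars(classpath):
    """Hypothetical stricter check: require an actual Hadoop client JAR
    (e.g. hadoop-common-*.jar) rather than merely the substring 'hadoop'
    appearing somewhere in the CLASSPATH string."""
    entries = classpath.split(os.pathsep)
    return any(os.path.basename(entry).startswith("hadoop-common")
               for entry in entries)

# The reporter's CLASSPATH contains 'hadoop' only as a directory name,
# so it should be rejected:
assert not classpath_has_hadoop_jars("/tmp/test/lib/hadoop/lib/test.jar")
# An entry that is genuinely a Hadoop client JAR passes:
assert classpath_has_hadoop_jars("/opt/hadoop/share/hadoop-common-2.6.0.jar")
```

Matching on JAR file names rather than on path substrings avoids being fooled by directories that merely happen to be named "hadoop", as in the symlinked Cloudera layout from the report.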
[jira] [Updated] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2113: Summary: [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic (was: [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed") > [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the > classpath setting HDFS logic > - > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > Fix For: 0.9.0 > > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File 
"/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2113: Fix Version/s: 0.9.0
> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the
> classpath setting HDFS logic
> -
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
> Reporter: Michal Danko
> Priority: Major
> Fix For: 0.9.0
>
> Steps to replicate the issue:
> mkdir /tmp/test
> cd /tmp/test
> mkdir jars
> cd jars
> touch test1.jar
> mkdir -p ../lib/zookeeper
> cd ../lib/zookeeper
> ln -s ../../jars/test1.jar ./test1.jar
> ln -s test1.jar test.jar
> mkdir -p ../hadoop/lib
> cd ../hadoop/lib
> ln -s ../../../lib/zookeeper/test.jar ./test.jar
> (This part depends on your configuration; you need these values for pyarrow.hdfs to work:)
> (path to libjvm:)
> (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
> (path to libhdfs:)
> (export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Ends with error:
>
> loadFileSystems error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
> kerb_ticket=kerb_ticket, driver=driver)
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
> self._connect(host, port, user, kerb_ticket, driver)
> File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
> File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
> pyarrow.lib.ArrowIOError: HDFS connection failed
> -
>
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Works properly.
>
> I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. components.
> I would expect pyarrow to work with any form of path to the .jar.
> Note that these paths are not made up at random; they were copied from a Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow library for Oozie workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
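The renamed title points at the root cause: the classpath-detection heuristic treats any CLASSPATH containing the substring "hadoop" as already complete. The sketch below is a hypothetical reconstruction of that heuristic (the real pyarrow code differs in detail); it shows how a jar that merely lives under a hadoop/ directory defeats a substring check:

```python
def classpath_seems_complete(classpath):
    # Hypothetical sketch of the heuristic: if "hadoop" appears anywhere
    # in CLASSPATH, assume the Hadoop jars are already listed and skip
    # assembling a full classpath (e.g. via `hadoop classpath --glob`).
    return "hadoop" in classpath

# A lone jar that merely lives under a directory named "hadoop" passes,
# so the incomplete CLASSPATH is used as-is and the JVM later fails
# with NoClassDefFoundError:
assert classpath_seems_complete("/tmp/test/lib/hadoop/lib/test.jar")

# The identical jar reached through the zookeeper path fails the check,
# so the fallback logic builds a working classpath:
assert not classpath_seems_complete("/tmp/test/lib/zookeeper/test.jar")
```

A sturdier check might split CLASSPATH on os.pathsep and look for actual Hadoop jar entries rather than matching a substring anywhere in the string.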
[jira] [Commented] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357345#comment-16357345 ] Wes McKinney commented on ARROW-2113: - I renamed this JIRA to reflect the issue. If someone could submit a patch that would be very helpful
> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the
> classpath setting HDFS logic
> -
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Michal Danko
> Priority: Major
> Fix For: 0.9.0
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357411#comment-16357411 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167034173
## File path: cpp/src/arrow/python/arrow_to_pandas.cc
## @@ -502,18 +502,20 @@ template <typename ArrowType>
 inline Status ConvertListsLike(PandasOptions options, const std::shared_ptr<Column>& col,
                                PyObject** out_values) {
   const ChunkedArray& data = *col->data().get();
-  auto list_type = std::static_pointer_cast<ListType>(col->type());
+  const auto& list_type = static_cast<const ListType&>(*col->type());
   // Get column of underlying value arrays
   std::vector<std::shared_ptr<Array>> value_arrays;
   for (int c = 0; c < data.num_chunks(); c++) {
-    auto arr = std::static_pointer_cast<ListArray>(data.chunk(c));
-    value_arrays.emplace_back(arr->values());
+    const auto& arr = static_cast<const ListArray&>(*data.chunk(c));
+    value_arrays.emplace_back(arr.values());
   }
-  auto flat_column = std::make_shared<Column>(list_type->value_field(), value_arrays);
+  auto flat_column = std::make_shared<Column>(list_type.value_field(), value_arrays);
   // TODO(ARROW-489): Currently we don't have a Python reference for single columns.
   //                  Storing a reference to the whole Array would be too expensive.
-  PyObject* numpy_array;
+  OwnedRef owned_numpy_array;
Review comment: Yep, thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [Python] Memory leak when converting Arrow tables with array columns to
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
> Reporter: Alexey Strokach
> Assignee: Phillip Cloud
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> There appears to be a memory leak when using PyArrow to convert tables containing array columns to Pandas DataFrames.
> See the `test_memory_leak.py` example here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
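The fix above swaps a raw `PyObject*` for an `OwnedRef`, an RAII holder that releases the creator's reference however the function exits. The toy model below (illustrative Python with explicit refcounts, not Arrow or CPython code) shows the leak pattern such a change removes: the container takes its own reference, and the creator's original reference must still be released.

```python
class PyObj:
    """Toy object with a manual refcount, mimicking the CPython C API,
    where creating an object returns a new reference (refcount 1)."""
    def __init__(self):
        self.refcount = 1

def list_append(lst, obj):
    # Like PyList_Append: the container takes its own reference.
    obj.refcount += 1
    lst.append(obj)

def convert_leaky(n):
    # Leak pattern: each created object's original reference is never
    # released, so every element ends up one count too high.
    out = []
    for _ in range(n):
        obj = PyObj()
        list_append(out, obj)
    return out

def convert_fixed(n):
    # OwnedRef-style: the creator's reference is released when the owner
    # goes out of scope, leaving only the container's reference.
    out = []
    for _ in range(n):
        obj = PyObj()
        list_append(out, obj)
        obj.refcount -= 1
    return out

assert all(o.refcount == 2 for o in convert_leaky(3))  # one leaked count each
assert all(o.refcount == 1 for o in convert_fixed(3))  # balanced
```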
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357422#comment-16357422 ] ASF GitHub Bot commented on ARROW-2114: --- wesm closed pull request #1577: ARROW-2114: [Python] Pull latest docker manylinux1 image [skip appveyor] URL: https://github.com/apache/arrow/pull/1577 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance. As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64
index 1ade9ab10..919a32be7 100644
--- a/python/manylinux1/Dockerfile-x86_64
+++ b/python/manylinux1/Dockerfile-x86_64
@@ -14,7 +14,7 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
-FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-2087
+FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
 ADD arrow /arrow
 WORKDIR /arrow/cpp
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [Python] Pull latest docker manylinux1 image
>
> Key: ARROW-2114
> URL: https://issues.apache.org/jira/browse/ARROW-2114
> Project: Apache Arrow
> Issue Type: Task
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2114. - Resolution: Fixed Issue resolved by pull request 1577 [https://github.com/apache/arrow/pull/1577] > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357436#comment-16357436 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037985
## File path: cpp/src/arrow/table-test.cc
## @@ -588,6 +588,101 @@ TEST_F(TestRecordBatch, Slice) {
   }
 }
+
+TEST_F(TestRecordBatch, AddColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array2});
+
+  const RecordBatch& batch = *batch3;
+  std::shared_ptr<RecordBatch> result;
+
+  // Negative tests with invalid index
+  Status status = batch.AddColumn(5, field1, array1->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+  status = batch.AddColumn(-1, field1, array1->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  // Negative test with wrong length
+  auto longer_col = MakeRandomArray<Int32Array>(length + 1);
+  status = batch.AddColumn(0, field1, longer_col->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  // Negative test with mismatch type
+  status = batch.AddColumn(0, field1, array2->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  ASSERT_OK(batch.AddColumn(0, field1, array1->data(), &result));
+  ASSERT_TRUE(result->Equals(*batch1));
+
+  ASSERT_OK(batch.AddColumn(1, field3, array3->data(), &result));
+  ASSERT_TRUE(result->Equals(*batch2));
+}
+
+TEST_F(TestRecordBatch, RemoveColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2, field3});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field1, field3});
+  auto schema4 = ::arrow::schema({field1, field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2, array3});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array1, array3});
+  auto batch4 = RecordBatch::Make(schema4, length, {array1, array2});
+
+  const RecordBatch& batch = *batch1;
+  std::shared_ptr<RecordBatch> result;
+
+  ASSERT_OK(batch.RemoveColumn(0, &result));
+  ASSERT_TRUE(result->Equals(*batch2));
+
+  ASSERT_OK(batch.RemoveColumn(1, &result));
+  ASSERT_TRUE(result->Equals(*batch3));
+
+  ASSERT_OK(batch.RemoveColumn(2, &result));
+  ASSERT_TRUE(result->Equals(*batch4));
Review comment: Add a test for removing an out of bounds index This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
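For intuition about the contract these tests pin down — `AddColumn` and `RemoveColumn` leave the receiver untouched and produce a new batch — here is a plain-Python model, a sketch that represents a batch as a list of (name, values) pairs rather than Arrow's actual API:

```python
def add_column(batch, i, name, values):
    # Valid insertion positions are 0..len(batch) inclusive; the result
    # is a new batch, the input is never mutated.
    if not 0 <= i <= len(batch):
        raise IndexError("invalid column index")
    return batch[:i] + [(name, list(values))] + batch[i:]

def remove_column(batch, i):
    if not 0 <= i < len(batch):
        raise IndexError("invalid column index")
    return batch[:i] + batch[i + 1:]

batch3 = [("f2", [5, 6, 7])]
batch1 = add_column(batch3, 0, "f1", [1, 2, 3])   # insert at the front
batch2 = add_column(batch3, 1, "f3", [8, 9, 10])  # append at len(batch)
assert [name for name, _ in batch1] == ["f1", "f2"]
assert [name for name, _ in batch2] == ["f2", "f3"]
assert batch3 == [("f2", [5, 6, 7])]              # receiver unchanged
assert remove_column(batch1, 0) == batch3
```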
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357437#comment-16357437 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037093
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
Review comment: Pass `Array` here instead, since that's more likely to be what the user has This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357434#comment-16357434 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037375
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (!field->type()->Equals(column->type)) {
+      std::stringstream ss;
+      ss << "Column data type " << field->type()->name()
+         << " does not match field data type " << column->type->name();
+      return Status::Invalid(ss.str());
+    }
+    if (column->length != num_rows_) {
+      std::stringstream ss;
+      ss << "Added column's length must match record batch's length. Expected length "
+         << num_rows_ << " but got length " << column->length;
+      return Status::Invalid(ss.str());
+    }
+
+    std::shared_ptr<Schema> new_schema;
+    RETURN_NOT_OK(schema_->AddField(i, field, &new_schema));
Review comment: We could leave the boundschecking above to `Schema::AddField` -- could you also check whether that function has the issues described above? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add add/remove field functions for RecordBatch > --- > > Key: ARROW-969 > URL: https://issues.apache.org/jira/browse/ARROW-969 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Analogous to the Table equivalents -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357431#comment-16357431 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167036590
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
Review comment: I think this should be `i > num_columns()`. This is also a bug in `SimpleTable::AddColumn`. Can you add a test where `i == num_columns()`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
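The off-by-one flagged here is easiest to see in isolation: an insertion index may range from 0 to num_columns inclusive (appending at the end), so the guard should reject i > num_columns, while the reviewed check `i > num_columns() + 1` admits one invalid position. A minimal sketch (illustrative Python, not Arrow code):

```python
def valid_insert_index(i, num_columns):
    # Inserting may target any gap between columns, including the end,
    # so the valid positions are 0..num_columns inclusive.
    return 0 <= i <= num_columns

def buggy_guard_accepts(i, num_columns):
    # The guard under review: `if (i < 0 || i > num_columns() + 1)`
    # rejects the index; otherwise it is accepted.
    return not (i < 0 or i > num_columns + 1)

# The buggy guard lets the first out-of-range position slip through:
assert buggy_guard_accepts(3, num_columns=2) and not valid_insert_index(3, 2)
# Appending at the end (i == num_columns) is legitimately allowed:
assert buggy_guard_accepts(2, num_columns=2) and valid_insert_index(2, 2)
```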
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357432#comment-16357432 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167036916
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
Review comment: I think these should both be `DCHECK`, since null would indicate a problem with application logic, so should be a "can't fail" This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357438#comment-16357438 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037896
## File path: cpp/src/arrow/table-test.cc
## @@ -588,6 +588,101 @@ TEST_F(TestRecordBatch, Slice) {
   }
 }
+
+TEST_F(TestRecordBatch, AddColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array2});
+
+  const RecordBatch& batch = *batch3;
+  std::shared_ptr<RecordBatch> result;
+
+  // Negative tests with invalid index
+  Status status = batch.AddColumn(5, field1, array1->data(), &result);
Review comment: Add a test for `batch.AddColumn(2, ...)` to address the edge case in the implementation. We probably need a corresponding test for `Table` (and maybe also `Schema`). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add add/remove field functions for RecordBatch > --- > > Key: ARROW-969 > URL: https://issues.apache.org/jira/browse/ARROW-969 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Analogous to the Table equivalents -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357435#comment-16357435 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037500
## File path: cpp/src/arrow/record_batch.h
## @@ -96,6 +96,14 @@ class ARROW_EXPORT RecordBatch {
   /// \return an internal ArrayData object
   virtual std::shared_ptr<ArrayData> column_data(int i) const = 0;
+  /// \brief Add column to the record batch, producing a new RecordBatch
+  virtual Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                           const std::shared_ptr<ArrayData>& column,
+                           std::shared_ptr<RecordBatch>* out) const = 0;
+
+  /// \brief Remove column from the record batch, producing a new RecordBatch
+  virtual Status RemoveColumn(int i, std::shared_ptr<RecordBatch>* out) const = 0;
Review comment: Can you document the parameters for these functions? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357439#comment-16357439 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167038477
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
Review comment: I took a look at `SimpleTable::AddColumn`; there `col` is being null-checked -- I think that should also be a DCHECK This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers
[ https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357447#comment-16357447 ] ASF GitHub Bot commented on ARROW-1942: --- wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for small integers URL: https://github.com/apache/arrow/pull/1551#issuecomment-364220281 @xuepanchen I made the functor changes. Can you add a benchmark for the 8-bit integer case? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Hash table specializations for small integers > --- > > Key: ARROW-1942 > URL: https://issues.apache.org/jira/browse/ARROW-1942 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There is no need to use a dynamically-sized hash table with uint8, int8, > since a fixed-size lookup table can be used and avoid hashing altogether -- This message was sent by Atlassian JIRA (v7.6.3#76005)
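The idea behind ARROW-1942 — for 8-bit keys a fixed 256-entry table covers the whole key space, so no hashing, probing, or resizing is ever needed — can be sketched as follows (illustrative Python, not the C++ implementation under review):

```python
def dictionary_encode_uint8(values):
    # Slot v holds the dictionary index assigned to key v, or -1 if v
    # has not been seen yet. 256 slots cover every possible uint8 key,
    # so lookup is a direct index: no hash function, no collisions.
    table = [-1] * 256
    dictionary = []  # distinct values in first-seen order
    indices = []     # per-value dictionary index
    for v in values:
        if table[v] == -1:
            table[v] = len(dictionary)
            dictionary.append(v)
        indices.append(table[v])
    return dictionary, indices

dictionary, indices = dictionary_encode_uint8([7, 7, 200, 7, 0])
assert dictionary == [7, 200, 0]
assert indices == [0, 0, 1, 0, 2]
```

The same trick extends to int8 by offsetting the key into 0..255; beyond 16-bit keys the table grows too large and a real hash table wins again.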
[jira] [Commented] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357449#comment-16357449 ] Wes McKinney commented on ARROW-1501: - Cool, thanks > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Tracking JIRA for integration test-related issues -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357514#comment-16357514 ] ASF GitHub Bot commented on ARROW-1021: --- wesm commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364235254 I can take a look at the Windows issue (I have a machine to test on) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357576#comment-16357576 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364249569 This needed a clang-format This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357581#comment-16357581 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364250198 Hm, okay. I did run that. It's probably because I'm using clang 5
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357584#comment-16357584 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364250409 How do we decide when to upgrade? When it's released on ubuntu or some other slowish moving distro?
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357589#comment-16357589 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364251400 It looks like LLVM 5 has been promoted to stable (according to http://apt.llvm.org/) so I think we should upgrade our pin to clang 5.0
[jira] [Created] (ARROW-2117) [C++] Pin clang to version 5.0
Phillip Cloud created ARROW-2117: Summary: [C++] Pin clang to version 5.0 Key: ARROW-2117 URL: https://issues.apache.org/jira/browse/ARROW-2117 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Phillip Cloud Assignee: Phillip Cloud Let's do this after the next release.
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357616#comment-16357616 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364257181 Opened a JIRA for it: https://issues.apache.org/jira/browse/ARROW-2117
[jira] [Updated] (ARROW-987) [JS] Implement JSON writer for Integration tests
[ https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-987: Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Implement JSON writer for Integration tests > > > Key: ARROW-987 > URL: https://issues.apache.org/jira/browse/ARROW-987 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > Fix For: JS-0.3.0 > > > Rather than storing generated binary files in the repo, we could just run the > integration tests on the JS implementation.
[jira] [Updated] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1501: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 > > > Tracking JIRA for integration test-related issues
[jira] [Updated] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
[ https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1870: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Enable build scripts to work with NodeJS 6.10.2 LTS > > > Key: ARROW-1870 > URL: https://issues.apache.org/jira/browse/ARROW-1870 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 >
[jira] [Updated] (ARROW-2044) [JS] Typings should be a regular dependency
[ https://issues.apache.org/jira/browse/ARROW-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2044: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Typings should be a regular dependency > --- > > Key: ARROW-2044 > URL: https://issues.apache.org/jira/browse/ARROW-2044 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Brian Hulette >Priority: Minor > Labels: pull-request-available > Fix For: JS-0.3.0 > > > Currently some typings ({{@types/node}} and {{@types/flatbuffers}}) are > devDependencies rather than dependencies, which prevents {{.d.ts}} files from > being understood in downstream projects.
[jira] [Updated] (ARROW-1990) [JS] Add "DataFrame" object
[ https://issues.apache.org/jira/browse/ARROW-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1990: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Add "DataFrame" object > --- > > Key: ARROW-1990 > URL: https://issues.apache.org/jira/browse/ARROW-1990 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.3.0 > > > Add a TypeScript class that can perform optimized dataframe operations on an > arrow {{Table}} and/or {{StructVector}}. Initially this should include > operations like filtering, counting, and scanning. Eventually this class > could include more operations like sorting, count by/group by, etc...
[jira] [Updated] (ARROW-951) [JS] Add generated API documentation
[ https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-951: Fix Version/s: JS-0.3.0 > [JS] Add generated API documentation > > > Key: ARROW-951 > URL: https://issues.apache.org/jira/browse/ARROW-951 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Priority: Minor > Labels: documentation > Fix For: JS-0.3.0 > > > Maybe using http://typedoc.org ?
[jira] [Created] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file
Wes McKinney created ARROW-2118: --- Summary: [Python] Improve error message when calling parquet.read_table on an empty file Key: ARROW-2118 URL: https://issues.apache.org/jira/browse/ARROW-2118 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.9.0 Currently it raises an exception about memory mapping failing
[jira] [Created] (ARROW-2119) Handle Arrow stream with zero record batch
Jingyuan Wang created ARROW-2119: Summary: Handle Arrow stream with zero record batch Key: ARROW-2119 URL: https://issues.apache.org/jira/browse/ARROW-2119 Project: Apache Arrow Issue Type: Bug Reporter: Jingyuan Wang

It looks like currently many places of the code assume that there needs to be at least one record batch for the streaming format. Is a zero-record-batch stream not supported by design? e.g. [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]:

{code:none}
public static void convert(InputStream in, OutputStream out) throws IOException {
  BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
    VectorSchemaRoot root = reader.getVectorSchemaRoot();
    // load the first batch before instantiating the writer so that we have any dictionaries
    if (!reader.loadNextBatch()) {
      throw new IOException("Unable to read first record batch");
    }
    ...
{code}

Pyarrow 0.8.0 does not load a zero-record-batch stream either. It throws an exception originating from [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309]:

{code:none}
Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
                                std::shared_ptr<Table>* table) {
  if (batches.size() == 0) {
    return Status::Invalid("Must pass at least one record batch");
  }
  ...
{code}
[jira] [Updated] (ARROW-1918) [JS] Integration portion of verify-release-candidate.sh fails
[ https://issues.apache.org/jira/browse/ARROW-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1918: - Fix Version/s: JS-0.3.0 > [JS] Integration portion of verify-release-candidate.sh fails > - > > Key: ARROW-1918 > URL: https://issues.apache.org/jira/browse/ARROW-1918 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.8.0 >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 > > > I'm going to temporarily disable this in my fixes in ARROW-1917
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357805#comment-16357805 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-344361239

This is currently a WIP; the Scala/Java tests are able to run. Left TODO:
- [x] Run PySpark tests
- [ ] Verify working with docker-compose and existing volumes in arrow/dev
- [x] Check why Zinc is unable to run in the mvn build; need to enable port 3030?
- [ ] Speed up the pyarrow build using the conda prefix as toolchain

> [Java] Add dockerized test setup to validate Spark integration
>
> Key: ARROW-1579
> URL: https://issues.apache.org/jira/browse/ARROW-1579
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java - Vectors
> Reporter: Wes McKinney
> Assignee: Bryan Cutler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> cc [~bryanc] -- the goal of this will be to validate master-to-master to catch any regressions in the Spark integration
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357806#comment-16357806 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364306888 Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below..
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357808#comment-16357808 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364306888 Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below.. Btw, to get the correct `pyarrow.__version__` from the dev env, you do need to have all git tags fetched and install `setuptools_scm` from pip or conda. @xhochy , `setuptools_scm` wasn't listed in any of the developer docs I could find, should it be added to the list of dependent packages for setting up a conda env?
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357810#comment-16357810 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on a change in pull request #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#discussion_r167120268

## File path: python/pyarrow/__init__.py ##
@@ -24,7 +24,7 @@
 # package is not installed
 try:
     import setuptools_scm
-    __version__ = setuptools_scm.get_version('../')
+    __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)

Review comment: @xhochy and @wesm , I needed to change this because it would only give a version if run under the ARROW_HOME/python directory. So when running Spark tests, importing pyarrow would return `None` for the version. Making it relative to `__file__` seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357819#comment-16357819 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364309234 @xhochy , I could not get Arrow C++ to build with `export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX`, I would get a linking error with gflags like "undefined reference google::FlagRegisterer::FlagRegisterer". I thought maybe it was because I wasn't using g++ 4.9, but I had no luck trying to get 4.9 installed since the base image I'm using is Ubuntu 16.04. Have you ever run into this? It seemed like it was some kind of template constructor that it couldn't find..
[jira] [Created] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
rip.nsk created ARROW-2120: -- Summary: Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties Key: ARROW-2120 URL: https://issues.apache.org/jira/browse/ARROW-2120 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: rip.nsk Assignee: rip.nsk
[jira] [Commented] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
[ https://issues.apache.org/jira/browse/ARROW-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357833#comment-16357833 ] ASF GitHub Bot commented on ARROW-2120: --- rip-nsk opened a new pull request #1580: ARROW-2120: [C++] Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties URL: https://github.com/apache/arrow/pull/1580 > Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties > - > > Key: ARROW-2120 > URL: https://issues.apache.org/jira/browse/ARROW-2120 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: rip.nsk >Assignee: rip.nsk >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
[ https://issues.apache.org/jira/browse/ARROW-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2120: -- Labels: pull-request-available (was: )
[jira] [Created] (ARROW-2121) Consider special casing object arrays in pandas serializers.
Robert Nishihara created ARROW-2121: --- Summary: Consider special casing object arrays in pandas serializers. Key: ARROW-2121 URL: https://issues.apache.org/jira/browse/ARROW-2121 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Robert Nishihara
[jira] [Updated] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2121: -- Labels: pull-request-available (was: ) > Consider special casing object arrays in pandas serializers. > > > Key: ARROW-2121 > URL: https://issues.apache.org/jira/browse/ARROW-2121 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Robert Nishihara >Priority: Major > Labels: pull-request-available >
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357951#comment-16357951 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara opened a new pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581

The goal here is to get the best of both the `pandas_serialization_context` (speed at serializing pandas dataframes containing strings and other objects) and the `default_serialization_context` (correctly serializing a large class of numpy object arrays).

This PR sort of messes up the function `pa.pandas_compat.dataframe_to_serialized_dict`. Is that function just a helper for implementing the custom pandas serializers, or is it intended to be used in other places?

TODO in this PR (assuming you think this approach is reasonable):
- [ ] remove `pandas_serialization_context`
- [ ] make sure this code path is tested
- [ ] double check that performance is good

cc @wesm @pcmoritz @devin-petersohn
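The premise of the special case is that pickle round-trips arbitrary object ndarrays that Arrow has no single column type for; a standalone illustration of that premise (not the PR's actual code path):

```python
import pickle

import numpy as np

# An object ndarray mixing types that no single Arrow column type can store
values = np.array([{"a": 1}, [1, 2], "text"], dtype=object)

# Pickling the raw values preserves them exactly
restored = pickle.loads(pickle.dumps(values))
```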
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357967#comment-16357967 ] ASF GitHub Bot commented on ARROW-2121: --- wesm commented on issue #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#issuecomment-364344672 Well, we need to preserve the zero-copy pandas reads. Now that our ASV benchmarking setup has been rehabilitated we should be able to do that in this patch to verify performance
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357976#comment-16357976 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara commented on a change in pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#discussion_r167148817

## File path: python/pyarrow/pandas_compat.py ##
@@ -421,11 +421,16 @@ def dataframe_to_serialized_dict(frame):
         block_data.update(dictionary=values.categories,
                           ordered=values.ordered)
         values = values.codes

         block_data.update(
             placement=block.mgr_locs.as_array,
             block=values
         )
+
+        # If we are dealing with an object array, pickle it instead.
+        if isinstance(block, _int.ObjectBlock):
+            block_data['object'] = None
+            block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358038#comment-16358038 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara commented on a change in pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#discussion_r167156435

## File path: python/pyarrow/pandas_compat.py ##
@@ -421,11 +421,18 @@ def dataframe_to_serialized_dict(frame):
         block_data.update(dictionary=values.categories,
                           ordered=values.ordered)
         values = values.codes

         block_data.update(
             placement=block.mgr_locs.as_array,
             block=values
         )
+
+        # If we are dealing with an object array, pickle it instead. Note that
+        # we do not use isinstance here because _int.CategoricalBlock is a
+        # subclass of _int.ObjectBlock.
+        if type(block) == _int.ObjectBlock:
+            block_data['object'] = None
+            block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
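The exact-type check in this diff matters because `isinstance` also matches subclasses; a minimal illustration with stand-in classes (hypothetical names mirroring the pandas internals mentioned in the comment):

```python
class ObjectBlock:
    pass

class CategoricalBlock(ObjectBlock):
    # mirrors pandas, where CategoricalBlock subclasses ObjectBlock
    pass

block = CategoricalBlock()

# isinstance matches the subclass, so categorical data would be pickled too
assert isinstance(block, ObjectBlock)

# an exact type check selects only true object blocks
assert type(block) is not ObjectBlock
assert type(ObjectBlock()) is ObjectBlock
```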
[jira] [Created] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.
Robert Nishihara created ARROW-2122:
---

Summary: Pyarrow fails to serialize dataframe with timestamp.
Key: ARROW-2122
URL: https://issues.apache.org/jira/browse/ARROW-2122
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Robert Nishihara

The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd
s = pa.serialize({code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.
[ https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Nishihara updated ARROW-2122:
---

Description:
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]})
s = pa.serialize(df).to_buffer()
new_df = pa.deserialize(s)  # this fails{code}
The last line fails with
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "serialization.pxi", line 441, in pyarrow.lib.deserialize
  File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
  File "serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
  File "serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in _deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in <listcomp>
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in _reconstruct_block
    dtype = _make_datetimetz(item['timezone'])
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in _make_datetimetz
    return DatetimeTZDtype('ns', tz=tz)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", line 409, in __new__
    raise ValueError("DatetimeTZDtype constructor must have a tz "
ValueError: DatetimeTZDtype constructor must have a tz supplied{code}

was:
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd
s = pa.serialize({code}

> Pyarrow fails to serialize dataframe with timestamp.
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Robert Nishihara
> Priority: Major
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]})
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
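The traceback above shows `_reconstruct_block` passing a `None` timezone through to `DatetimeTZDtype`, which rejects it. A toy sketch of the failure pattern and the kind of guard that avoids it; the function and field names below are illustrative stand-ins, not pyarrow's actual code or its eventual fix:

```python
def make_datetimetz(tz):
    # Stand-in for pandas' DatetimeTZDtype: a tz of None is rejected.
    if tz is None:
        raise ValueError("DatetimeTZDtype constructor must have a tz supplied")
    return ("datetime64[ns]", tz)

def reconstruct_block(item):
    # Guard: only build a tz-aware dtype when timezone metadata is actually
    # present in the serialized block dict.
    tz = item.get("timezone")
    if tz is not None:
        return make_datetimetz(tz)
    return "datetime64[ns]"  # naive datetime dtype, no timezone

# A block serialized without timezone metadata no longer blows up:
assert reconstruct_block({"timezone": None}) == "datetime64[ns]"
assert reconstruct_block({"timezone": "Europe/Paris"}) == ("datetime64[ns]", "Europe/Paris")
```

The real question for the fix is upstream of the guard: why the serialized block carried `timezone: None` for a tz-aware column in the first place.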