[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356639#comment-16356639 ]

ASF GitHub Bot commented on ARROW-2083:
---

pitrou commented on a change in pull request #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568#discussion_r166860298

## File path: .travis.yml
## @@ -61,94 +63,110 @@ matrix:
 - export CXX="clang++-4.0"
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
 - $TRAVIS_BUILD_DIR/ci/travis_lint.sh
-- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [OS X] C++ & Python w/ XCode 6.4
 - compiler: clang
 language: cpp
 osx_image: xcode6.4
 os: osx
 cache:
 addons:
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_PLASMA=1
 - export ARROW_TRAVIS_ORC=1
 - export ARROW_BUILD_WARNING_LEVEL=CHECKIN
-- travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [manylinux1] Python
 - language: cpp
 before_script:
-- docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh; fi
+
 # Java w/ OpenJDK 7
 - language: java
 os: linux
 jdk: openjdk7
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
+- if [ $ARROW_CI_SITE_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_site.sh; fi

Review comment:
Oh, I see. I thought `mvn site` rebuilt the whole Web site. Apparently I was mistaken.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Support skipping builds
> ---
>
> Key: ARROW-2083
> URL: https://issues.apache.org/jira/browse/ARROW-2083
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Uwe L. Korn
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
>
> While appveyor supports a [skip appveyor], you cannot skip only travis. What
> is the feeling about adding e.g.
> [https://github.com/travis-ci/travis-ci/issues/5032#issuecomment-273626567]
> to our build? We could also do some simple kind of change detection so that we
> don't build the C++/Python parts, and only Java and the integration tests, if
> there was a change in the PR that only affects Java.
> I think it might be worthwhile to spend a bit of effort on that to take some
> load off the CI infrastructure.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
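The conditional pattern in the diff above relies on `ci/travis_detect_changes.py` printing `export ARROW_CI_*_AFFECTED=...` lines that each build job `eval`s. The following is only a minimal sketch of how such a detector might work; the component mapping, function names, and path prefixes are illustrative assumptions, not the actual contents of that script.

```python
# Hypothetical sketch of a change-detection script in the spirit of
# ci/travis_detect_changes.py. It maps changed file paths to affected
# components and prints shell "export" lines for the build to eval.

def affected_components(changed_files):
    """Return the set of components affected by the changed paths."""
    components = set()
    for path in changed_files:
        if path.startswith("cpp/"):
            # C++ changes also affect Python, which wraps the C++ library.
            components.update({"CPP", "PYTHON"})
        elif path.startswith("python/"):
            components.add("PYTHON")
        elif path.startswith("java/"):
            components.add("JAVA")
        elif path.startswith("site/"):
            components.add("SITE")
    return components


def export_lines(changed_files, all_components=("CPP", "PYTHON", "JAVA", "SITE")):
    """Emit one export line per component, 1 if affected else 0."""
    affected = affected_components(changed_files)
    return ["export ARROW_CI_%s_AFFECTED=%d" % (c, int(c in affected))
            for c in all_components]


if __name__ == "__main__":
    # Travis would run: eval `python travis_detect_changes.py`
    for line in export_lines(["java/pom.xml"]):
        print(line)
```

A Java-only PR would thus leave `ARROW_CI_CPP_AFFECTED=0`, and the `if [ $ARROW_CI_CPP_AFFECTED == "1" ]` guards in `.travis.yml` would skip the C++ build steps entirely.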
[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356642#comment-16356642 ]

ASF GitHub Bot commented on ARROW-2083:
---

pitrou commented on issue #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568#issuecomment-364039703

Ok, I addressed the review comment and rebased.
[jira] [Updated] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn updated ARROW-1632:
---
Fix Version/s: (was: 0.10.0) 0.9.0

> [Python] Permit categorical conversions in Table.to_pandas on a per-column
> basis
>
> Key: ARROW-1632
> URL: https://issues.apache.org/jira/browse/ARROW-1632
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> Currently this is all or nothing
[jira] [Assigned] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn reassigned ARROW-1632:
---
Assignee: Uwe L. Korn
[jira] [Commented] (ARROW-1632) [Python] Permit categorical conversions in Table.to_pandas on a per-column basis
[ https://issues.apache.org/jira/browse/ARROW-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356687#comment-16356687 ]

Uwe L. Korn commented on ARROW-1632:
---
Keeping this in 0.9 for now; I will have a look at whether I can get it done in time.
[jira] [Commented] (ARROW-1975) [C++] Add abi-compliance-checker to build process
[ https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356693#comment-16356693 ]

Uwe L. Korn commented on ARROW-1975:
---
[~wesmckinn] yes, checking this automatically would save me quite some follow-up work on releases.

> [C++] Add abi-compliance-checker to build process
> -
>
> Key: ARROW-1975
> URL: https://issues.apache.org/jira/browse/ARROW-1975
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> I would like to check our baseline modules with
> https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades
> are much smoother and that we don't break the ABI in patch releases.
> As we're still pre-1.0, I accept that there will be breakage, but I would like
> to keep it to a minimum. Currently the biggest pain with Arrow is that you need
> to always pin it in Python with {{==0.x.y}}, otherwise segfaults are
> inevitable.
[jira] [Created] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
Michal Danko created ARROW-2113:
---
Summary: [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
Key: ARROW-2113
URL: https://issues.apache.org/jira/browse/ARROW-2113
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
Reporter: Michal Danko

Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Ends with error:

loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
 File "", line 1, in
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed

-

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Works properly.

I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. segments. I would expect pyarrow to work with any definition of the path to a .jar.

Please note that the paths are not made up at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for oozie workflows.
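The report above hinges on a chain of relative symlinks. The sketch below (illustrative only; it exercises the filesystem, not libhdfs) rebuilds the reporter's layout in a temporary directory and shows that both CLASSPATH values resolve to the same real file, which suggests the failure lies in how the JVM/libhdfs classpath handling follows the links rather than in the links themselves.

```python
# Reproduce the reporter's symlink layout in a temp directory and show
# that both CLASSPATH values resolve to the same real file.
import os
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "jars"))
os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))

# touch jars/test1.jar
open(os.path.join(root, "jars", "test1.jar"), "w").close()

# ln -s ../../jars/test1.jar test1.jar ; ln -s test1.jar test.jar
zk = os.path.join(root, "lib", "zookeeper")
os.symlink("../../jars/test1.jar", os.path.join(zk, "test1.jar"))
os.symlink("test1.jar", os.path.join(zk, "test.jar"))

# ln -s ../../../lib/zookeeper/test.jar test.jar
hl = os.path.join(root, "lib", "hadoop", "lib")
os.symlink("../../../lib/zookeeper/test.jar", os.path.join(hl, "test.jar"))

deep = os.path.realpath(os.path.join(hl, "test.jar"))     # the failing CLASSPATH
shallow = os.path.realpath(os.path.join(zk, "test.jar"))  # the working CLASSPATH

# Both chains of relative links resolve to the same real file.
assert deep == shallow == os.path.join(os.path.realpath(root), "jars", "test1.jar")
```

Since `os.path.realpath` resolves both paths identically, the filesystem itself is consistent; one possible workaround along these lines would be to resolve CLASSPATH entries to their real paths before handing them to the JVM.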
[jira] [Commented] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356750#comment-16356750 ]

ASF GitHub Bot commented on ARROW-2083:
---

xhochy closed pull request #1568: ARROW-2083: [CI] Detect changed components on Travis-CI
URL: https://github.com/apache/arrow/pull/1568

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/.travis.yml b/.travis.yml
index 58d6786aa..d591a9922 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -46,11 +46,13 @@ matrix:
 allow_failures:
 - jdk: oraclejdk9
 include:
+ # C++ & Python w/ clang 4.0
 - compiler: gcc
 language: cpp
 os: linux
 group: deprecated
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_VALGRIND=1
 - export ARROW_TRAVIS_PLASMA=1
@@ -61,12 +63,13 @@ matrix:
 - export CXX="clang++-4.0"
 - $TRAVIS_BUILD_DIR/ci/travis_install_clang_tools.sh
 - $TRAVIS_BUILD_DIR/ci/travis_lint.sh
-- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [OS X] C++ & Python w/ XCode 6.4
 - compiler: clang
 language: cpp
 osx_image: xcode6.4
@@ -74,81 +77,96 @@ matrix:
 cache:
 addons:
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - export ARROW_TRAVIS_USE_TOOLCHAIN=1
 - export ARROW_TRAVIS_PLASMA=1
 - export ARROW_TRAVIS_ORC=1
 - export ARROW_BUILD_WARNING_LEVEL=CHECKIN
-- travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then travis_wait 50 $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7
-- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6
+- if [ $ARROW_CI_CPP_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_build_parquet_cpp.sh; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 2.7; fi
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6; fi
+
 # [manylinux1] Python
 - language: cpp
 before_script:
-- docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest; fi
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh
+- if [ $ARROW_CI_PYTHON_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh; fi
+
 # Java w/ OpenJDK 7
 - language: java
 os: linux
 jdk: openjdk7
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_javadoc.sh; fi
+
 # Java w/ Oracle JDK 9
 - language: java
 os: linux
-env: ARROW_TRAVIS_SKIP_SITE=yes
 jdk: oraclejdk9
+before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 script:
-- $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
+- if [ $ARROW_CI_JAVA_AFFECTED == "1" ]; then $TRAVIS_BUILD_DIR/ci/travis_script_java.sh; fi
 addons:
 apt:
 packages:
 - oracle-java9-installer
+
 # Integration w/ OpenJDK 8
 - language: java
 os: linux
 env: ARROW_TEST_GROUP=integration
 jdk: openjdk8
 before_script:
+- eval `python $TRAVIS_BUILD_DIR/ci/travis_detect_changes.py`
 - source $TRAVIS_BUILD_DIR/ci/t
[jira] [Resolved] (ARROW-2083) Support skipping builds
[ https://issues.apache.org/jira/browse/ARROW-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe L. Korn resolved ARROW-2083.
---
Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1568 [https://github.com/apache/arrow/pull/1568]
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356752#comment-16356752 ]

ASF GitHub Bot commented on ARROW-1425:
---

xhochy commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364062240

@icexelloss @wesm Keep it in Python for now. In the future, we should merge all documentation into a single Sphinx setup. As long as we have not done this, Python is a good default place, as it is already on Sphinx and is currently the most detailed documentation.

> [Python] Document semantic differences between Spark timestamps and Arrow
> timestamps
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Heimir Thor Sverrisson
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> The way that Spark treats non-timezone-aware timestamps as session local can
> be problematic when using pyarrow, which may view the data coming from
> toPandas() as time zone naive (but with fields as though it were UTC, not
> session local). We should document carefully how to properly handle the data
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356767#comment-16356767 ]

ASF GitHub Bot commented on ARROW-1425:
---

ts-dpb commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems
URL: https://github.com/apache/arrow/pull/1575#issuecomment-364069670

It was puzzling to the author and me where to place the new piece of documentation – we looked for a top-level doc directory but there was none.
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356768#comment-16356768 ]

Antoine Pitrou commented on ARROW-1021:
---
What is the status of {{arrow/python/api.h}}? It looks more like an internal helper compared to {{arrow/python/pyarrow.h}}.

> [Python] Add documentation about using pyarrow from other Cython and C++
> projects
> -
>
> Key: ARROW-1021
> URL: https://issues.apache.org/jira/browse/ARROW-1021
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Fix For: 0.9.0
>
> Follow up work to ARROW-819, ARROW-714
[jira] [Updated] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1021:
---
Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356785#comment-16356785 ]

ASF GitHub Bot commented on ARROW-1021:
---

pitrou opened a new pull request #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API
URL: https://github.com/apache/arrow/pull/1576
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356809#comment-16356809 ]

Antoine Pitrou commented on ARROW-1021:
---
By the way, what's the intended use of {{pyarrow/public-api.pxi}}? The hyphen makes it non-cimportable:

{code}
Error compiling Cython file:
...
from pyarrow.public-api cimport *
                   ^
ttt.pyx:2:19: Expected 'import' or 'cimport'
{code}
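The error quoted above follows directly from Python's identifier rules, which Cython module paths inherit: a dotted import path is a sequence of identifiers, and identifiers may not contain hyphens. (A `.pxi` file is textually included rather than cimported, so the hyphenated name is only a problem when one tries to treat it as a module.) A quick check, independent of pyarrow itself:

```python
# Module names in an import/cimport statement must be valid Python
# identifiers; str.isidentifier() applies exactly that rule.
print("public-api".isidentifier())   # False: hyphens are not allowed
print("public_api".isidentifier())   # True: underscore variant is fine
```

This is why renaming such a file to an underscore form (or keeping it as an include-only `.pxi`) sidesteps the "Expected 'import' or 'cimport'" parse error.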
[jira] [Updated] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michal Danko updated ARROW-2113:
---
Description:

Steps to replicate the issue:

mkdir /tmp/test
cd /tmp/test
mkdir jars
cd jars
touch test1.jar
mkdir -p ../lib/zookeeper
cd ../lib/zookeeper
ln -s ../../jars/test1.jar ./test1.jar
ln -s test1.jar test.jar
mkdir -p ../hadoop/lib
cd ../hadoop/lib
ln -s ../../../lib/zookeeper/test.jar ./test.jar

(this part depends on your configuration; you need these values for pyarrow.hdfs to work:)
(path to libjvm:)
(export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
(path to libhdfs:)
(export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)

export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Ends with error:

loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError)
Traceback (most recent call last):
 File "", line 1, in
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver)
 File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver)
 File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
 File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
pyarrow.lib.ArrowIOError: HDFS connection failed

-

export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
python
import pyarrow.hdfs as hdfs;
fs = hdfs.connect(user="hdfs")

Works properly.

I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. segments. I would expect pyarrow to work with any definition of the path to a .jar.

Please note that the paths are not made up at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for oozie workflows.

> [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS
> connection failed"
> -
[jira] [Updated] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Danko updated ARROW-2113: Description: Steps to replicate the issue: mkdir /tmp/test cd /tmp/test mkdir jars cd jars touch test1.jar mkdir -p ../lib/zookeeper cd ../lib/zookeeper ln -s ../../jars/test1.jar ./test1.jar ln -s test1.jar test.jar mkdir -p ../hadoop/lib cd ../hadoop/lib ln -s ../../../lib/zookeeper/test.jar ./test.jar (this part depends on your configuration you need those values for pyarrow.hdfs to work: ) (path to libjvm: ) (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) (path to libhdfs: ) (export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" python import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs") Ends with error: loadFileSystems error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError) hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error: (unable to get root cause for java.lang.NoClassDefFoundError) (unable to get stack trace for java.lang.NoClassDefFoundError) Traceback (most recent call last): ( File "", line 1, in File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect kerb_ticket=kerb_ticket, driver=driver) File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__ self._connect(host, port, user, kerb_ticket, driver) File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) pyarrow.lib.ArrowIOError: HDFS connection failed - export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" python import pyarrow.hdfs as hdfs; fs = hdfs.connect(user="hdfs") 
Works properly. I can't find a reason why the first CLASSPATH fails while the second one works: both are paths to the same .jar, the first just goes through one extra symlink. To me it looks like pyarrow's classpath handling has a problem with symlink chains that resolve through many levels of ../ . I would expect pyarrow to work with any valid path to the .jar. Please note that these paths are not chosen at random; they are copied from the Cloudera distribution of Hadoop (the original file was zookeeper.jar). Because of this issue, our customer currently can't use the pyarrow lib for Oozie workflows.
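The surprising part of the report is that both CLASSPATH values point at the same real file, so they should behave identically. That claim can be verified with the standard library alone; this sketch rebuilds the symlink chain under a temporary directory instead of /tmp/test:

```python
import os
import tempfile

# Recreate the symlink chain from the report inside a temporary directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "jars"))
open(os.path.join(root, "jars", "test1.jar"), "w").close()

os.makedirs(os.path.join(root, "lib", "zookeeper"))
os.symlink("../../jars/test1.jar",
           os.path.join(root, "lib", "zookeeper", "test1.jar"))
os.symlink("test1.jar",
           os.path.join(root, "lib", "zookeeper", "test.jar"))

os.makedirs(os.path.join(root, "lib", "hadoop", "lib"))
os.symlink("../../../lib/zookeeper/test.jar",
           os.path.join(root, "lib", "hadoop", "lib", "test.jar"))

failing = os.path.join(root, "lib", "hadoop", "lib", "test.jar")   # first CLASSPATH
working = os.path.join(root, "lib", "zookeeper", "test.jar")       # second CLASSPATH

# Both chains terminate at jars/test1.jar, which is why the reporter
# expects the two CLASSPATH values to be interchangeable.
same = os.path.realpath(failing) == os.path.realpath(working)
```

Since the OS resolves both chains to the same file, the difference in behaviour has to come from how the JVM classpath loader (not the filesystem) treats the multi-level `../` links.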
[jira] [Commented] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type
[ https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356878#comment-16356878 ] ASF GitHub Bot commented on ARROW-2073: --- xhochy closed pull request #1572: ARROW-2073: [Python] Create struct array from sequence of tuples URL: https://github.com/apache/arrow/pull/1572 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 1e431c29f..f0e5449b6 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -771,18 +771,21 @@ class StructConverter : public TypedConverterVisitorAppend()); -if (!PyDict_Check(obj)) { - return Status::TypeError("dict value expected for struct type"); +// Note heterogenous sequences are not allowed +if (ARROW_PREDICT_FALSE(source_kind_ == UNKNOWN)) { + if (PyDict_Check(obj)) { +source_kind_ = DICTS; + } else if (PyTuple_Check(obj)) { +source_kind_ = TUPLES; + } } -// NOTE we're ignoring any extraneous dict items -for (int i = 0; i < num_fields_; i++) { - PyObject* nameobj = PyList_GET_ITEM(field_name_list_.obj(), i); - PyObject* valueobj = PyDict_GetItem(obj, nameobj); // borrowed - RETURN_IF_PYERROR(); - RETURN_NOT_OK(value_converters_[i]->AppendSingle(valueobj ? valueobj : Py_None)); +if (PyDict_Check(obj) && source_kind_ == DICTS) { + return AppendDictItem(obj); +} else if (PyTuple_Check(obj) && source_kind_ == TUPLES) { + return AppendTupleItem(obj); +} else { + return Status::TypeError("Expected sequence of dicts or tuples for struct type"); } - -return Status::OK(); } // Append a missing item @@ -797,9 +800,33 @@ class StructConverter : public TypedConverterVisitorAppendSingle(valueobj ? 
valueobj : Py_None)); +} +return Status::OK(); + } + + Status AppendTupleItem(PyObject* obj) { +if (PyTuple_GET_SIZE(obj) != num_fields_) { + return Status::Invalid("Tuple size must be equal to number of struct fields"); +} +for (int i = 0; i < num_fields_; i++) { + PyObject* valueobj = PyTuple_GET_ITEM(obj, i); + RETURN_NOT_OK(value_converters_[i]->AppendSingle(valueobj)); +} +return Status::OK(); + } + std::vector> value_converters_; OwnedRef field_name_list_; int num_fields_; + // Whether we're converting from a sequence of dicts or tuples + enum { UNKNOWN, DICTS, TUPLES } source_kind_ = UNKNOWN; }; class DecimalConverter diff --git a/python/benchmarks/convert_builtins.py b/python/benchmarks/convert_builtins.py index 92b2b850f..a4dc9f262 100644 --- a/python/benchmarks/convert_builtins.py +++ b/python/benchmarks/convert_builtins.py @@ -144,11 +144,21 @@ def generate_int_list_list(self, n, min_size, max_size, partial(self.generate_int_list, none_prob=none_prob), n, min_size, max_size, none_prob) +def generate_tuple_list(self, n, none_prob=DEFAULT_NONE_PROB): +""" +Generate a list of tuples with random values. +Each tuple has the form `(int value, float value, bool value)` +""" +dicts = self.generate_dict_list(n, none_prob=none_prob) +tuples = [(d.get('u'), d.get('v'), d.get('w')) + if d is not None else None + for d in dicts] +assert len(tuples) == n +return tuples def generate_dict_list(self, n, none_prob=DEFAULT_NONE_PROB): """ -Generate a list of dicts with a random size between *min_size* and -*max_size*. +Generate a list of dicts with random values. 
Each dict has the form `{'u': int value, 'v': float value, 'w': bool value}` """ ints = self.generate_int_list(n, none_prob=none_prob) @@ -179,12 +189,14 @@ def get_type_and_builtins(self, n, type_name): """ size = None -if type_name in ('bool', 'ascii', 'unicode', 'int64 list', 'struct'): +if type_name in ('bool', 'ascii', 'unicode', 'int64 list'): kind = type_name elif type_name.startswith(('int', 'uint')): kind = 'int' elif type_name.startswith('float'): kind = 'float' +elif type_name.startswith('struct'): +kind = 'struct' elif type_name == 'binary': kind = 'varying binary' elif type_name.startswith('binary'): @@ -226,6 +238,7 @@ def get_type_and_builtins(self, n, type_name): 'int64 list': partial(self.generate_int_list_list, min_size=0, max_size=20), 'struct': sel
[jira] [Resolved] (ARROW-2073) [Python] Create StructArray from sequence of tuples given a known data type
[ https://issues.apache.org/jira/browse/ARROW-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn resolved ARROW-2073. Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1572 [https://github.com/apache/arrow/pull/1572] > [Python] Create StructArray from sequence of tuples given a known data type > --- > > Key: ARROW-2073 > URL: https://issues.apache.org/jira/browse/ARROW-2073 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Following ARROW-1705, we should support calling {{pa.array}} with a sequence > of tuples, presuming a struct type is passed for the {{type}} parameter. > We also probably want to disallow mixed inputs, e.g. a sequence of both dicts > and tuples. The user should use only one idiom at a time. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
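The converter in PR 1572 locks onto the kind of the first non-null element (dict or tuple) and rejects heterogeneous sequences. The dispatch rule can be sketched in plain Python; `check_struct_sequence` is a hypothetical helper mimicking the C++ logic, not part of the pyarrow API:

```python
def check_struct_sequence(seq):
    """Mimic StructConverter's source_kind_ dispatch: the first non-None
    element fixes the accepted kind, and mixed inputs raise TypeError."""
    source_kind = None  # corresponds to the UNKNOWN state in the C++ code
    for obj in seq:
        if obj is None:
            # Missing items are always allowed (AppendNull in the C++ code).
            continue
        if source_kind is None:
            if isinstance(obj, dict):
                source_kind = dict
            elif isinstance(obj, tuple):
                source_kind = tuple
        if source_kind is None or not isinstance(obj, source_kind):
            raise TypeError(
                "Expected sequence of dicts or tuples for struct type")
    return source_kind
```

So `check_struct_sequence([(1, 'a'), None, (2, 'b')])` settles on tuples, while a sequence mixing dicts and tuples is rejected, matching the "one idiom at a time" requirement in the issue description.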
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356889#comment-16356889 ] Uwe L. Korn commented on ARROW-1021: {{.pxi}} files are not meant to be used directly. They all render into {{pyarrow.lib}} (see the includes in {{pyarrow/lib.pyx}}) > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356896#comment-16356896 ] Antoine Pitrou commented on ARROW-1021: --- Thanks. So, IIUC, 3rd party Cython code is expected to use only the symbols defined as {{cdef public}} in {{lib.pxd}}? > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2114) [Python] Pull latest docker manylinux1 image
Uwe L. Korn created ARROW-2114: -- Summary: [Python] Pull latest docker manylinux1 image Key: ARROW-2114 URL: https://issues.apache.org/jira/browse/ARROW-2114 Project: Apache Arrow Issue Type: Task Reporter: Uwe L. Korn Assignee: Uwe L. Korn Fix For: 0.9.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356908#comment-16356908 ] Uwe L. Korn commented on ARROW-2114: [~wesmckinn] These changes are minimal and only an artifact of the docker maintenance. Are you ok when in future I don't make tickets for them? (They shouldn't show up in the changelog) > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356909#comment-16356909 ] ASF GitHub Bot commented on ARROW-2114: --- xhochy opened a new pull request #1577: ARROW-2114: [Python] Pull latest docker manylinux1 image [skip appveyor] URL: https://github.com/apache/arrow/pull/1577 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2114: -- Labels: pull-request-available (was: ) > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356922#comment-16356922 ] Antoine Pitrou commented on ARROW-1021: --- I've tried to add a test for the Cython API: [https://github.com/apache/arrow/pull/1576/files#diff-8dbd260ac34efe0c510155d2a86c1405] Does that reflect the intended idiom for calling into that API? > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356935#comment-16356935 ] Uwe L. Korn commented on ARROW-1021: {quote}So, IIUC, 3rd party Cython code is expected to use only the symbols defined as {{cdef public}} in {{lib.pxd}}? {quote} Yes. {quote}Does that reflect the intended idiom for calling into that API? {quote} Also yes but until now I have only used that API with {{boost::python}} and {{pybind11}}. I will add that afterwards to the documentation. > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16356955#comment-16356955 ] Antoine Pitrou commented on ARROW-1021: --- Note it is currently required to also add the Numpy C include path: https://travis-ci.org/pitrou/arrow/jobs/338970086#L3616-L3623
{code}
In file included from pyarrow_cython_example.cpp:571:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/api.h:22:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/arrow_to_python.h:27:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/python_to_arrow.h:26:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/common.h:23:
In file included from /Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/config.h:23:
/Users/travis/build/pitrou/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/include/arrow/python/numpy_interop.h:23:10: fatal error: 'numpy/numpyconfig.h' file not found
#include <numpy/numpyconfig.h>
{code}
> [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
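The missing-header failure above is the usual symptom of compiling against pyarrow's C++ headers without also passing NumPy's include directory. A minimal third-party setup.py sketch, assuming pyarrow and numpy are installed; `my_arrow_ext` and its source file are hypothetical names, and exact link flags vary by platform and pyarrow version:

```python
# setup.py sketch: build an extension against the pyarrow C++ API.
from setuptools import setup, Extension

import numpy as np
import pyarrow as pa

ext = Extension(
    "my_arrow_ext",                 # hypothetical extension module name
    sources=["my_arrow_ext.cpp"],   # hypothetical source file
    # pyarrow's headers pull in numpy/numpyconfig.h, so both include
    # directories are needed to avoid the 'file not found' error above.
    include_dirs=[pa.get_include(), np.get_include()],
    libraries=["arrow", "arrow_python"],
    # get_library_dirs() may not exist in older pyarrow releases.
    library_dirs=pa.get_library_dirs() if hasattr(pa, "get_library_dirs") else [],
    language="c++",
)

setup(name="my-arrow-ext", ext_modules=[ext])
```

Running `python setup.py build_ext --inplace` with this configuration should get past the include error; any remaining failures would be link-time settings, which differ per platform.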
[jira] [Commented] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357061#comment-16357061 ] Brian Hulette commented on ARROW-1501: -- [~wesmckinn] the integration tests still only test our ability to consume arrow data with JS, so we may want to keep this open until we have a JS writer we can use. I'll create some more issues to track that side of things > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Tracking JIRA for integration test-related issues -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2115) [JS] Test arrow data production in integration test
Brian Hulette created ARROW-2115: Summary: [JS] Test arrow data production in integration test Key: ARROW-2115 URL: https://issues.apache.org/jira/browse/ARROW-2115 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette Currently the integration tests only treat the JS implementation as a consumer, and we also need to test its ability to produce arrow data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2115) [JS] Test arrow data production in integration test
[ https://issues.apache.org/jira/browse/ARROW-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2115: - Issue Type: Improvement (was: Bug) > [JS] Test arrow data production in integration test > --- > > Key: ARROW-2115 > URL: https://issues.apache.org/jira/browse/ARROW-2115 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > > Currently the integration tests only treat the JS implementation as a > consumer, and we also need to test its ability to produce arrow data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2116) [JS] Implement IPC writer
Brian Hulette created ARROW-2116: Summary: [JS] Implement IPC writer Key: ARROW-2116 URL: https://issues.apache.org/jira/browse/ARROW-2116 Project: Apache Arrow Issue Type: Bug Components: JavaScript Reporter: Brian Hulette -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2116) [JS] Implement IPC writer
[ https://issues.apache.org/jira/browse/ARROW-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2116: - Issue Type: Improvement (was: Bug) > [JS] Implement IPC writer > - > > Key: ARROW-2116 > URL: https://issues.apache.org/jira/browse/ARROW-2116 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2116) [JS] Implement IPC writer
[ https://issues.apache.org/jira/browse/ARROW-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357088#comment-16357088 ] Brian Hulette commented on ARROW-2116: -- [~paul.e.taylor] didn't you work on a JS writer? > [JS] Implement IPC writer > - > > Key: ARROW-2116 > URL: https://issues.apache.org/jira/browse/ARROW-2116 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357119#comment-16357119 ] ASF GitHub Bot commented on ARROW-1425: --- wesm commented on issue #1575: ARROW-1425: [Python] Document Arrow timestamps, and interops w/ other systems URL: https://github.com/apache/arrow/pull/1575#issuecomment-364159244 We don't yet have a place (outside `format/`) for language-independent or cross-language documentation. This would be very helpful to get set up if we can agree as a community what tool to use to build this documentation This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Heimir Thor Sverrisson >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
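The pitfall ARROW-1425 asks to document can be shown with the standard library: Spark treats a naive timestamp as session-local wall-clock time, while a consumer that reads the same digits as UTC shifts every instant by the session offset. In this sketch the +02:00 session zone is an assumption chosen for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical Spark session timezone (assumption: UTC+02:00).
session_tz = timezone(timedelta(hours=2))

# A timezone-naive timestamp as it might come out of toPandas().
naive = datetime(2018, 2, 8, 12, 0, 0)

spark_view = naive.replace(tzinfo=session_tz)   # Spark: session-local wall time
utc_view = naive.replace(tzinfo=timezone.utc)   # naive consumer: same digits, read as UTC

# Both views show "12:00", but they denote instants two hours apart.
delta = utc_view - spark_view
```

The stored value never changes; only its interpretation does, which is why the discrepancy is silent and worth documenting explicitly.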
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357121#comment-16357121 ] Wes McKinney commented on ARROW-2114: - Sounds good to me, no need to create JIRAs for Docker image maintenance > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357155#comment-16357155 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364164874 This required a bit more churn than I expected (especially to get the Cython example and test to work). I think this is ready for review now. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357208#comment-16357208 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364175414 Hmm, there's still an AppVeyor failure. Will try to fix :-/ This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357304#comment-16357304 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167015825 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List.to a numpy array + * + * 1. Create a 1D numpy that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which will leaves us with a + * refcount of 1, with nothing owning that 1 reference. 
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ + if (base == py_ref_) { +Py_INCREF(base); Review comment: Another way to handle this would be to put the INCREF in the branch without the capsule. Then if `PyArray_SetBaseObject` fails, we decref `base` unconditionally (which will either destroy the capsule or reset the `py_ref_` ref count to what it was originally) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357303#comment-16357303 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167014864 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List.to a numpy array + * + * 1. Create a 1D numpy that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which will leaves us with a + * refcount of 1, with nothing owning that 1 reference. 
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ Review comment: Can you use C++-style comment with `//`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
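The refcount walkthrough quoted in the diff comment above can be modeled with plain integers. This is an illustrative toy, not the real CPython API; `Obj` and its counter stand in for a Python object's reference count:

```python
class Obj:
    """Toy model of a refcounted object for the two scenarios above."""
    def __init__(self):
        self.refcount = 1  # a freshly created object starts with one reference


# Scenario 1: base is a fresh capsule; the numpy array will be its only owner.
capsule = Obj()              # refcount == 1
# PyArray_SetBaseObject "steals" this reference: the count stays 1 and the
# array becomes the sole owner. The old code then incref'd unconditionally:
capsule.refcount += 1        # the bug: refcount == 2
capsule.refcount -= 1        # the array is destroyed and decrefs its base
leaked = capsule.refcount    # one reference left that nobody owns: a leak

# Scenario 2: base is py_ref_, passed in by a caller that keeps its own
# reference. Here the extra incref is required to compensate for the steal.
py_ref = Obj()               # caller's reference, refcount == 1
py_ref.refcount += 1         # correct: account for the stolen reference
py_ref.refcount -= 1         # the array is destroyed and decrefs its base
caller_ok = py_ref.refcount == 1  # caller's own reference is still valid
```

This is why the fix makes the incref conditional (`if (base == py_ref_)`): the compensation is only needed when someone outside the array actually holds a reference to `base`.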
[jira] [Updated] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1973: -- Labels: pull-request-available (was: ) > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357320#comment-16357320 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167017997 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List to a numpy array + * + * 1. Create a 1D numpy array that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which leaves us with a + * refcount of 1, with nothing owning that 1 reference.
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ + if (base == py_ref_) { +Py_INCREF(base); Review comment: True, I can change. Most of the work here was understanding the accounting flow :) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357323#comment-16357323 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167018073 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -1484,8 +1480,53 @@ class ArrowDeserializer { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); } else { - // PyArray_SetBaseObject steals our reference to base - Py_INCREF(base); + /* + * See ARROW-1973 for the original memory leak report. + * + * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr + * + * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's + * constructor) + * + * In this case, the stolen reference must not be incremented since nothing + * outside of the PyArrayObject* (the arr_ member) is holding a reference to + * it. If we increment this, then we have a memory leak. + * + * + * Here's an example of how memory can be leaked when converting an arrow Array + * of List to a numpy array + * + * 1. Create a 1D numpy array that is the flattened arrow array. + * + * There's nothing outside of the serializer that owns this new numpy array. + * + * 2. Make a capsule for the base array. + * + * The reference count of base is 1. + * + * 3. Call PyArray_SetBaseObject(arr_, base) + * + * The reference count is still 1, because the reference is stolen. + * + * 4. Increment the reference count of base (unconditionally) + * + * The reference count is now 2. This is okay if there's an object holding + * another reference. The PyArrayObject that stole the reference will + * eventually decrement the reference count, which leaves us with a + * refcount of 1, with nothing owning that 1 reference.
Memory leakage + * ensues. + * + * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's + * constructor) + * + * This case is simpler. We assume that the reference accounting is correct + * coming in. We need to preserve that accounting knowing that the + * PyArrayObject that stole the reference will eventually decref it, thus we + * increment the reference count. + */ Review comment: Yep This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357324#comment-16357324 ] ASF GitHub Bot commented on ARROW-1973: --- pitrou commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167018125 ## File path: cpp/src/arrow/python/arrow_to_pandas.cc ## @@ -502,18 +502,20 @@ template inline Status ConvertListsLike(PandasOptions options, const std::shared_ptr& col, PyObject** out_values) { const ChunkedArray& data = *col->data().get(); - auto list_type = std::static_pointer_cast(col->type()); + const auto& list_type = static_cast(*col->type()); // Get column of underlying value arrays std::vector> value_arrays; for (int c = 0; c < data.num_chunks(); c++) { -auto arr = std::static_pointer_cast(data.chunk(c)); -value_arrays.emplace_back(arr->values()); +const auto& arr = static_cast(*data.chunk(c)); +value_arrays.emplace_back(arr.values()); } - auto flat_column = std::make_shared(list_type->value_field(), value_arrays); + auto flat_column = std::make_shared(list_type.value_field(), value_arrays); // TODO(ARROW-489): Currently we don't have a Python reference for single columns. //Storing a reference to the whole Array would be too expensive. - PyObject* numpy_array; + OwnedRef owned_numpy_array; Review comment: This one doesn't seem used. By passing `&numpy_array` below you're not changing the internal pointer. Perhaps use `ref()` instead? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes.
> -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
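pitrou's point — that passing the address of a separate local variable never updates the `OwnedRef`'s internal pointer, so the wrapper's cleanup runs on nothing — can be modeled with a short hypothetical sketch. The `OwnedRef` below is an illustrative stand-in, not Arrow's actual class:

```python
released = []

class OwnedRef:
    """Toy stand-in for an RAII reference holder: release() only acts on
    whatever is stored in the wrapper's own slot."""
    def __init__(self):
        self.obj = None

    def release(self):
        if self.obj is not None:
            released.append(self.obj)
            self.obj = None

def fill(setter):
    # Stand-in for the code that produces the numpy array and writes it
    # through the out-pointer it was handed.
    setter("numpy_array")

# Buggy pattern: the produced object lands in a separate local, so the
# wrapper's slot stays empty and release() is a no-op.
owned = OwnedRef()
result = {"numpy_array": None}
fill(lambda value: result.__setitem__("numpy_array", value))
owned.release()
assert released == []                    # the wrapper never owned anything

# Intended pattern (the "use ref()" suggestion): write into the wrapper
# itself, so its cleanup actually sees the object.
owned2 = OwnedRef()
fill(lambda value: setattr(owned2, "obj", value))
owned2.release()
assert released == ["numpy_array"]
```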
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357330#comment-16357330 ] ASF GitHub Bot commented on ARROW-1021: --- pitrou commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364197384 Ok, I'm afraid I don't know how to get the Windows Cython test to work. Here is the log: ``` pyarrow_cython_example.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __int64 __cdecl arrow::Array::length(void)const " (__imp_?length@Array@arrow@@QEBA_JXZ) C:\Users\appveyor\AppData\Local\Temp\1\pytest-of-appveyor\pytest-0\test_cython_api0\pyarrow_cython_example.cp35-win_amd64.pyd : fatal error LNK1120: 1 unresolved externals ``` (from https://ci.appveyor.com/project/pitrou/arrow/build/1.0.60/job/aruj4pno67s4xpcf#L6242 ) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357337#comment-16357337 ] Wes McKinney commented on ARROW-2113: - [~michal.danko] as far as I understand the issue, this does not have to do with pyarrow in particular; it is a problem with the system configuration for using libhdfs, which is out of our control. We are loading {{libjvm}} and {{libhdfs}} at runtime and leaving it to {{libhdfs}} to initialize the JVM and load the relevant HDFS client JARs; it is evidently having some trouble with the {{CLASSPATH}}. You should be able to reproduce the issue from a standalone C program that uses libhdfs to connect to the cluster. Could you perhaps seek counsel from the Apache Hadoop community? > [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS > connection failed" > > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for
java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2113) [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed"
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357341#comment-16357341 ] Wes McKinney commented on ARROW-2113: - I actually just remembered that we are setting that classpath from the output of {{hadoop --classpath}}, see: https://github.com/apache/arrow/blob/master/python/pyarrow/hdfs.py#L116 So the reason that this is failing in the first instance is that {{hadoop}} is in the path, whereas in the second, it is setting the correct classpath. Either way the CLASSPATH you have set does not appear to have the requisite JAR files. It seems we should be more specific about detecting that Hadoop JARs are in the path. I will open a new bug report about this. > [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS > connection failed" > > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for
java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
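A stricter detection along the lines Wes suggests might look like the following sketch. The function name and the hadoop-common heuristic are illustrative assumptions, not pyarrow's actual implementation:

```python
import os

def classpath_has_hadoop_jars(classpath):
    """Hypothetical stricter check: require an actual Hadoop client JAR
    (e.g. hadoop-common-*.jar) rather than merely the substring 'hadoop'
    appearing somewhere in the CLASSPATH string."""
    entries = classpath.split(os.pathsep)
    return any(os.path.basename(entry).startswith("hadoop-common")
               for entry in entries)

# The reporter's CLASSPATH contains 'hadoop' only as a directory name,
# so it should be rejected:
assert not classpath_has_hadoop_jars("/tmp/test/lib/hadoop/lib/test.jar")
# An entry that is genuinely a Hadoop client JAR passes:
assert classpath_has_hadoop_jars("/opt/hadoop/share/hadoop-common-2.6.0.jar")
```

Matching on JAR file names rather than on path substrings avoids being fooled by directories that merely happen to be named "hadoop", as in the symlinked Cloudera layout from the report.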
[jira] [Updated] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2113: Summary: [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic (was: [Python] Connect to hdfs failing with "pyarrow.lib.ArrowIOError: HDFS connection failed") > [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the > classpath setting HDFS logic > - > > Key: ARROW-2113 > URL: https://issues.apache.org/jira/browse/ARROW-2113 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH > 5.13.1 >Reporter: Michal Danko >Priority: Major > Fix For: 0.9.0 > > > Steps to replicate the issue: > mkdir /tmp/test > cd /tmp/test > mkdir jars > cd jars > touch test1.jar > mkdir -p ../lib/zookeeper > cd ../lib/zookeeper > ln -s ../../jars/test1.jar ./test1.jar > ln -s test1.jar test.jar > mkdir -p ../hadoop/lib > cd ../hadoop/lib > ln -s ../../../lib/zookeeper/test.jar ./test.jar > (this part depends on your configuration you need those values for > pyarrow.hdfs to work: ) > (path to libjvm: ) > (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera) > (path to libhdfs: ) > (export > LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/) > export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Ends with error: > > loadFileSystems error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, > kerbTicketCachePath=(NULL), userName=pa) error: > (unable to get root cause for java.lang.NoClassDefFoundError) > (unable to get stack trace for java.lang.NoClassDefFoundError) > Traceback (most recent call last): ( > File "", line 1, in > File 
"/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 170, in connect > kerb_ticket=kerb_ticket, driver=driver) > File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line > 37, in __init__ > self._connect(host, port, user, kerb_ticket, driver) > File "pyarrow/io-hdfs.pxi", line 87, in > pyarrow.lib.HadoopFileSystem._connect > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673) > File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345) > pyarrow.lib.ArrowIOError: HDFS connection failed > - > > export CLASSPATH="/tmp/test/lib/zookeeper/test.jar" > python > import pyarrow.hdfs as hdfs; > fs = hdfs.connect(user="hdfs") > > Works properly. > > I can't find reason why first CLASSPATH doesn't work and second one does, > because it's path to same .jar, just with extra symlink in it. To me, it > looks like pyarrow.lib.check has problem with symlinks defined with many > ../.../.. . > I would expect that pyarrow would work with any definition of path to .jar > Please notice that path are not generated at random, it is path copied from > Cloudera distribution of Hadoop (original file was zookeeper.jar), > Because of this issue, our customer currently can't use pyarrow lib for oozie > workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2113: Fix Version/s: 0.9.0
> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the
> classpath setting HDFS logic
> -
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 5.13.1
> Reporter: Michal Danko
> Priority: Major
> Fix For: 0.9.0
>
> Steps to replicate the issue:
> mkdir /tmp/test
> cd /tmp/test
> mkdir jars
> cd jars
> touch test1.jar
> mkdir -p ../lib/zookeeper
> cd ../lib/zookeeper
> ln -s ../../jars/test1.jar ./test1.jar
> ln -s test1.jar test.jar
> mkdir -p ../hadoop/lib
> cd ../hadoop/lib
> ln -s ../../../lib/zookeeper/test.jar ./test.jar
> (This part depends on your configuration; you need these values for pyarrow.hdfs to work:)
> (path to libjvm:)
> (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
> (path to libhdfs:)
> (export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Ends with error:
>
> loadFileSystems error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, kerbTicketCachePath=(NULL), userName=pa) error:
> (unable to get root cause for java.lang.NoClassDefFoundError)
> (unable to get stack trace for java.lang.NoClassDefFoundError)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 170, in connect
> kerb_ticket=kerb_ticket, driver=driver)
> File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 37, in __init__
> self._connect(host, port, user, kerb_ticket, driver)
> File "pyarrow/io-hdfs.pxi", line 87, in pyarrow.lib.HadoopFileSystem._connect (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
> File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
> pyarrow.lib.ArrowIOError: HDFS connection failed
> -
>
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
> python
> import pyarrow.hdfs as hdfs
> fs = hdfs.connect(user="hdfs")
>
> Works properly.
>
> I can't find a reason why the first CLASSPATH doesn't work and the second one does, because it is a path to the same .jar, just with an extra symlink in it. To me, it looks like pyarrow.lib.check_status has a problem with symlinks defined with many ../../.. components.
> I would expect pyarrow to work with any form of path to the .jar.
> Note that these paths are not made up at random; they were copied from a Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow library for Oozie workflows. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
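The renamed title points at the root cause: the classpath-detection heuristic treats any CLASSPATH containing the substring "hadoop" as already complete. The sketch below is a hypothetical reconstruction of that heuristic (the real pyarrow code differs in detail); it shows how a jar that merely lives under a hadoop/ directory defeats a substring check:

```python
def classpath_seems_complete(classpath):
    # Hypothetical sketch of the heuristic: if "hadoop" appears anywhere
    # in CLASSPATH, assume the Hadoop jars are already listed and skip
    # assembling a full classpath (e.g. via `hadoop classpath --glob`).
    return "hadoop" in classpath

# A lone jar that merely lives under a directory named "hadoop" passes,
# so the incomplete CLASSPATH is used as-is and the JVM later fails
# with NoClassDefFoundError:
assert classpath_seems_complete("/tmp/test/lib/hadoop/lib/test.jar")

# The identical jar reached through the zookeeper path fails the check,
# so the fallback logic builds a working classpath:
assert not classpath_seems_complete("/tmp/test/lib/zookeeper/test.jar")
```

A sturdier check might split CLASSPATH on os.pathsep and look for actual Hadoop jar entries rather than matching a substring anywhere in the string.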
[jira] [Commented] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic
[ https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357345#comment-16357345 ] Wes McKinney commented on ARROW-2113: - I renamed this JIRA to reflect the issue. If someone could submit a patch that would be very helpful
> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the
> classpath setting HDFS logic
> -
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Michal Danko
> Priority: Major
> Fix For: 0.9.0
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357411#comment-16357411 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#discussion_r167034173
## File path: cpp/src/arrow/python/arrow_to_pandas.cc
## @@ -502,18 +502,20 @@ template <typename ArrowType>
 inline Status ConvertListsLike(PandasOptions options, const std::shared_ptr<Column>& col,
                                PyObject** out_values) {
   const ChunkedArray& data = *col->data().get();
-  auto list_type = std::static_pointer_cast<ListType>(col->type());
+  const auto& list_type = static_cast<const ListType&>(*col->type());
   // Get column of underlying value arrays
   std::vector<std::shared_ptr<Array>> value_arrays;
   for (int c = 0; c < data.num_chunks(); c++) {
-    auto arr = std::static_pointer_cast<ListArray>(data.chunk(c));
-    value_arrays.emplace_back(arr->values());
+    const auto& arr = static_cast<const ListArray&>(*data.chunk(c));
+    value_arrays.emplace_back(arr.values());
   }
-  auto flat_column = std::make_shared<Column>(list_type->value_field(), value_arrays);
+  auto flat_column = std::make_shared<Column>(list_type.value_field(), value_arrays);
   // TODO(ARROW-489): Currently we don't have a Python reference for single columns.
   //                  Storing a reference to the whole Array would be too expensive.
-  PyObject* numpy_array;
+  OwnedRef owned_numpy_array;
Review comment: Yep, thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [Python] Memory leak when converting Arrow tables with array columns to
> Pandas dataframes.
> --
>
> Key: ARROW-1973
> URL: https://issues.apache.org/jira/browse/ARROW-1973
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.8.0
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
> Reporter: Alexey Strokach
> Assignee: Phillip Cloud
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> There appears to be a memory leak when using PyArrow to convert tables containing array columns to Pandas DataFrames.
> See the `test_memory_leak.py` example here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
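The fix above swaps a raw `PyObject*` for an `OwnedRef`, an RAII holder that releases the creator's reference however the function exits. The toy model below (illustrative Python with explicit refcounts, not Arrow or CPython code) shows the leak pattern such a change removes: the container takes its own reference, and the creator's original reference must still be released.

```python
class PyObj:
    """Toy object with a manual refcount, mimicking the CPython C API,
    where creating an object returns a new reference (refcount 1)."""
    def __init__(self):
        self.refcount = 1

def list_append(lst, obj):
    # Like PyList_Append: the container takes its own reference.
    obj.refcount += 1
    lst.append(obj)

def convert_leaky(n):
    # Leak pattern: each created object's original reference is never
    # released, so every element ends up one count too high.
    out = []
    for _ in range(n):
        obj = PyObj()
        list_append(out, obj)
    return out

def convert_fixed(n):
    # OwnedRef-style: the creator's reference is released when the owner
    # goes out of scope, leaving only the container's reference.
    out = []
    for _ in range(n):
        obj = PyObj()
        list_append(out, obj)
        obj.refcount -= 1
    return out

assert all(o.refcount == 2 for o in convert_leaky(3))  # one leaked count each
assert all(o.refcount == 1 for o in convert_fixed(3))  # balanced
```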
[jira] [Commented] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357422#comment-16357422 ] ASF GitHub Bot commented on ARROW-2114: --- wesm closed pull request #1577: ARROW-2114: [Python] Pull latest docker manylinux1 image [skip appveyor] URL: https://github.com/apache/arrow/pull/1577 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance. As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):
diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64
index 1ade9ab10..919a32be7 100644
--- a/python/manylinux1/Dockerfile-x86_64
+++ b/python/manylinux1/Dockerfile-x86_64
@@ -14,7 +14,7 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
-FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-2087
+FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:latest
 ADD arrow /arrow
 WORKDIR /arrow/cpp
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [Python] Pull latest docker manylinux1 image
>
> Key: ARROW-2114
> URL: https://issues.apache.org/jira/browse/ARROW-2114
> Project: Apache Arrow
> Issue Type: Task
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2114) [Python] Pull latest docker manylinux1 image
[ https://issues.apache.org/jira/browse/ARROW-2114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2114. - Resolution: Fixed Issue resolved by pull request 1577 [https://github.com/apache/arrow/pull/1577] > [Python] Pull latest docker manylinux1 image > > > Key: ARROW-2114 > URL: https://issues.apache.org/jira/browse/ARROW-2114 > Project: Apache Arrow > Issue Type: Task >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357436#comment-16357436 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037985
## File path: cpp/src/arrow/table-test.cc
## @@ -588,6 +588,101 @@ TEST_F(TestRecordBatch, Slice) {
   }
 }
+
+TEST_F(TestRecordBatch, AddColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array2});
+
+  const RecordBatch& batch = *batch3;
+  std::shared_ptr<RecordBatch> result;
+
+  // Negative tests with invalid index
+  Status status = batch.AddColumn(5, field1, array1->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+  status = batch.AddColumn(-1, field1, array1->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  // Negative test with wrong length
+  auto longer_col = MakeRandomArray<Int32Array>(length + 1);
+  status = batch.AddColumn(0, field1, longer_col->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  // Negative test with mismatch type
+  status = batch.AddColumn(0, field1, array2->data(), &result);
+  ASSERT_TRUE(status.IsInvalid());
+
+  ASSERT_OK(batch.AddColumn(0, field1, array1->data(), &result));
+  ASSERT_TRUE(result->Equals(*batch1));
+
+  ASSERT_OK(batch.AddColumn(1, field3, array3->data(), &result));
+  ASSERT_TRUE(result->Equals(*batch2));
+}
+
+TEST_F(TestRecordBatch, RemoveColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2, field3});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field1, field3});
+  auto schema4 = ::arrow::schema({field1, field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2, array3});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array1, array3});
+  auto batch4 = RecordBatch::Make(schema4, length, {array1, array2});
+
+  const RecordBatch& batch = *batch1;
+  std::shared_ptr<RecordBatch> result;
+
+  ASSERT_OK(batch.RemoveColumn(0, &result));
+  ASSERT_TRUE(result->Equals(*batch2));
+
+  ASSERT_OK(batch.RemoveColumn(1, &result));
+  ASSERT_TRUE(result->Equals(*batch3));
+
+  ASSERT_OK(batch.RemoveColumn(2, &result));
+  ASSERT_TRUE(result->Equals(*batch4));
Review comment: Add a test for removing an out of bounds index This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
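For intuition about the contract these tests pin down — `AddColumn` and `RemoveColumn` leave the receiver untouched and produce a new batch — here is a plain-Python model, a sketch that represents a batch as a list of (name, values) pairs rather than Arrow's actual API:

```python
def add_column(batch, i, name, values):
    # Valid insertion positions are 0..len(batch) inclusive; the result
    # is a new batch, the input is never mutated.
    if not 0 <= i <= len(batch):
        raise IndexError("invalid column index")
    return batch[:i] + [(name, list(values))] + batch[i:]

def remove_column(batch, i):
    if not 0 <= i < len(batch):
        raise IndexError("invalid column index")
    return batch[:i] + batch[i + 1:]

batch3 = [("f2", [5, 6, 7])]
batch1 = add_column(batch3, 0, "f1", [1, 2, 3])   # insert at the front
batch2 = add_column(batch3, 1, "f3", [8, 9, 10])  # append at len(batch)
assert [name for name, _ in batch1] == ["f1", "f2"]
assert [name for name, _ in batch2] == ["f2", "f3"]
assert batch3 == [("f2", [5, 6, 7])]              # receiver unchanged
assert remove_column(batch1, 0) == batch3
```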
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357437#comment-16357437 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037093
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
Review comment: Pass `Array` here instead, since that's more likely to be what the user has This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357434#comment-16357434 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037375
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (!field->type()->Equals(column->type)) {
+      std::stringstream ss;
+      ss << "Column data type " << field->type()->name()
+         << " does not match field data type " << column->type->name();
+      return Status::Invalid(ss.str());
+    }
+    if (column->length != num_rows_) {
+      std::stringstream ss;
+      ss << "Added column's length must match record batch's length. Expected length "
+         << num_rows_ << " but got length " << column->length;
+      return Status::Invalid(ss.str());
+    }
+
+    std::shared_ptr<Schema> new_schema;
+    RETURN_NOT_OK(schema_->AddField(i, field, &new_schema));
Review comment: We could leave the boundschecking above to `Schema::AddField` -- could you also check whether that function has the issues described above? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add add/remove field functions for RecordBatch > --- > > Key: ARROW-969 > URL: https://issues.apache.org/jira/browse/ARROW-969 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Analogous to the Table equivalents -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357431#comment-16357431 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167036590
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
Review comment: I think this should be `i > num_columns()`. This is also a bug in `SimpleTable::AddColumn`. Can you add a test where `i == num_columns()`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
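The off-by-one flagged here is easiest to see in isolation: an insertion index may range from 0 to num_columns inclusive (appending at the end), so the guard should reject i > num_columns, while the reviewed check `i > num_columns() + 1` admits one invalid position. A minimal sketch (illustrative Python, not Arrow code):

```python
def valid_insert_index(i, num_columns):
    # Inserting may target any gap between columns, including the end,
    # so the valid positions are 0..num_columns inclusive.
    return 0 <= i <= num_columns

def buggy_guard_accepts(i, num_columns):
    # The guard under review: `if (i < 0 || i > num_columns() + 1)`
    # rejects the index; otherwise it is accepted.
    return not (i < 0 or i > num_columns + 1)

# The buggy guard lets the first out-of-range position slip through:
assert buggy_guard_accepts(3, num_columns=2) and not valid_insert_index(3, 2)
# Appending at the end (i == num_columns) is legitimately allowed:
assert buggy_guard_accepts(2, num_columns=2) and valid_insert_index(2, 2)
```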
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357432#comment-16357432 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167036916
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
Review comment: I think these should both be `DCHECK`, since null would indicate a problem with application logic, so should be a "can't fail" This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357438#comment-16357438 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037896
## File path: cpp/src/arrow/table-test.cc
## @@ -588,6 +588,101 @@ TEST_F(TestRecordBatch, Slice) {
   }
 }
+
+TEST_F(TestRecordBatch, AddColumn) {
+  const int length = 10;
+
+  auto field1 = field("f1", int32());
+  auto field2 = field("f2", uint8());
+  auto field3 = field("f3", int16());
+
+  auto schema1 = ::arrow::schema({field1, field2});
+  auto schema2 = ::arrow::schema({field2, field3});
+  auto schema3 = ::arrow::schema({field2});
+
+  auto array1 = MakeRandomArray<Int32Array>(length);
+  auto array2 = MakeRandomArray<UInt8Array>(length);
+  auto array3 = MakeRandomArray<Int16Array>(length);
+
+  auto batch1 = RecordBatch::Make(schema1, length, {array1, array2});
+  auto batch2 = RecordBatch::Make(schema2, length, {array2, array3});
+  auto batch3 = RecordBatch::Make(schema3, length, {array2});
+
+  const RecordBatch& batch = *batch3;
+  std::shared_ptr<RecordBatch> result;
+
+  // Negative tests with invalid index
+  Status status = batch.AddColumn(5, field1, array1->data(), &result);
Review comment: Add a test for `batch.AddColumn(2, ...)` to address the edge case in the implementation. We probably need a corresponding test for `Table` (and maybe also `Schema`). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++/Python] Add add/remove field functions for RecordBatch > --- > > Key: ARROW-969 > URL: https://issues.apache.org/jira/browse/ARROW-969 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Analogous to the Table equivalents -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357435#comment-16357435 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167037500
## File path: cpp/src/arrow/record_batch.h
## @@ -96,6 +96,14 @@ class ARROW_EXPORT RecordBatch {
   /// \return an internal ArrayData object
   virtual std::shared_ptr<ArrayData> column_data(int i) const = 0;
+  /// \brief Add column to the record batch, producing a new RecordBatch
+  virtual Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                           const std::shared_ptr<ArrayData>& column,
+                           std::shared_ptr<RecordBatch>* out) const = 0;
+
+  /// \brief Remove column from the record batch, producing a new RecordBatch
+  virtual Status RemoveColumn(int i, std::shared_ptr<RecordBatch>* out) const = 0;
Review comment: Can you document the parameters for these functions? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-969) [C++/Python] Add add/remove field functions for RecordBatch
[ https://issues.apache.org/jira/browse/ARROW-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357439#comment-16357439 ] ASF GitHub Bot commented on ARROW-969: -- wesm commented on a change in pull request #1574: ARROW-969: [C++] Add add/remove field functions for RecordBatch URL: https://github.com/apache/arrow/pull/1574#discussion_r167038477
## File path: cpp/src/arrow/record_batch.cc
## @@ -78,6 +79,52 @@ class SimpleRecordBatch : public RecordBatch {
   std::shared_ptr<ArrayData> column_data(int i) const override { return columns_[i]; }
+  Status AddColumn(int i, const std::shared_ptr<Field>& field,
+                   const std::shared_ptr<ArrayData>& column,
+                   std::shared_ptr<RecordBatch>* out) const override {
+    if (i < 0 || i > num_columns() + 1) {
+      return Status::Invalid("Invalid column index");
+    }
+    if (field == nullptr) {
+      std::stringstream ss;
+      ss << "Field " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
+    if (column == nullptr) {
+      std::stringstream ss;
+      ss << "Column " << i << " was null";
+      return Status::Invalid(ss.str());
+    }
Review comment: I took a look at `SimpleTable::AddColumn`; there `col` is being null-checked -- I think that should also be a DCHECK This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
> [C++/Python] Add add/remove field functions for RecordBatch
> ---
>
> Key: ARROW-969
> URL: https://issues.apache.org/jira/browse/ARROW-969
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Analogous to the Table equivalents
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1942) [C++] Hash table specializations for small integers
[ https://issues.apache.org/jira/browse/ARROW-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357447#comment-16357447 ] ASF GitHub Bot commented on ARROW-1942: --- wesm commented on issue #1551: ARROW-1942: [C++] Hash table specializations for small integers URL: https://github.com/apache/arrow/pull/1551#issuecomment-364220281 @xuepanchen I made the functor changes. Can you add a benchmark for the 8-bit integer case? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Hash table specializations for small integers > --- > > Key: ARROW-1942 > URL: https://issues.apache.org/jira/browse/ARROW-1942 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Panchen Xue >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There is no need to use a dynamically-sized hash table with uint8, int8, > since a fixed-size lookup table can be used and avoid hashing altogether -- This message was sent by Atlassian JIRA (v7.6.3#76005)
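The idea behind ARROW-1942 — for 8-bit keys a fixed 256-entry table covers the whole key space, so no hashing, probing, or resizing is ever needed — can be sketched as follows (illustrative Python, not the C++ implementation under review):

```python
def dictionary_encode_uint8(values):
    # Slot v holds the dictionary index assigned to key v, or -1 if v
    # has not been seen yet. 256 slots cover every possible uint8 key,
    # so lookup is a direct index: no hash function, no collisions.
    table = [-1] * 256
    dictionary = []  # distinct values in first-seen order
    indices = []     # per-value dictionary index
    for v in values:
        if table[v] == -1:
            table[v] = len(dictionary)
            dictionary.append(v)
        indices.append(table[v])
    return dictionary, indices

dictionary, indices = dictionary_encode_uint8([7, 7, 200, 7, 0])
assert dictionary == [7, 200, 0]
assert indices == [0, 0, 1, 0, 2]
```

The same trick extends to int8 by offsetting the key into 0..255; beyond 16-bit keys the table grows too large and a real hash table wins again.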
[jira] [Commented] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357449#comment-16357449 ] Wes McKinney commented on ARROW-1501: - Cool, thanks > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Tracking JIRA for integration test-related issues -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357514#comment-16357514 ] ASF GitHub Bot commented on ARROW-1021: --- wesm commented on issue #1576: ARROW-1021: [Python] Add documentation for C++ pyarrow API URL: https://github.com/apache/arrow/pull/1576#issuecomment-364235254 I can take a look at the Windows issue (I have a machine to test on) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357576#comment-16357576 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364249569 This needed a clang-format This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Memory leak when converting Arrow tables with array columns to > Pandas dataframes. > -- > > Key: ARROW-1973 > URL: https://issues.apache.org/jira/browse/ARROW-1973 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > There appears to be a memory leak when using PyArrow to convert tables > containing array columns to Pandas DataFrames. > See the `test_memory_leak.py` example here: > https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357581#comment-16357581 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364250198 Hm, okay. I did run that. It's probably because I'm using clang 5
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357584#comment-16357584 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364250409 How do we decide when to upgrade? When it's released on ubuntu or some other slowish moving distro?
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357589#comment-16357589 ] ASF GitHub Bot commented on ARROW-1973: --- wesm commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364251400 It looks like LLVM 5 has been promoted to stable (according to http://apt.llvm.org/) so I think we should upgrade our pin to clang 5.0
[jira] [Created] (ARROW-2117) [C++] Pin clang to version 5.0
Phillip Cloud created ARROW-2117: Summary: [C++] Pin clang to version 5.0 Key: ARROW-2117 URL: https://issues.apache.org/jira/browse/ARROW-2117 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.9.0 Reporter: Phillip Cloud Assignee: Phillip Cloud Let's do this after the next release.
[jira] [Commented] (ARROW-1973) [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357616#comment-16357616 ] ASF GitHub Bot commented on ARROW-1973: --- cpcloud commented on issue #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes. URL: https://github.com/apache/arrow/pull/1578#issuecomment-364257181 Opened a JIRA for it: https://issues.apache.org/jira/browse/ARROW-2117
[jira] [Updated] (ARROW-987) [JS] Implement JSON writer for Integration tests
[ https://issues.apache.org/jira/browse/ARROW-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-987: Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Implement JSON writer for Integration tests > > > Key: ARROW-987 > URL: https://issues.apache.org/jira/browse/ARROW-987 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Brian Hulette >Priority: Major > Fix For: JS-0.3.0 > > > Rather than storing generated binary files in the repo, we could just run the > integration tests on the JS implementation.
[jira] [Updated] (ARROW-1501) [JS] JavaScript integration tests
[ https://issues.apache.org/jira/browse/ARROW-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1501: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] JavaScript integration tests > - > > Key: ARROW-1501 > URL: https://issues.apache.org/jira/browse/ARROW-1501 > Project: Apache Arrow > Issue Type: Improvement > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 > > > Tracking JIRA for integration test-related issues
[jira] [Updated] (ARROW-1870) [JS] Enable build scripts to work with NodeJS 6.10.2 LTS
[ https://issues.apache.org/jira/browse/ARROW-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1870: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Enable build scripts to work with NodeJS 6.10.2 LTS > > > Key: ARROW-1870 > URL: https://issues.apache.org/jira/browse/ARROW-1870 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 >
[jira] [Updated] (ARROW-2044) [JS] Typings should be a regular dependency
[ https://issues.apache.org/jira/browse/ARROW-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-2044: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Typings should be a regular dependency > --- > > Key: ARROW-2044 > URL: https://issues.apache.org/jira/browse/ARROW-2044 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Reporter: Brian Hulette >Priority: Minor > Labels: pull-request-available > Fix For: JS-0.3.0 > > > Currently some typings ({{@types/node}} and {{@types/flatbuffers}}) are > devDependencies rather than dependencies, which prevents {{.d.ts}} files from > being understood in downstream projects.
[jira] [Updated] (ARROW-1990) [JS] Add "DataFrame" object
[ https://issues.apache.org/jira/browse/ARROW-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1990: - Fix Version/s: (was: 0.9.0) JS-0.3.0 > [JS] Add "DataFrame" object > --- > > Key: ARROW-1990 > URL: https://issues.apache.org/jira/browse/ARROW-1990 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.3.0 > > > Add a TypeScript class that can perform optimized dataframe operations on an > arrow {{Table}} and/or {{StructVector}}. Initially this should include > operations like filtering, counting, and scanning. Eventually this class > could include more operations like sorting, count by/group by, etc...
[jira] [Updated] (ARROW-951) [JS] Add generated API documentation
[ https://issues.apache.org/jira/browse/ARROW-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-951: Fix Version/s: JS-0.3.0 > [JS] Add generated API documentation > > > Key: ARROW-951 > URL: https://issues.apache.org/jira/browse/ARROW-951 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Priority: Minor > Labels: documentation > Fix For: JS-0.3.0 > > > Maybe using http://typedoc.org ?
[jira] [Created] (ARROW-2118) [Python] Improve error message when calling parquet.read_table on an empty file
Wes McKinney created ARROW-2118: --- Summary: [Python] Improve error message when calling parquet.read_table on an empty file Key: ARROW-2118 URL: https://issues.apache.org/jira/browse/ARROW-2118 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.9.0 Currently it raises an exception about memory mapping failing
[jira] [Created] (ARROW-2119) Handle Arrow stream with zero record batch
Jingyuan Wang created ARROW-2119: Summary: Handle Arrow stream with zero record batch Key: ARROW-2119 URL: https://issues.apache.org/jira/browse/ARROW-2119 Project: Apache Arrow Issue Type: Bug Reporter: Jingyuan Wang

It looks like currently many places of the code assume that there needs to be at least one record batch for the streaming format. Is a zero-record-batch stream not supported by design? e.g. [https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java#L45]:

{code:none}
public static void convert(InputStream in, OutputStream out) throws IOException {
  BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
  try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
    VectorSchemaRoot root = reader.getVectorSchemaRoot();
    // load the first batch before instantiating the writer so that we have any dictionaries
    if (!reader.loadNextBatch()) {
      throw new IOException("Unable to read first record batch");
    }
    ...
{code}

Pyarrow 0.8.0 does not load a zero-record-batch stream either. It throws an exception originating from [https://github.com/apache/arrow/blob/a95465b8ce7a32feeaae3e13d0a64102ffa590d9/cpp/src/arrow/table.cc#L309]:

{code:none}
Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
                                std::shared_ptr<Table>* table) {
  if (batches.size() == 0) {
    return Status::Invalid("Must pass at least one record batch");
  }
  ...
{code}
[jira] [Updated] (ARROW-1918) [JS] Integration portion of verify-release-candidate.sh fails
[ https://issues.apache.org/jira/browse/ARROW-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Hulette updated ARROW-1918: - Fix Version/s: JS-0.3.0 > [JS] Integration portion of verify-release-candidate.sh fails > - > > Key: ARROW-1918 > URL: https://issues.apache.org/jira/browse/ARROW-1918 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.8.0 >Reporter: Wes McKinney >Priority: Major > Fix For: JS-0.3.0 > > > I'm going to temporarily disable this in my fixes in ARROW-1917
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357805#comment-16357805 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-344361239

This is currently a WIP; the Scala/Java tests are able to run. Left TODO:
- [x] Run PySpark tests
- [ ] Verify working with docker-compose and existing volumes in arrow/dev
- [x] Check why Zinc is unable to run in the mvn build; need to enable port 3030?
- [ ] Speed up the pyarrow build using the conda prefix as toolchain

> [Java] Add dockerized test setup to validate Spark integration
>
> Key: ARROW-1579
> URL: https://issues.apache.org/jira/browse/ARROW-1579
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Java - Vectors
> Reporter: Wes McKinney
> Assignee: Bryan Cutler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> cc [~bryanc] -- the goal of this will be to validate master-to-master to catch any regressions in the Spark integration
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357806#comment-16357806 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364306888 Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below..
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357808#comment-16357808 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364306888 Ok, I finally got this to build all and pass all tests! There are still a couple of issues to work out though, I'll discuss below.. Btw, to get the correct `pyarrow.__version__` from the dev env, you do need to have all git tags fetched and install `setuptools_scm` from pip or conda. @xhochy , `setuptools_scm` wasn't listed in any of the developer docs I could find, should it be added to the list of dependent packages for setting up a conda env?
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357810#comment-16357810 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on a change in pull request #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#discussion_r167120268

## File path: python/pyarrow/__init__.py ##
@@ -24,7 +24,7 @@
 # package is not installed
 try:
     import setuptools_scm
-    __version__ = setuptools_scm.get_version('../')
+    __version__ = setuptools_scm.get_version(root='../../', relative_to=__file__)

Review comment: @xhochy and @wesm , I needed to change this because it would only give a version if run under the ARROW_HOME/python directory. So when running Spark tests, importing pyarrow would return `None` for the version. Making it relative to `__file__` seemed to fix it for all cases. I can make this a separate JIRA if you think that would be better.
[jira] [Commented] (ARROW-1579) [Java] Add dockerized test setup to validate Spark integration
[ https://issues.apache.org/jira/browse/ARROW-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357819#comment-16357819 ] ASF GitHub Bot commented on ARROW-1579: --- BryanCutler commented on issue #1319: [WIP] ARROW-1579: [Java] Adding containerized Spark Integration tests URL: https://github.com/apache/arrow/pull/1319#issuecomment-364309234 @xhochy , I could not get Arrow C++ to build with `export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX`, I would get a linking error with gflags like "undefined reference google::FlagRegisterer::FlagRegisterer". I thought maybe it was because I wasn't using g++ 4.9, but I had no luck trying to get 4.9 installed since the base image I'm using is Ubuntu 16.04. Have you ever run into this? It seemed like it was some kind of template constructor that it couldn't find..
[jira] [Created] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
rip.nsk created ARROW-2120: -- Summary: Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties Key: ARROW-2120 URL: https://issues.apache.org/jira/browse/ARROW-2120 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: rip.nsk Assignee: rip.nsk
[jira] [Commented] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
[ https://issues.apache.org/jira/browse/ARROW-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357833#comment-16357833 ] ASF GitHub Bot commented on ARROW-2120: --- rip-nsk opened a new pull request #1580: ARROW-2120: [C++] Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties URL: https://github.com/apache/arrow/pull/1580 > Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties > - > > Key: ARROW-2120 > URL: https://issues.apache.org/jira/browse/ARROW-2120 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: rip.nsk >Assignee: rip.nsk >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (ARROW-2120) Add possibility to use empty _MSVC_STATIC_LIB_SUFFIX for Thirdparties
[ https://issues.apache.org/jira/browse/ARROW-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2120: -- Labels: pull-request-available (was: )
[jira] [Created] (ARROW-2121) Consider special casing object arrays in pandas serializers.
Robert Nishihara created ARROW-2121: --- Summary: Consider special casing object arrays in pandas serializers. Key: ARROW-2121 URL: https://issues.apache.org/jira/browse/ARROW-2121 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Robert Nishihara
[jira] [Updated] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2121: -- Labels: pull-request-available (was: ) > Consider special casing object arrays in pandas serializers. > > > Key: ARROW-2121 > URL: https://issues.apache.org/jira/browse/ARROW-2121 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Robert Nishihara >Priority: Major > Labels: pull-request-available >
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357951#comment-16357951 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara opened a new pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581

The goal here is to get the best of both the `pandas_serialization_context` (speed at serializing pandas dataframes containing strings and other objects) and the `default_serialization_context` (correctly serializing a large class of numpy object arrays).

This PR sort of messes up the function `pa.pandas_compat.dataframe_to_serialized_dict`. Is that function just a helper for implementing the custom pandas serializers, or is it intended to be used in other places?

TODO in this PR (assuming you think this approach is reasonable):
- [ ] remove `pandas_serialization_context`
- [ ] make sure this code path is tested
- [ ] double check that performance is good

cc @wesm @pcmoritz @devin-petersohn
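The premise of the special case is that pickle round-trips arbitrary object ndarrays that Arrow has no single column type for; a standalone illustration of that premise (not the PR's actual code path):

```python
import pickle

import numpy as np

# An object ndarray mixing types that no single Arrow column type can store
values = np.array([{"a": 1}, [1, 2], "text"], dtype=object)

# Pickling the raw values preserves them exactly
restored = pickle.loads(pickle.dumps(values))
```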
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357967#comment-16357967 ] ASF GitHub Bot commented on ARROW-2121: --- wesm commented on issue #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#issuecomment-364344672 Well, we need to preserve the zero-copy pandas reads. Now that our ASV benchmarking setup has been rehabilitated we should be able to do that in this patch to verify performance
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357976#comment-16357976 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara commented on a change in pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#discussion_r167148817

## File path: python/pyarrow/pandas_compat.py ##
@@ -421,11 +421,16 @@ def dataframe_to_serialized_dict(frame):
         block_data.update(dictionary=values.categories,
                           ordered=values.ordered)
         values = values.codes

         block_data.update(
             placement=block.mgr_locs.as_array,
             block=values
         )
+
+        # If we are dealing with an object array, pickle it instead.
+        if isinstance(block, _int.ObjectBlock):
+            block_data['object'] = None
+            block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358038#comment-16358038 ] ASF GitHub Bot commented on ARROW-2121: --- robertnishihara commented on a change in pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer. URL: https://github.com/apache/arrow/pull/1581#discussion_r167156435

## File path: python/pyarrow/pandas_compat.py ##
@@ -421,11 +421,18 @@ def dataframe_to_serialized_dict(frame):
         block_data.update(dictionary=values.categories,
                           ordered=values.ordered)
         values = values.codes

         block_data.update(
             placement=block.mgr_locs.as_array,
             block=values
         )
+
+        # If we are dealing with an object array, pickle it instead. Note that
+        # we do not use isinstance here because _int.CategoricalBlock is a
+        # subclass of _int.ObjectBlock.
+        if type(block) == _int.ObjectBlock:
+            block_data['object'] = None
+            block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
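The exact-type check in this diff matters because `isinstance` also matches subclasses; a minimal illustration with stand-in classes (hypothetical names mirroring the pandas internals mentioned in the comment):

```python
class ObjectBlock:
    pass

class CategoricalBlock(ObjectBlock):
    # mirrors pandas, where CategoricalBlock subclasses ObjectBlock
    pass

block = CategoricalBlock()

# isinstance matches the subclass, so categorical data would be pickled too
assert isinstance(block, ObjectBlock)

# an exact type check selects only true object blocks
assert type(block) is not ObjectBlock
assert type(ObjectBlock()) is ObjectBlock
```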
[jira] [Created] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.
Robert Nishihara created ARROW-2122:
---

Summary: Pyarrow fails to serialize dataframe with timestamp.
Key: ARROW-2122
URL: https://issues.apache.org/jira/browse/ARROW-2122
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Robert Nishihara

The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd
s = pa.serialize({code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-2122) Pyarrow fails to serialize dataframe with timestamp.
[ https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Nishihara updated ARROW-2122:
---

Description:
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd

df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]})
s = pa.serialize(df).to_buffer()
new_df = pa.deserialize(s)  # this fails{code}
The last line fails with
{code:java}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "serialization.pxi", line 441, in pyarrow.lib.deserialize
  File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
  File "serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
  File "serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in _deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in <listcomp>
    for block in data['blocks']]
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in _reconstruct_block
    dtype = _make_datetimetz(item['timezone'])
  File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in _make_datetimetz
    return DatetimeTZDtype('ns', tz=tz)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", line 409, in __new__
    raise ValueError("DatetimeTZDtype constructor must have a tz "
ValueError: DatetimeTZDtype constructor must have a tz supplied{code}

was:
The bug can be reproduced as follows.
{code:java}
import pyarrow as pa
import pandas as pd
s = pa.serialize({code}

> Pyarrow fails to serialize dataframe with timestamp.
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Robert Nishihara
> Priority: Major
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]})
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
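The traceback above shows `_reconstruct_block` passing a `None` timezone through to `DatetimeTZDtype`, which rejects it. A toy sketch of the failure pattern and the kind of guard that avoids it; the function and field names below are illustrative stand-ins, not pyarrow's actual code or its eventual fix:

```python
def make_datetimetz(tz):
    # Stand-in for pandas' DatetimeTZDtype: a tz of None is rejected.
    if tz is None:
        raise ValueError("DatetimeTZDtype constructor must have a tz supplied")
    return ("datetime64[ns]", tz)

def reconstruct_block(item):
    # Guard: only build a tz-aware dtype when timezone metadata is actually
    # present in the serialized block dict.
    tz = item.get("timezone")
    if tz is not None:
        return make_datetimetz(tz)
    return "datetime64[ns]"  # naive datetime dtype, no timezone

# A block serialized without timezone metadata no longer blows up:
assert reconstruct_block({"timezone": None}) == "datetime64[ns]"
assert reconstruct_block({"timezone": "Europe/Paris"}) == ("datetime64[ns]", "Europe/Paris")
```

The real question for the fix is upstream of the guard: why the serialized block carried `timezone: None` for a tz-aware column in the first place.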