[jira] [Updated] (ARROW-1237) [JAVA] Expose the ability to set lastSet
[ https://issues.apache.org/jira/browse/ARROW-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SIDDHARTH TEOTIA updated ARROW-1237: Summary: [JAVA] Expose the ability to set lastSet (was: Expose the ability to set lastSet ) > [JAVA] Expose the ability to set lastSet > - > > Key: ARROW-1237 > URL: https://issues.apache.org/jira/browse/ARROW-1237 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Reporter: SIDDHARTH TEOTIA >Assignee: SIDDHARTH TEOTIA >Priority: Minor > Fix For: 0.6.0 > > > Expose the ability to set lastSet on vectors such that > Mutator.setValueCount() doesn't blow away the vector if we have previously > loaded it. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-276) [JAVA] Nullable Value Vectors should extend BaseValueVector instead of BaseDataValueVector
[ https://issues.apache.org/jira/browse/ARROW-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SIDDHARTH TEOTIA updated ARROW-276: --- Summary: [JAVA] Nullable Value Vectors should extend BaseValueVector instead of BaseDataValueVector (was: [JAVA] ) > [JAVA] Nullable Value Vectors should extend BaseValueVector instead of > BaseDataValueVector > -- > > Key: ARROW-276 > URL: https://issues.apache.org/jira/browse/ARROW-276 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Reporter: Julien Le Dem >Assignee: SIDDHARTH TEOTIA > Fix For: 0.6.0 > > > Currently Nullable Vectors have an unused data vector because of this. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (ARROW-886) VariableLengthVectors don't reAlloc offsets
[ https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112039#comment-16112039 ] SIDDHARTH TEOTIA edited comment on ARROW-886 at 8/3/17 1:39 AM: [~elahrvivaz], As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are you fine with undoing the changes? (1) Don't explicitly reallocate the offsetVector in realloc() function of Variable Length Vectors. (2) Doing (1) will break the unit test added as part of PR 591 so we need to remove that as well. I have created a PR for the above two items -- https://github.com/apache/arrow/pull/937 this basically reverts your change for the above items. Thanks, Siddharth was (Author: siddteotia): [~elahrvivaz], As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are you fine with undoing the changes: (1) Don't explicitly reallocate the offsetVector in realloc() function of Variable Length Vectors. (2) Doing (1) will break the unit test added as part of PR 591 so we need to remove that as well. I have created a PR for the above two items -- https://github.com/apache/arrow/pull/937 this basically reverts your change for the above items. Thanks, Siddharth > VariableLengthVectors don't reAlloc offsets > --- > > Key: ARROW-886 > URL: https://issues.apache.org/jira/browse/ARROW-886 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Affects Versions: 0.3.0 >Reporter: Emilio Lahr-Vivaz >Assignee: Emilio Lahr-Vivaz > Fix For: 0.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1249) [JAVA] Expose the fillEmpties function from NullableVector.mutator
[ https://issues.apache.org/jira/browse/ARROW-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SIDDHARTH TEOTIA updated ARROW-1249: Summary: [JAVA] Expose the fillEmpties function from NullableVector.mutator (was: Expose the fillEmpties function from NullableVector.mutator) > [JAVA] Expose the fillEmpties function from NullableVector.mutator > - > > Key: ARROW-1249 > URL: https://issues.apache.org/jira/browse/ARROW-1249 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Reporter: SIDDHARTH TEOTIA >Assignee: SIDDHARTH TEOTIA >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1310) [JAVA] Revert ARROW-886
[ https://issues.apache.org/jira/browse/ARROW-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SIDDHARTH TEOTIA updated ARROW-1310: Summary: [JAVA] Revert ARROW-886 (was: Revert ARROW-886) > [JAVA] Revert ARROW-886 > --- > > Key: ARROW-1310 > URL: https://issues.apache.org/jira/browse/ARROW-1310 > Project: Apache Arrow > Issue Type: Bug >Reporter: SIDDHARTH TEOTIA >Assignee: SIDDHARTH TEOTIA > > We don't need to reallocate the underlying offsetVector every time a variable > length vector is reallocated. > Reallocation of the offsetVector is taken care of by the setSafe() function of the > offsetVector. > The setSafe() function of the Variable Length Vector will decide whether to > call realloc() or not. However, this should not decide whether the offsetVector > needs reallocation or not. When setSafe() calls offsetVector.setSafe(), the > latter can decide whether to reallocate the offset vector or not. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
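The division of responsibility described in ARROW-1310 can be illustrated with a small toy model. This is only a sketch in Python, not Arrow's Java code: the class names `OffsetBuffer` and `VarLenVector`, the doubling growth policy, and the realloc counters are all assumptions made for illustration. It assumes values are written at consecutive indices starting from 0.

```python
class OffsetBuffer:
    """Toy stand-in for an offset vector (hypothetical, not Arrow's API)."""

    def __init__(self, capacity=4):
        self.data = [0] * capacity
        self.realloc_count = 0

    def set_safe(self, index, value):
        # Grow only when this particular write would not fit.
        while index >= len(self.data):
            self.data.extend([0] * len(self.data))  # double the capacity
            self.realloc_count += 1
        self.data[index] = value


class VarLenVector:
    """Toy variable-length vector: a byte buffer plus an offset buffer."""

    def __init__(self, data_capacity=8, offset_capacity=4):
        self.data = bytearray(data_capacity)
        self.offsets = OffsetBuffer(offset_capacity)
        self.data_realloc_count = 0

    def realloc(self):
        # Doubles the data buffer only; the offset buffer is left alone.
        self.data.extend(bytes(len(self.data)))
        self.data_realloc_count += 1

    def set_safe(self, index, value):
        # Assumes sequential writes, so offsets.data[index] is already set.
        start = self.offsets.data[index]
        end = start + len(value)
        while end > len(self.data):
            self.realloc()
        self.data[start:end] = value
        # The offset buffer decides for itself whether it has to grow.
        self.offsets.set_safe(index + 1, end)

    def get(self, index):
        start = self.offsets.data[index]
        end = self.offsets.data[index + 1]
        return bytes(self.data[start:end])
```

In this model, writing one large value grows the data buffer without touching the offset buffer, and the offset buffer grows later, on its own, once enough values have been written.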
[jira] [Updated] (ARROW-276) [JAVA]
[ https://issues.apache.org/jira/browse/ARROW-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SIDDHARTH TEOTIA updated ARROW-276: --- Summary: [JAVA] (was: Nullable Vectors should extend BaseValueVector and not BaseDataValueVector) > [JAVA] > --- > > Key: ARROW-276 > URL: https://issues.apache.org/jira/browse/ARROW-276 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Reporter: Julien Le Dem >Assignee: SIDDHARTH TEOTIA > Fix For: 0.6.0 > > > Currently Nullable Vectors have an unused data vector because of this. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets
[ https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112039#comment-16112039 ] SIDDHARTH TEOTIA commented on ARROW-886: [~elahrvivaz], As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are you fine with undoing the changes: (1) Don't explicitly reallocate the offsetVector in realloc() function of Variable Length Vectors. (2) Doing (1) will break the unit test added as part of PR 591 so we need to remove that as well. I have created a PR for the above two items -- https://github.com/apache/arrow/pull/937 this basically reverts your change for the above items. Thanks, Siddharth > VariableLengthVectors don't reAlloc offsets > --- > > Key: ARROW-886 > URL: https://issues.apache.org/jira/browse/ARROW-886 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Affects Versions: 0.3.0 >Reporter: Emilio Lahr-Vivaz >Assignee: Emilio Lahr-Vivaz > Fix For: 0.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1326) [Python] Fix Sphinx build in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111964#comment-16111964 ] Wes McKinney commented on ARROW-1326: - PR: https://github.com/apache/arrow/pull/936 > [Python] Fix Sphinx build in Travis CI > -- > > Key: ARROW-1326 > URL: https://issues.apache.org/jira/browse/ARROW-1326 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > This started happening at some point, but isn't failing the build: > https://travis-ci.org/apache/arrow/jobs/260259569 > {code} > running build_sphinx > creating /home/travis/build/apache/arrow/python/doc/_build > creating /home/travis/build/apache/arrow/python/doc/_build/doctrees > creating /home/travis/build/apache/arrow/python/doc/_build/html > Running Sphinx v1.6.3 > loading pickled environment... not yet created > [autosummary] generating autosummary for: api.rst, data.rst, development.rst, > filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, > memory.rst, pandas.rst, parquet.rst, plasma.rst > /home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56: > UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause > errors > /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40: > RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and > will be removed in Sphinx 1.7, please use docutils' instead. 
> RemovedInSphinx17Warning) > WARNING: [autosummary] failed to import 'pyarrow.Array': no module named > pyarrow.Array > WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named > pyarrow.ArrayValue > WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module > named pyarrow.BinaryArray > WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module > named pyarrow.BinaryValue > WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module > named pyarrow.BooleanArray > WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module > named pyarrow.BooleanValue > WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named > pyarrow.Buffer > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1295) [Plasma] Investigate test_plasma.py test failures in docker
[ https://issues.apache.org/jira/browse/ARROW-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1295: Fix Version/s: (was: 0.6.0) 0.7.0 > [Plasma] Investigate test_plasma.py test failures in docker > --- > > Key: ARROW-1295 > URL: https://issues.apache.org/jira/browse/ARROW-1295 > Project: Apache Arrow > Issue Type: Bug >Reporter: Philipp Moritz >Assignee: Philipp Moritz > Fix For: 0.7.0 > > > This happens in the manylinux build, see: > https://github.com/apache/arrow/pull/912 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1326) [Python] Fix Sphinx build in Travis CI
[ https://issues.apache.org/jira/browse/ARROW-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1326: --- Assignee: Wes McKinney > [Python] Fix Sphinx build in Travis CI > -- > > Key: ARROW-1326 > URL: https://issues.apache.org/jira/browse/ARROW-1326 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > This started happening at some point, but isn't failing the build: > https://travis-ci.org/apache/arrow/jobs/260259569 > {code} > running build_sphinx > creating /home/travis/build/apache/arrow/python/doc/_build > creating /home/travis/build/apache/arrow/python/doc/_build/doctrees > creating /home/travis/build/apache/arrow/python/doc/_build/html > Running Sphinx v1.6.3 > loading pickled environment... not yet created > [autosummary] generating autosummary for: api.rst, data.rst, development.rst, > filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, > memory.rst, pandas.rst, parquet.rst, plasma.rst > /home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56: > UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause > errors > /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40: > RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and > will be removed in Sphinx 1.7, please use docutils' instead. 
> RemovedInSphinx17Warning) > WARNING: [autosummary] failed to import 'pyarrow.Array': no module named > pyarrow.Array > WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named > pyarrow.ArrayValue > WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module > named pyarrow.BinaryArray > WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module > named pyarrow.BinaryValue > WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module > named pyarrow.BooleanArray > WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module > named pyarrow.BooleanValue > WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named > pyarrow.Buffer > > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-507) [C++/Python] Construct List container from offsets and values subarrays
[ https://issues.apache.org/jira/browse/ARROW-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-507: --- Fix Version/s: (was: 0.6.0) 0.7.0 > [C++/Python] Construct List container from offsets and values subarrays > --- > > Key: ARROW-507 > URL: https://issues.apache.org/jira/browse/ARROW-507 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.7.0 > > > This is the inverse operation from flattening a list type into its child > values (dropping the offsets) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
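As a rough illustration of the operation ARROW-507 asks for, the sketch below uses plain Python lists in place of Arrow arrays. The function names `flatten` and `from_offsets` are hypothetical, and null list entries are ignored entirely; it only shows that offsets plus child values are enough to rebuild the nested lists.

```python
def flatten(lists):
    """Split a list of lists into (offsets, values)."""
    offsets, values = [0], []
    for item in lists:
        values.extend(item)
        offsets.append(len(values))
    return offsets, values


def from_offsets(offsets, values):
    """Rebuild the list of lists from offsets and child values."""
    if not offsets or offsets[0] < 0 or offsets[-1] > len(values):
        raise ValueError("offsets do not fit the values array")
    result = []
    for start, end in zip(offsets, offsets[1:]):
        if end < start:
            raise ValueError("offsets must be non-decreasing")
        result.append(values[start:end])
    return result
```

With n + 1 offsets the result has n lists; an empty list shows up as two equal consecutive offsets.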
[jira] [Comment Edited] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations
[ https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953 ] Wes McKinney edited comment on ARROW-1302 at 8/3/17 12:06 AM: -- Would changing MAKE to CMAKE_MAKE_PROGRAM in our CMake scripts solve the problem? was (Author: wesmckinn): Would changing ${MAKE} to ${CMAKE_MAKE_PROGRAM} in our CMake scripts solve the problem? > C++: ${MAKE} variable not set sometimes on older MacOS installations > > > Key: ARROW-1302 > URL: https://issues.apache.org/jira/browse/ARROW-1302 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.5.0 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn > Fix For: 0.6.0 > > > If the variable is not set, we may need to use `find_program` to detect make: > https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations
[ https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953 ] Wes McKinney commented on ARROW-1302: - Would changing {{${MAKE}}} to {{${CMAKE_MAKE_PROGRAM}}} in our CMake scripts solve the problem? > C++: ${MAKE} variable not set sometimes on older MacOS installations > > > Key: ARROW-1302 > URL: https://issues.apache.org/jira/browse/ARROW-1302 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.5.0 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn > Fix For: 0.6.0 > > > If the variable is not set, we may need to use `find_program` to detect make: > https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations
[ https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953 ] Wes McKinney edited comment on ARROW-1302 at 8/3/17 12:06 AM: -- Would changing ${MAKE} to ${CMAKE_MAKE_PROGRAM} in our CMake scripts solve the problem? was (Author: wesmckinn): Would changing {{${MAKE}}} to {{${CMAKE_MAKE_PROGRAM}}} in our CMake scripts solve the problem? > C++: ${MAKE} variable not set sometimes on older MacOS installations > > > Key: ARROW-1302 > URL: https://issues.apache.org/jira/browse/ARROW-1302 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.5.0 >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn > Fix For: 0.6.0 > > > If the variable is not set, we may need to use `find_program` to detect make: > https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1116) [Python] Create single external GitHub repo building for building wheels for all platforms in one shot
[ https://issues.apache.org/jira/browse/ARROW-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1116: --- Assignee: Wes McKinney > [Python] Create single external GitHub repo building for building wheels for > all platforms in one shot > -- > > Key: ARROW-1116 > URL: https://issues.apache.org/jira/browse/ARROW-1116 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > * manylinux1 (for Linux) > * macOS > * Windows > We have all the machinery to do this, but we need to set things up to upload > to a single BinTray location > https://github.com/xhochy/pyarrow-macos-wheels > https://github.com/wesm/pyarrow-windows-wheels -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111935#comment-16111935 ] Wes McKinney commented on ARROW-1282: - Great, thank you. If you can provide a gdb backtrace from the hung process that would be helpful > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1272) [Python] Add script to arrow-dist to generate and upload manylinux1 Python wheels
[ https://issues.apache.org/jira/browse/ARROW-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1272: --- Assignee: Wes McKinney > [Python] Add script to arrow-dist to generate and upload manylinux1 Python > wheels > - > > Key: ARROW-1272 > URL: https://issues.apache.org/jira/browse/ARROW-1272 > Project: Apache Arrow > Issue Type: Task > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1307) [Python] Add pandas serialization section + Feather API to Sphinx docs
[ https://issues.apache.org/jira/browse/ARROW-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1307: Fix Version/s: (was: 0.6.0) 0.7.0 > [Python] Add pandas serialization section + Feather API to Sphinx docs > -- > > Key: ARROW-1307 > URL: https://issues.apache.org/jira/browse/ARROW-1307 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney > Fix For: 0.7.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects
[ https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1021: Fix Version/s: (was: 0.6.0) 0.7.0 > [Python] Add documentation about using pyarrow from other Cython and C++ > projects > - > > Key: ARROW-1021 > URL: https://issues.apache.org/jira/browse/ARROW-1021 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney > Fix For: 0.7.0 > > > Follow up work to ARROW-819, ARROW-714 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
[ https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1312. - Resolution: Fixed Issue resolved by pull request 935 [https://github.com/apache/arrow/pull/935] > [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved > - > > Key: ARROW-1312 > URL: https://issues.apache.org/jira/browse/ARROW-1312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is > not good for users -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1270) [Packaging] Add Python wheel build scripts for macOS to arrow-dist
[ https://issues.apache.org/jira/browse/ARROW-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1270: --- Assignee: Wes McKinney (was: Uwe L. Korn) > [Packaging] Add Python wheel build scripts for macOS to arrow-dist > -- > > Key: ARROW-1270 > URL: https://issues.apache.org/jira/browse/ARROW-1270 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1270) [Packaging] Add Python wheel build scripts for macOS to arrow-dist
[ https://issues.apache.org/jira/browse/ARROW-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111914#comment-16111914 ] Wes McKinney commented on ARROW-1270: - PR: https://github.com/apache/arrow-dist/pull/2 > [Packaging] Add Python wheel build scripts for macOS to arrow-dist > -- > > Key: ARROW-1270 > URL: https://issues.apache.org/jira/browse/ARROW-1270 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111903#comment-16111903 ] Chris Bartak commented on ARROW-1282: - I'm still seeing my issue with `pyarrow==0.5.0.post2`, must be something else. Certainly possible that it's unrelated to pyarrow, though I thought I had it pretty well isolated. I'll open a new issue if I can get it reproducible. Thanks for the quick upload! > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1326) [Python] Fix Sphinx build in Travis CI
Wes McKinney created ARROW-1326: --- Summary: [Python] Fix Sphinx build in Travis CI Key: ARROW-1326 URL: https://issues.apache.org/jira/browse/ARROW-1326 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Wes McKinney Fix For: 0.6.0 This started happening at some point, but isn't failing the build: https://travis-ci.org/apache/arrow/jobs/260259569 {code} running build_sphinx creating /home/travis/build/apache/arrow/python/doc/_build creating /home/travis/build/apache/arrow/python/doc/_build/doctrees creating /home/travis/build/apache/arrow/python/doc/_build/html Running Sphinx v1.6.3 loading pickled environment... not yet created [autosummary] generating autosummary for: api.rst, data.rst, development.rst, filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, memory.rst, pandas.rst, parquet.rst, plasma.rst /home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56: UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause errors /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40: RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and will be removed in Sphinx 1.7, please use docutils' instead. 
RemovedInSphinx17Warning) WARNING: [autosummary] failed to import 'pyarrow.Array': no module named pyarrow.Array WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named pyarrow.ArrayValue WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module named pyarrow.BinaryArray WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module named pyarrow.BinaryValue WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module named pyarrow.BooleanArray WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module named pyarrow.BooleanValue WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named pyarrow.Buffer {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1309) pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
[ https://issues.apache.org/jira/browse/ARROW-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111899#comment-16111899 ] Wes McKinney commented on ARROW-1309: - Thanks [~virtualluke]. Any chance you can show the input data that triggered this error? There should be a single column in the data frame that is causing the problem (it's getting passed to {{pyarrow.Array.from_pandas}}). If it's not possible to fix this immediately, we would definitely want to make the error message more informative than that. > pyarrow.lib.ArrowNotImplementedError: NotImplemented: null > -- > > Key: ARROW-1309 > URL: https://issues.apache.org/jira/browse/ARROW-1309 > Project: Apache Arrow > Issue Type: Bug > Environment: centos 7.3 >Reporter: Luke Higgins >Priority: Minor > Fix For: 0.6.0 > > > I have an avro file in hdfs that I am reading in using fastavro, converting > to a pandas dataframe and then trying to create an arrow table, and I get an > error: > >>> table=pyarrow.Table.from_pandas(my_dataframe) > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 746, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:34089) > File "pyarrow/table.pxi", line 346, in pyarrow.lib._dataframe_to_arrays > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:30476) > File "pyarrow/array.pxi", line 182, in pyarrow.lib.Array.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:22110) > File "pyarrow/error.pxi", line 66, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:7702) > pyarrow.lib.ArrowNotImplementedError: NotImplemented: null > The avro schema indeed has null fields possible. Is this not implemented? I > am using pyarrow 0.5.0. 
Also, for what I am doing I am not using pandas at > all, I just read in the avro and I have a list of dicts and really want to > write them to disk in parquet format and am utilizing these steps (which > isn't optimal but may be necessary without writing more code of my own). > thanks, > Luke -- This message was sent by Atlassian JIRA (v6.4.14#64029)
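One way to narrow down which column is involved, before sharing the input data, is to summarize the Python types that occur in each field of the list of dicts. The helper below is a hypothetical diagnostic using only the standard library; it does not call pyarrow. Treating a field that only ever holds None as the likely culprit is an assumption, not something confirmed in this thread.

```python
def summarize_columns(records):
    """Map each key to the set of value type names seen across records."""
    summary = {}
    for record in records:
        for key, value in record.items():
            summary.setdefault(key, set()).add(type(value).__name__)
    return summary


def all_null_columns(records):
    """Return the keys whose values are None in every record that has them."""
    summary = summarize_columns(records)
    return sorted(key for key, types in summary.items() if types == {"NoneType"})
```

Running `all_null_columns` on the records read from the avro file gives a short list of candidate fields to drop or cast, and `summarize_columns` also reveals fields that mix several types.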
[jira] [Updated] (ARROW-1309) pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
[ https://issues.apache.org/jira/browse/ARROW-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1309: Fix Version/s: 0.6.0 > pyarrow.lib.ArrowNotImplementedError: NotImplemented: null > -- > > Key: ARROW-1309 > URL: https://issues.apache.org/jira/browse/ARROW-1309 > Project: Apache Arrow > Issue Type: Bug > Environment: centos 7.3 >Reporter: Luke Higgins >Priority: Minor > Fix For: 0.6.0 > > > I have an avro file in hdfs that I am reading in using fastavro, converting > to a pandas dataframe and then trying to create an arrow table, and I get an > error: > >>> table=pyarrow.Table.from_pandas(my_dataframe) > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 746, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:34089) > File "pyarrow/table.pxi", line 346, in pyarrow.lib._dataframe_to_arrays > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:30476) > File "pyarrow/array.pxi", line 182, in pyarrow.lib.Array.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:22110) > File "pyarrow/error.pxi", line 66, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:7702) > pyarrow.lib.ArrowNotImplementedError: NotImplemented: null > The avro schema indeed has null fields possible. Is this not implemented? I > am using pyarrow 0.5.0. Also, for what I am doing I am not using pandas at > all, I just read in the avro and I have a list of dicts and really want to > write them to disk in parquet format and am utilizing these steps (which > isn't optimal but may be necessary without writing more code of my own). > thanks, > Luke -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1319) [Python] Add additional HDFS filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1319: Fix Version/s: 0.7.0 > [Python] Add additional HDFS filesystem methods > --- > > Key: ARROW-1319 > URL: https://issues.apache.org/jira/browse/ARROW-1319 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > Fix For: 0.7.0 > > > The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html > contains a wider set of file-system methods than arrow's python bindings. > These are probably simple to implement for arrow-hdfs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-1325) R language bindings for Arrow
[ https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Clark Fitzgerald reassigned ARROW-1325: --- Assignee: Clark Fitzgerald > R language bindings for Arrow > - > > Key: ARROW-1325 > URL: https://issues.apache.org/jira/browse/ARROW-1325 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Clark Fitzgerald >Assignee: Clark Fitzgerald > > The R language was designed to perform "Columnar in memory analytics". The > Arrow format could provide better compatibility between R and other big data > systems, as well as portable and efficient IO via Parquet. > Feather provides a starting point: > [https://github.com/wesm/feather/tree/master/R]. > This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1325) R language bindings for Arrow
[ https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111881#comment-16111881 ] Clark Fitzgerald commented on ARROW-1325: - Thanks! > R language bindings for Arrow > - > > Key: ARROW-1325 > URL: https://issues.apache.org/jira/browse/ARROW-1325 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Clark Fitzgerald > > The R language was designed to perform "Columnar in memory analytics". The > Arrow format could provide better compatibility between R and other big data > systems, as well as portable and efficient IO via Parquet. > Feather provides a starting point: > [https://github.com/wesm/feather/tree/master/R]. > This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-835) [Format] Add Timedelta type to describe time intervals
[ https://issues.apache.org/jira/browse/ARROW-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-835: --- Fix Version/s: (was: 0.6.0) 0.7.0 > [Format] Add Timedelta type to describe time intervals > -- > > Key: ARROW-835 > URL: https://issues.apache.org/jira/browse/ARROW-835 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jeff Reback >Assignee: Wes McKinney >Priority: Minor > Fix For: 0.7.0 > > > xref https://github.com/apache/arrow/pull/551 and > https://github.com/apache/arrow/pull/551#issuecomment-294325969 > this will allow round-tripping of pandas ``Timedelta`` and numpy > ``timedelta64[ns]`` types. It will have a similar TimeUnit to TimestampType > (s, us, ms, ns). Possible implementations include making this pure 64-bit. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
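To make the "pure 64-bit" idea above concrete, here is a minimal sketch of a timedelta stored as a signed 64-bit tick count plus a unit. Everything in it is an assumption for illustration: the names `TimeUnit` and `convert_timedelta`, the choice to truncate toward zero, and the overflow check are not taken from the Arrow format or from the linked pull request.

```python
from enum import Enum

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


class TimeUnit(Enum):
    """Units named in the issue; each value is the number of ticks per second."""
    S = 1
    MS = 1_000
    US = 1_000_000
    NS = 1_000_000_000


def convert_timedelta(value, from_unit, to_unit):
    """Convert an integer tick count between units, truncating toward zero."""
    numerator = value * to_unit.value
    quotient = abs(numerator) // from_unit.value
    result = quotient if numerator >= 0 else -quotient
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("result does not fit in a signed 64-bit integer")
    return result


print(convert_timedelta(90, TimeUnit.S, TimeUnit.NS))
print(convert_timedelta(1_500, TimeUnit.MS, TimeUnit.S))
```

The overflow check shows the main trade-off of a fixed 64-bit representation: the finer the unit, the shorter the range of intervals that can be represented.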
[jira] [Resolved] (ARROW-1323) [GLib] Add garrow_boolean_array_get_values()
[ https://issues.apache.org/jira/browse/ARROW-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1323. - Resolution: Fixed Issue resolved by pull request 934 [https://github.com/apache/arrow/pull/934] > [GLib] Add garrow_boolean_array_get_values() > > > Key: ARROW-1323 > URL: https://issues.apache.org/jira/browse/ARROW-1323 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1315) [GLib] Status check of arrow::ArrayBuilder::Finish() is missing
[ https://issues.apache.org/jira/browse/ARROW-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1315. - Resolution: Fixed Issue resolved by pull request 933 [https://github.com/apache/arrow/pull/933] > [GLib] Status check of arrow::ArrayBuilder::Finish() is missing > --- > > Key: ARROW-1315 > URL: https://issues.apache.org/jira/browse/ARROW-1315 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
[ https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111837#comment-16111837 ] Wes McKinney commented on ARROW-1312: - PR: https://github.com/apache/arrow/pull/935 > [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved > - > > Key: ARROW-1312 > URL: https://issues.apache.org/jira/browse/ARROW-1312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is > not good for users -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111836#comment-16111836 ] Wes McKinney commented on ARROW-1311: - Cool, thank you! And very sorry about the trouble. We would have learned about these problems with jemalloc earlier but we only made it the default allocator in 0.5.0 so it's good to know so we can work with the jemalloc developers to figure out what's wrong > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1325) R language bindings for Arrow
[ https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111832#comment-16111832 ] Wes McKinney commented on ARROW-1325: - Done. I also made you a Contributor so you can assign yourself issues in JIRA > R language bindings for Arrow > - > > Key: ARROW-1325 > URL: https://issues.apache.org/jira/browse/ARROW-1325 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Clark Fitzgerald > > The R language was designed to perform "Columnar in memory analytics". The > Arrow format could provide better compatibility between R and other big data > systems, as well as portable and efficient IO via Parquet. > Feather provides a starting point: > [https://github.com/wesm/feather/tree/master/R]. > This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1325) R language bindings for Arrow
[ https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1325: Component/s: R > R language bindings for Arrow > - > > Key: ARROW-1325 > URL: https://issues.apache.org/jira/browse/ARROW-1325 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Clark Fitzgerald > > The R language was designed to perform "Columnar in memory analytics". The > Arrow format could provide better compatibility between R and other big data > systems, as well as portable and efficient IO via Parquet. > Feather provides a starting point: > [https://github.com/wesm/feather/tree/master/R]. > This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111831#comment-16111831 ] Keith Curtis commented on ARROW-1311: - I re-ran my script with pyarrow-0.5.0.post2; that seemed to fix it, and my script ran smoothly, converting 22 csv files to parquet format. Thanks! > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1325) R language bindings for Arrow
[ https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111819#comment-16111819 ] Clark Fitzgerald commented on ARROW-1325: - It would be nice to have an "R" component to categorize these issues. > R language bindings for Arrow > - > > Key: ARROW-1325 > URL: https://issues.apache.org/jira/browse/ARROW-1325 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Clark Fitzgerald > > The R language was designed to perform "Columnar in memory analytics". The > Arrow format could provide better compatibility between R and other big data > systems, as well as portable and efficient IO via Parquet. > Feather provides a starting point: > [https://github.com/wesm/feather/tree/master/R]. > This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111815#comment-16111815 ] Keith Curtis commented on ARROW-1311: - Ok, I'll re-try with post2 > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-989) [Python] Write pyarrow.Table to FileWriter or StreamWriter
[ https://issues.apache.org/jira/browse/ARROW-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-989: --- Fix Version/s: (was: 0.6.0) 0.7.0 > [Python] Write pyarrow.Table to FileWriter or StreamWriter > -- > > Key: ARROW-989 > URL: https://issues.apache.org/jira/browse/ARROW-989 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.7.0 > > > As part of this, we need to be able to get an iterator of record batches from > a table. We may want to write this iteration logic in C++ as it will be > generally useful. The chunking between columns may be different, so there is > some amount of complexity there -- This message was sent by Atlassian JIRA (v6.4.14#64029)
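The complexity mentioned above, where columns of one table are chunked differently, can be illustrated with a small pure-Python sketch. It models each column as a list of chunks (plain lists) and cuts a batch wherever any column has a chunk boundary, so every batch has equally long slices for all columns. The function name `aligned_batches` and the data model are hypothetical; this is not the pyarrow API, and a real implementation would slice buffers without copying instead of flattening them as done here.

```python
def aligned_batches(columns):
    """Yield dicts of equal-length slices, cut wherever any column has a chunk boundary.

    `columns` maps a column name to a list of chunks (plain lists here).
    """
    totals = {name: sum(len(chunk) for chunk in chunks) for name, chunks in columns.items()}
    if len(set(totals.values())) > 1:
        raise ValueError("all columns must have the same total length")

    # Collect the end offset of every chunk across all columns.
    boundaries = set()
    for chunks in columns.values():
        offset = 0
        for chunk in chunks:
            offset += len(chunk)
            boundaries.add(offset)
    boundaries.discard(0)  # leading zero-length chunks must not produce empty batches

    # Flatten each column once so slices can be taken by absolute offset.
    flat = {name: [v for chunk in chunks for v in chunk] for name, chunks in columns.items()}

    start = 0
    for end in sorted(boundaries):
        yield {name: values[start:end] for name, values in flat.items()}
        start = end


# Two columns with the same total length but different chunking.
table = {"a": [[1, 2, 3], [4, 5]], "b": [[10, 20], [30, 40, 50]]}
for batch in aligned_batches(table):
    print(batch)
```

Cutting at the union of all chunk boundaries is the simplest policy; it can produce many small batches, which is one reason a shared implementation in C++ with a configurable batch size may be preferable.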
[jira] [Assigned] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
[ https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1312: --- Assignee: Wes McKinney > [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved > - > > Key: ARROW-1312 > URL: https://issues.apache.org/jira/browse/ARROW-1312 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is > not good for users -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111813#comment-16111813 ] Keith Curtis commented on ARROW-1311: - Hi, I think I have the updated one: $ pip install --upgrade pyarrow==0.5.* Collecting pyarrow==0.5.* Downloading pyarrow-0.5.0.post1-cp35-cp35m-manylinux1_x86_64.whl (8.9MB) ... I re-ran my script, but python appeared to hang, and the stack trace looks similar: #0 je_spin_adaptive (spin=) at include/jemalloc/internal/spin.h:40 #1 chunk_dss_max_update (new_addr=) at src/chunk_dss.c:83 #2 je_chunk_alloc_dss (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, new_addr=new_addr@entry=0x7f6c3300, size=size@entry=8388608, alignment=alignment@entry=2097152, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk_dss.c:122 #3 0x7f6ca92bb02f in chunk_alloc_core (dss_prec=dss_prec_secondary, commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c3300, arena=0x7f6ca8800140, tsdn=0x7f6d609ab620) at src/chunk.c:357 #4 chunk_alloc_default_impl (commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, new_addr=0x7f6c3300, arena=0x7f6ca8800140, tsdn=0x7f6d609ab620) at src/chunk.c:430 #5 je_chunk_alloc_wrapper (tsdn=tsdn@entry=0x7f6d609ab620, arena=arena@entry=0x7f6ca8800140, chunk_hooks=chunk_hooks@entry=0x7fff45db97c0, new_addr=new_addr@entry=0x7f6c3300, size=size@entry=8388608, alignment=2097152, sn=sn@entry=0x7fff45db97b0, zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at src/chunk.c:490 ... 
> python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
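The thread distinguishes three builds with the same base version: 0.5.0, 0.5.0.post1 (still hanging, as reported above) and 0.5.0.post2. A script that wants to refuse the affected builds could compare the installed version string against 0.5.0.post2. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the parser handles only the `X.Y.Z[.postN]` form seen in this thread, and it is not a full PEP 440 implementation (the `packaging` library offers one).

```python
import re


def parse_version(version):
    """Split a version such as '0.5.0.post2' into ((0, 5, 0), 2); a missing post number counts as 0."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError("unsupported version string: %r" % version)
    release = tuple(int(part) for part in match.group(1).split("."))
    post = int(match.group(2)) if match.group(2) else 0
    return release, post


def has_rebuilt_binaries(version, minimum="0.5.0.post2"):
    """Return True if `version` is at least `minimum` (default taken from this thread)."""
    return parse_version(version) >= parse_version(minimum)


for candidate in ["0.5.0", "0.5.0.post1", "0.5.0.post2", "0.6.0"]:
    print(candidate, has_rebuilt_binaries(candidate))
```

In practice the string passed in would be `pyarrow.__version__`; it is left as a plain argument here so the sketch runs without pyarrow installed.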
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111811#comment-16111811 ] Wes McKinney commented on ARROW-1311: - Should be all set now with 0.5.0.post2 > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111798#comment-16111798 ] Wes McKinney commented on ARROW-1282: - I made a mistake in the build settings, will post a new set of binaries within a half hour or so. > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111796#comment-16111796 ] Wes McKinney commented on ARROW-1311: - Actually, I made a mistake in the build, and need to post another one, hang on for a few minutes. > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111793#comment-16111793 ] Wes McKinney edited comment on ARROW-1282 at 8/2/17 10:01 PM: -- [~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because PyPI does not allow new builds with the same version number). If you install with {{pip install pyarrow==0.5.*}} then this issue should go away. Please let me know if not was (Author: wesmckinn): [~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because PyPI does not allow new builds with the same version number. If you install with {{pip install pyarrow==0.5.*}} then this issue should go away. Please let me know if not > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111793#comment-16111793 ] Wes McKinney commented on ARROW-1282: - [~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because PyPI does not allow new builds with the same version number. If you install with {{pip install pyarrow==0.5.*}} then this issue should go away. Please let me know if not > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111787#comment-16111787 ] Keith Curtis edited comment on ARROW-1311 at 8/2/17 10:00 PM: -- I re-ran my code, and have a revised function, where I added a line to update the column, which seems to matter. def to_parquet(output_file, csv_file): df = pd.read_csv(csv_file) df['gecco_variant'] = [ v.lstrip('0') for v in df['gecco_variant']] table = pyarrow.Table.from_pandas(df) pq.write_table(table, output_file) When Python seemed hung (after 3 minutes with no progress), I captured a stack trace with gdb, and attached the file I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment using pip. was (Author: k94): I re-ran my code, and and have a revised function def to_parquet(output_file, csv_file): df = pd.read_csv(csv_file) df['gecco_variant'] = [ v.lstrip('0') for v in df['gecco_variant']] table = pyarrow.Table.from_pandas(df) pq.write_table(table, output_file) When Python seemed hung (after 3 minutes with no progress), I captured a stack trace with gdb, and attached the file I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment using pip. 
> python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
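Since the hang above was noticed only after waiting three minutes, a batch job converting many files might run each conversion in a child process with a time limit, so that a stuck allocation does not block the whole run. The following is a minimal sketch using only the standard library; the function name `run_with_timeout` is hypothetical, and the child command shown is a stand-in for whatever script performs one conversion.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run `command`; return True if it exits in time, False if it had to be killed."""
    try:
        # subprocess.run kills the child and re-raises when the timeout expires.
        subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return True


# Stand-in child process; a real job would invoke its own conversion script here.
finished = run_with_timeout([sys.executable, "-c", "print('converted')"], 30)
print("finished in time:", finished)
```

A timed-out child is also a convenient point to record which input file was being processed, which helps when filing a report like this one.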
[jira] [Resolved] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1311. - Resolution: Duplicate Assignee: Wes McKinney Same issue as ARROW-1282 > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis >Assignee: Wes McKinney > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111790#comment-16111790 ] Wes McKinney commented on ARROW-1311: - Thanks, indeed this is ARROW-1282. I'm in the process of updating 0.5.0 binaries to disable the jemalloc allocator. If you are using pip, can you try {{pip install pyarrow==0.5.*}} which should pull the {{0.5.0.post1}} updated build? If you are using conda, it will take me a little while to update the binaries on conda-forge. > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111787#comment-16111787 ] Keith Curtis commented on ARROW-1311: - I re-ran my code, and have a revised function def to_parquet(output_file, csv_file): df = pd.read_csv(csv_file) df['gecco_variant'] = [ v.lstrip('0') for v in df['gecco_variant']] table = pyarrow.Table.from_pandas(df) pq.write_table(table, output_file) When Python seemed hung (after 3 minutes with no progress), I captured a stack trace with gdb, and attached the file. I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment using pip. > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1325) R language bindings for Arrow
Clark Fitzgerald created ARROW-1325: --- Summary: R language bindings for Arrow Key: ARROW-1325 URL: https://issues.apache.org/jira/browse/ARROW-1325 Project: Apache Arrow Issue Type: New Feature Reporter: Clark Fitzgerald The R language was designed to perform "Columnar in memory analytics". The Arrow format could provide better compatibility between R and other big data systems, as well as portable and efficient IO via Parquet. Feather provides a starting point: [https://github.com/wesm/feather/tree/master/R]. This can serve as an umbrella JIRA for work on R related tasks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Keith Curtis updated ARROW-1311: Attachment: backtrace.txt Stack trace from gdb when Python appeared to be hung. > python hangs after write a few parquet tables > - > > Key: ARROW-1311 > URL: https://issues.apache.org/jira/browse/ARROW-1311 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.5.0 > Environment: Python 3.5.2, pyarrow 0.5.0 >Reporter: Keith Curtis > Fix For: 0.6.0 > > Attachments: backtrace.txt > > > I had a program to read some csv files (a few million rows each, 9 columns), > and converted with: > ```python > import os > import pandas as pd > import pyarrow.parquet as pq > import pyarrow > def to_parquet(output_file, csv_file): > df = pd.read_csv(csv_file) > table = pyarrow.Table.from_pandas(df) > pq.write_table(table, output_file) > ``` > The first csv file would always complete, but python would hang on the second > or third file, and sometimes on a much later file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage
[ https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111615#comment-16111615 ] Wes McKinney commented on ARROW-1292: - This work will be ongoing over the next couple releases > [C++/Python] Expand libhdfs feature coverage > > > Key: ARROW-1292 > URL: https://issues.apache.org/jira/browse/ARROW-1292 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.7.0 > > > Umbrella JIRA. Will create child issues for more granular tasks -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage
[ https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1292: Fix Version/s: (was: 0.6.0) 0.7.0 > [C++/Python] Expand libhdfs feature coverage > > > Key: ARROW-1292 > URL: https://issues.apache.org/jira/browse/ARROW-1292 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.7.0 > > > Umbrella JIRA. Will create child issues for more granular tasks -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111614#comment-16111614 ] Wes McKinney commented on ARROW-1282: - Moving this issue to 0.7.0 as it doesn't seem likely the underlying cause will be resolved in time for 0.6.0. I created ARROW-1312 to switch off the allocator by default to triage the situation. > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1282: Fix Version/s: (was: 0.6.0) 0.7.0 > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.7.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit
[ https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phillip Cloud reassigned ARROW-786:
    Assignee: Phillip Cloud

> [Format] In-memory format for 128-bit Decimals, handling of sign bit
>
> Key: ARROW-786
> URL: https://issues.apache.org/jira/browse/ARROW-786
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Format
> Reporter: Wes McKinney
> Assignee: Phillip Cloud
> Fix For: 0.7.0
>
> cc [~cpcloud]
> We found in ARROW-655 that we needed to add an extra bit for signedness for
> decimals stored as 128-bit values to be able to use the Boost multiprecision
> libraries. This makes Decimal128 not fit completely neatly as a 16-byte fixed
> size binary value, and more of a {{struct fixed_size_binary(16)>}}. What is the current format in the Java
> implementation? We will need to document the memory layout for decimals that
> maximizes compatibility across languages and eventually implement integration
> tests for IPC.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111541#comment-16111541 ] Wes McKinney commented on ARROW-1282: - OK. I'm going to get to work putting up patched 0.5.0 builds on PyPI and conda-forge since these issues persisting is not acceptable. We should still figure out what is happening in jemalloc to cause this but it may take a little while > Large memory reallocation by Arrow causes hang in jemalloc > -- > > Key: ARROW-1282 > URL: https://issues.apache.org/jira/browse/ARROW-1282 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jeff Knupp > Fix For: 0.6.0 > > > When reallocating a large amount of memory, Arrow is either triggering a bug > in jemalloc or has a bug itself in the memory manager (many different > applications reporting same issue but not clear from jemalloc issue > description if they're sure it's in jemalloc or caused by other issues like > using multiple memory allocation libraries in the same process, multithreaded > access, etc). > Link to stack trace is here: > https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef > Link to issue in jemalloc GitHub is here: > https://github.com/jemalloc/jemalloc/issues/802 > Originally observed in redis, discussed with jemalloc maintainer here: > https://github.com/antirez/redis/issues/3799 > *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version > 3.6.0 according to `apt` metadata.* -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc
[ https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111371#comment-16111371 ]

Chris Bartak commented on ARROW-1282:

I've run into a problem that I assume has to be this issue. Unfortunately, I can't quite get it down to a reproducible example, but I'll share the context in case it's helpful.

ec2 2017.03 Amazon Linux AMI (Red Hat 4.8.3-9)
python3.4
pyarrow==0.5 (from pip + deps)

Reading a very small parquet file, 3kb - two text columns. Interactively, it seems to always work. Serving a webapp with `httpd`/`mod_wsgi` and Flask that reads the same file - almost always (but not always!) it completely hangs. No spike in CPU/memory.

> Large memory reallocation by Arrow causes hang in jemalloc
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Jeff Knupp
> Fix For: 0.6.0
>
> When reallocating a large amount of memory, Arrow is either triggering a bug
> in jemalloc or has a bug itself in the memory manager (many different
> applications reporting same issue but not clear from jemalloc issue
> description if they're sure it's in jemalloc or caused by other issues like
> using multiple memory allocation libraries in the same process, multithreaded
> access, etc).
> Link to stack trace is here:
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here:
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here:
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version
> 3.6.0 according to `apt` metadata.*

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1319) [Python] Add additional HDFS filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111211#comment-16111211 ]

Martin Durant commented on ARROW-1319:

Methods that I don't think exist, and some that have different names (and maybe already aliased). Not all are "basic filesystem" operations.

- 'delegate_token'
- 'disconnect' (maybe not required)
- 'get' = download
- 'get_block_locations'
- 'getmerge' (many remote files to one local file)
- 'glob'
- 'head'
- 'makedirs'
- 'mv' = rename
- 'put' = upload
- 'read_block' (delimited read)
- 'renew_token'
- 'rm' = delete
- 'set_replication'
- 'tail'
- 'touch'

On files: readlines/iteration (maybe better with io.TextIOWrapper); flush?; not sure if all standard file methods are there (readable, read1...)

Methods implemented in unreleased hdfs3:

- 'cancel_token'
- 'concat' (limited to whole blocks for hadoop 1.6)
- 'create_encryption_zone'
- 'list_encryption_zones'

> [Python] Add additional HDFS filesystem methods
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Martin Durant
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html
> contains a wider set of file-system methods than arrow's python bindings.
> These are probably simple to implement for arrow-hdfs.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
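The suggestion in the comment above, that line iteration might be better handled with io.TextIOWrapper than with a dedicated readlines method, can be sketched roughly as follows. This is only an illustration: `iter_text_lines` is a hypothetical helper name, and an `io.BytesIO` buffer stands in for an open HDFS file handle, on the assumption that such a handle behaves like a readable binary stream.

```python
import io


def iter_text_lines(binary_file, encoding="utf-8"):
    """Yield decoded lines from a readable binary file-like object."""
    # TextIOWrapper layers decoding and newline handling over the byte stream.
    # Leaving the `with` block closes the wrapper and the underlying stream.
    with io.TextIOWrapper(binary_file, encoding=encoding) as text:
        for line in text:
            yield line.rstrip("\n")


# io.BytesIO is a stand-in for an HDFS file handle (an assumption of this sketch).
fake_hdfs_file = io.BytesIO(b"alpha\nbeta\ngamma\n")
lines = list(iter_text_lines(fake_hdfs_file))
```

If the real file object lacks some of the standard stream methods mentioned in the comment (readable, read1, and so on), the wrapper may not accept it, which is presumably why the comment asks whether they are all there.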
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1690#comment-1690 ]

Wes McKinney commented on ARROW-1311:

We could release patched builds on PyPI but there is the performance regression ARROW-1290. I may update 0.5.0 on conda-forge to include this patch and disable jemalloc for now

> python hangs after write a few parquet tables
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
> Reporter: Keith Curtis
> Fix For: 0.6.0
>
> I had a program to read some csv files (a few million rows each, 9 columns),
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second
> or third file, and sometimes on a much later file.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables
[ https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1656#comment-1656 ]

Uwe L. Korn commented on ARROW-1311:

[~wesmckinn] We should simply disable {{jemalloc}} by default until these problems have been resolved. I will try to reproduce locally and then talk to the jemalloc people to get it fixed upstream.

> python hangs after write a few parquet tables
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
> Reporter: Keith Curtis
> Fix For: 0.6.0
>
> I had a program to read some csv files (a few million rows each, 9 columns),
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second
> or third file, and sometimes on a much later file.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1211) [C++] Consider making default_memory_pool() the default for builder classes
[ https://issues.apache.org/jira/browse/ARROW-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1211. - Resolution: Fixed Resolved by https://github.com/apache/arrow/commit/ee928d2233da89ebd1f567ffda4833f4f07e795c > [C++] Consider making default_memory_pool() the default for builder classes > --- > > Key: ARROW-1211 > URL: https://issues.apache.org/jira/browse/ARROW-1211 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney > Fix For: 0.6.0 > > > To make this work, we would also need to make {{MemoryPool*}} the last > argument in some of the builder constructors. @xhochy what do you think? > see also ARROW-1210 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1305) [GLib] Add GArrowIntArrayBuilder
[ https://issues.apache.org/jira/browse/ARROW-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1305. - Resolution: Fixed Issue resolved by pull request 928 [https://github.com/apache/arrow/pull/928] > [GLib] Add GArrowIntArrayBuilder > > > Key: ARROW-1305 > URL: https://issues.apache.org/jira/browse/ARROW-1305 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1324) [C++] Support ExternalProject build of required Boost components on MSVC
Wes McKinney created ARROW-1324: --- Summary: [C++] Support ExternalProject build of required Boost components on MSVC Key: ARROW-1324 URL: https://issues.apache.org/jira/browse/ARROW-1324 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Follow up to ARROW-1303 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (ARROW-1303) [C++] Support downloading Boost
[ https://issues.apache.org/jira/browse/ARROW-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1303. - Resolution: Fixed Issue resolved by pull request 927 [https://github.com/apache/arrow/pull/927] > [C++] Support downloading Boost > --- > > Key: ARROW-1303 > URL: https://issues.apache.org/jira/browse/ARROW-1303 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Fix For: 0.6.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1262) [Packaging] Packaging automation in arrow-dist
[ https://issues.apache.org/jira/browse/ARROW-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1617#comment-1617 ] Wes McKinney commented on ARROW-1262: - Marked for 0.7.0. Don't think this will be completed in time for 0.6.0 > [Packaging] Packaging automation in arrow-dist > -- > > Key: ARROW-1262 > URL: https://issues.apache.org/jira/browse/ARROW-1262 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Wes McKinney > Fix For: 0.7.0 > > > This JIRA is an umbrella JIRA for tasks to streamline our binary builds at > release time as much as possible. We may also be able to set up nightly > builds for testing -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1262) [Packaging] Packaging automation in arrow-dist
[ https://issues.apache.org/jira/browse/ARROW-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1262: Fix Version/s: 0.7.0 > [Packaging] Packaging automation in arrow-dist > -- > > Key: ARROW-1262 > URL: https://issues.apache.org/jira/browse/ARROW-1262 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Reporter: Wes McKinney > Fix For: 0.7.0 > > > This JIRA is an umbrella JIRA for tasks to streamline our binary builds at > release time as much as possible. We may also be able to set up nightly > builds for testing -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-352) Interval(DAY_TIME) has no unit
[ https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1615#comment-1615 ]

Wes McKinney commented on ARROW-352:

Moving off 0.6.0 as this will require some discussion

> Interval(DAY_TIME) has no unit
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
> Issue Type: Bug
> Components: Format
> Reporter: Julien Le Dem
> Assignee: Wes McKinney
> Fix For: 0.7.0
>
> Interval(DAY_TIME) assumes milliseconds.
> We should have a time unit like timestamp.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-352) Interval(DAY_TIME) has no unit
[ https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-352:
    Fix Version/s: (was: 0.6.0) 0.7.0

> Interval(DAY_TIME) has no unit
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
> Issue Type: Bug
> Components: Format
> Reporter: Julien Le Dem
> Assignee: Wes McKinney
> Fix For: 0.7.0
>
> Interval(DAY_TIME) assumes milliseconds.
> We should have a time unit like timestamp.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1234) [Java] publishing nightly snapshot java artifacts to maven repo
[ https://issues.apache.org/jira/browse/ARROW-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1605#comment-1605 ] Wes McKinney commented on ARROW-1234: - I believe you need to be a PMC or Committer to set this up. > [Java] publishing nightly snapshot java artifacts to maven repo > --- > > Key: ARROW-1234 > URL: https://issues.apache.org/jira/browse/ARROW-1234 > Project: Apache Arrow > Issue Type: Improvement > Components: Java - Memory, Java - Vectors >Affects Versions: 0.5.0, 0.6.0, 1.0.0 > Environment: CI >Reporter: Antony Mayi > Attachments: arrow_development_deploy.xml > > > The [Snapshot > repository|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/] > doesn't seem to be getting any recent snapshot builds. Could this be > established for the sake of easier integration? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1234) [Java] publishing nightly snapshot java artifacts to maven repo
[ https://issues.apache.org/jira/browse/ARROW-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111099#comment-16111099 ] Li Jin commented on ARROW-1234: --- I was trying to figure out permission issues such as what account has permission to publish to [ASF repo|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/] and what account has permission to access ASF jenkins to set up the job. Maybe [~julienledem] can shed some light? > [Java] publishing nightly snapshot java artifacts to maven repo > --- > > Key: ARROW-1234 > URL: https://issues.apache.org/jira/browse/ARROW-1234 > Project: Apache Arrow > Issue Type: Improvement > Components: Java - Memory, Java - Vectors >Affects Versions: 0.5.0, 0.6.0, 1.0.0 > Environment: CI >Reporter: Antony Mayi > Attachments: arrow_development_deploy.xml > > > The [Snapshot > repository|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/] > doesn't seem to be getting any recent snapshot builds. Could this be > established for the sake of easier integration? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1317) [Python] Add function to set Hadoop CLASSPATH
[ https://issues.apache.org/jira/browse/ARROW-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111085#comment-16111085 ] Wes McKinney commented on ARROW-1317: - My understanding is that you can set {{CLASSPATH}} in {{os.environ}} prior to JNI bootstrap. A patch would be welcome > [Python] Add function to set Hadoop CLASSPATH > -- > > Key: ARROW-1317 > URL: https://issues.apache.org/jira/browse/ARROW-1317 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > > Getting access to hdfs via libhdfs requires the setting of several > environment variables. > Many of these paths should be auto-detectable requiring less or perhaps even > no information from the user. This would lower the access barrier to hdfs for > a non-dev user. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
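A minimal sketch of the approach mentioned in the comment above, setting CLASSPATH in os.environ before the JNI bootstrap, might look like the following. It is not the patch requested in this issue. The helper name `set_hadoop_classpath` is hypothetical, and the sketch assumes that the `hadoop` launcher is on PATH and supports `hadoop classpath --glob`. The command runner and the environment mapping are injectable so the example can run without a Hadoop install.

```python
import os
import subprocess


def set_hadoop_classpath(run=subprocess.check_output, environ=os.environ):
    """Set CLASSPATH from `hadoop classpath --glob` and return the value."""
    # Assumption: the `hadoop` launcher is on PATH and accepts `classpath --glob`.
    output = run(["hadoop", "classpath", "--glob"])
    classpath = output.decode("utf-8").strip()
    # Per the comment above, this has to happen before libhdfs starts the JVM.
    environ["CLASSPATH"] = classpath
    return classpath


# Exercise the helper with a stand-in runner and a plain dict for the environment.
fake_env = {}
result = set_hadoop_classpath(
    run=lambda cmd: b"/opt/hadoop/a.jar:/opt/hadoop/b.jar\n",
    environ=fake_env,
)
```

In real use the function would be called with its defaults, before any HDFS connection is opened in the process.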
[jira] [Created] (ARROW-1323) [GLib] Add garrow_boolean_array_get_values()
Kouhei Sutou created ARROW-1323: --- Summary: [GLib] Add garrow_boolean_array_get_values() Key: ARROW-1323 URL: https://issues.apache.org/jira/browse/ARROW-1323 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou Priority: Minor Fix For: 0.6.0 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1317) [Python] Add function to set Hadoop CLASSPATH
[ https://issues.apache.org/jira/browse/ARROW-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1317: Summary: [Python] Add function to set Hadoop CLASSPATH (was: hdfs environment variables) > [Python] Add function to set Hadoop CLASSPATH > -- > > Key: ARROW-1317 > URL: https://issues.apache.org/jira/browse/ARROW-1317 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > > Getting access to hdfs via libhdfs requires the setting of several > environment variables. > Many of these paths should be auto-detectable requiring less or perhaps even > no information from the user. This would lower the access barrier to hdfs for > a non-dev user. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1319) [Python] Add additional HDFS filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111084#comment-16111084 ] Wes McKinney commented on ARROW-1319: - Quite a few of them were added in ARROW-1301. Can you make a list of which additional ones are needed (that are not accounted for by other JIRAs already)? > [Python] Add additional HDFS filesystem methods > --- > > Key: ARROW-1319 > URL: https://issues.apache.org/jira/browse/ARROW-1319 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > > The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html > contains a wider set of file-system methods than arrow's python bindings. > These are probably simple to implement for arrow-hdfs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1319) [Python] Add additional HDFS filesystem methods
[ https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1319: Summary: [Python] Add additional HDFS filesystem methods (was: hdfs methods) > [Python] Add additional HDFS filesystem methods > --- > > Key: ARROW-1319 > URL: https://issues.apache.org/jira/browse/ARROW-1319 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > > The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html > contains a wider set of file-system methods than arrow's python bindings. > These are probably simple to implement for arrow-hdfs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1322) hdfs: encryption-at-rest and secure transport
Martin Durant created ARROW-1322: Summary: hdfs: encryption-at-rest and secure transport Key: ARROW-1322 URL: https://issues.apache.org/jira/browse/ARROW-1322 Project: Apache Arrow Issue Type: Wish Reporter: Martin Durant HDFS provides for encrypted data transfer and encryption of data on-disc (e.g., via KMS records). It would be nice to see these available within arrow-hdfs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1318) [C++] hdfs access with auth
[ https://issues.apache.org/jira/browse/ARROW-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1318:
    Summary: [C++] hdfs access with auth (was: hdfs access with auth)

> [C++] hdfs access with auth
>
> Key: ARROW-1318
> URL: https://issues.apache.org/jira/browse/ARROW-1318
> Project: Apache Arrow
> Issue Type: Test
> Components: C++
> Reporter: Martin Durant
>
> A wide variety of authentication schemes are available in hadoop.
> This issue is to track whether libhdfs can successfully operate with them.
> The list includes:
> - user/password
> - basic kerberos (via kinit and via keytabs)
> - kerberos with active directory and single-sign-on
> - "privacy" and "integrity" modes
> - access with hdfs delegation token
> - probably others...

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1318) [C++] hdfs access with auth
[ https://issues.apache.org/jira/browse/ARROW-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1318:
    Component/s: C++

> [C++] hdfs access with auth
>
> Key: ARROW-1318
> URL: https://issues.apache.org/jira/browse/ARROW-1318
> Project: Apache Arrow
> Issue Type: Test
> Components: C++
> Reporter: Martin Durant
>
> A wide variety of authentication schemes are available in hadoop.
> This issue is to track whether libhdfs can successfully operate with them.
> The list includes:
> - user/password
> - basic kerberos (via kinit and via keytabs)
> - kerberos with active directory and single-sign-on
> - "privacy" and "integrity" modes
> - access with hdfs delegation token
> - probably others...

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1321) hdfs delegation token functions
Martin Durant created ARROW-1321: Summary: hdfs delegation token functions Key: ARROW-1321 URL: https://issues.apache.org/jira/browse/ARROW-1321 Project: Apache Arrow Issue Type: Improvement Reporter: Martin Durant HDFS can create delegation tokens for an authenticated user, so that access to the file-system from other processes/machines can authenticate as that same user without having to use third-party identity systems (kerberos, etc.). arrow-hdfs should provide the ability to accept, create, renew and cancel delegation tokens. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (ARROW-1320) hdfs block locations
[ https://issues.apache.org/jira/browse/ARROW-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-1320. --- Resolution: Duplicate Duplicate of ARROW-473 > hdfs block locations > > > Key: ARROW-1320 > URL: https://issues.apache.org/jira/browse/ARROW-1320 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Martin Durant > > To provide a function which can return the set of machines on which the data > blocks of a given hdfs file are stored. This is best for scheduling systems > (e.g., dask) which can move the computation to the machine which has the > data, and so cut out network data traffic. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1316) hdfs connector stand-alone
[ https://issues.apache.org/jira/browse/ARROW-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111075#comment-16111075 ]

Wes McKinney commented on ARROW-1316:

I am not sure this is possible. To use libhdfs to access an HDFS cluster, you need:

* A JVM installation
* The Hadoop client libraries in your classpath
* File system-like API for the libhdfs library

These are provided respectively by the JDK install, the Hadoop install, and the Arrow libraries. The Arrow interface to HDFS provides an API consistent with that of other file types (https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h). This is the same approach used in TensorFlow (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.h) and other projects.

> hdfs connector stand-alone
>
> Key: ARROW-1316
> URL: https://issues.apache.org/jira/browse/ARROW-1316
> Project: Apache Arrow
> Issue Type: Wish
> Reporter: Martin Durant
>
> Currently, access to hdfs via libhdfs requires the whole of arrow, a java
> installation and a hadoop installation. This setup is indeed common, such as
> on "cluster edge-nodes".
> This issue is posted with the wish that hdfs file-system access could be done
> without needing the whole set of installations, above.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
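The prerequisites listed in the comment above can be checked up front with a small sketch like this one. The mapping from each prerequisite to an environment variable (JAVA_HOME, HADOOP_HOME, CLASSPATH) is an assumption of the sketch rather than something stated in the thread, and `missing_libhdfs_prereqs` is a hypothetical helper name.

```python
import os

# Assumed environment variables for each prerequisite (not taken from the thread).
REQUIRED_VARS = {
    "JAVA_HOME": "a JVM installation",
    "HADOOP_HOME": "a Hadoop installation",
    "CLASSPATH": "the Hadoop client libraries on the classpath",
}


def missing_libhdfs_prereqs(environ=os.environ):
    """Return descriptions of prerequisites whose variable is unset or empty."""
    return [desc for var, desc in REQUIRED_VARS.items() if not environ.get(var)]


# Only JAVA_HOME is set here, so the two Hadoop-related entries are reported.
example = missing_libhdfs_prereqs({"JAVA_HOME": "/usr/lib/jvm/default"})
```

A check of this kind only confirms that the variables are present. It does not verify that the JVM or the Hadoop jars they point at are usable.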
[jira] [Created] (ARROW-1320) hdfs block locations
Martin Durant created ARROW-1320: Summary: hdfs block locations Key: ARROW-1320 URL: https://issues.apache.org/jira/browse/ARROW-1320 Project: Apache Arrow Issue Type: Improvement Reporter: Martin Durant To provide a function which can return the set of machines on which the data blocks of a given hdfs file are stored. This is best for scheduling systems (e.g., dask) which can move the computation to the machine which has the data, and so cut out network data traffic. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
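To illustrate why a scheduler would want block locations, here is a rough sketch of locality-aware assignment. It does not use any Arrow or dask API: the `block_hosts` mapping is assumed to be what a block-location function would return, and `assign_blocks` is a hypothetical helper.

```python
def assign_blocks(block_hosts, workers):
    """Map each block id to a worker, preferring workers that store the block."""
    load = {worker: 0 for worker in workers}
    assignment = {}
    for block, hosts in block_hosts.items():
        # Workers that already hold this block can read it without network traffic.
        local = [worker for worker in workers if worker in hosts]
        candidates = local or list(workers)
        # Choose the least-loaded candidate; ties go to the earliest worker listed.
        chosen = min(candidates, key=lambda worker: load[worker])
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


# Hypothetical block-to-host mapping; block 2 lives on a host with no worker.
block_hosts = {0: {"node1", "node2"}, 1: {"node2", "node3"}, 2: {"node9"}}
plan = assign_blocks(block_hosts, ["node1", "node2", "node3"])
```

Blocks 0 and 1 are read locally, and only block 2, whose replicas sit on a host without a worker, has to cross the network.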
[jira] [Created] (ARROW-1319) hdfs methods
Martin Durant created ARROW-1319: Summary: hdfs methods Key: ARROW-1319 URL: https://issues.apache.org/jira/browse/ARROW-1319 Project: Apache Arrow Issue Type: Improvement Reporter: Martin Durant The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html contains a wider set of file-system methods than arrow's python bindings. These are probably simple to implement for arrow-hdfs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1318) hdfs access with auth
Martin Durant created ARROW-1318:

    Summary: hdfs access with auth
    Key: ARROW-1318
    URL: https://issues.apache.org/jira/browse/ARROW-1318
    Project: Apache Arrow
    Issue Type: Test
    Reporter: Martin Durant

A wide variety of authentication schemes are available in hadoop. This issue is to track whether libhdfs can successfully operate with them. The list includes:
- user/password
- basic kerberos (via kinit and via keytabs)
- kerberos with active directory and single-sign-on
- "privacy" and "integrity" modes
- access with hdfs delegation token
- probably others...

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1317) hdfs environment variables
Martin Durant created ARROW-1317: Summary: hdfs environment variables Key: ARROW-1317 URL: https://issues.apache.org/jira/browse/ARROW-1317 Project: Apache Arrow Issue Type: Improvement Reporter: Martin Durant Getting access to hdfs via libhdfs requires the setting of several environment variables. Many of these paths should be auto-detectable requiring less or perhaps even no information from the user. This would lower the access barrier to hdfs for a non-dev user. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ARROW-1313) [C++/Python] Add troubleshooting section for setting up HDFS JNI interface
[ https://issues.apache.org/jira/browse/ARROW-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111067#comment-16111067 ] Wes McKinney commented on ARROW-1313: - My understanding is that the safest thing to do in production is use the libhdfs.so that is shipped with a particular Hadoop distribution (since there may be internal details that are particular to that version of Hadoop); while the public C API is the same between versions, in theory there could be internal details in the JNI implementation that break the Java "ABI". The Hadoop community would be able to give better advice > [C++/Python] Add troubleshooting section for setting up HDFS JNI interface > -- > > Key: ARROW-1313 > URL: https://issues.apache.org/jira/browse/ARROW-1313 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation > Environment: linux trusty-cdh5 >Reporter: Martin Durant > Fix For: 0.6.0 > > > The hadoop library directory contains a libhdfs.a and a libhadoop.so but no > libhdfs.so. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1316) hdfs connector stand-alone
Martin Durant created ARROW-1316: Summary: hdfs connector stand-alone Key: ARROW-1316 URL: https://issues.apache.org/jira/browse/ARROW-1316 Project: Apache Arrow Issue Type: Wish Reporter: Martin Durant Currently, access to HDFS via libhdfs requires the whole of Arrow, a Java installation and a Hadoop installation. This setup is indeed common, such as on "cluster edge-nodes". This issue is posted with the wish that HDFS file-system access could be done without needing the whole set of installations above.
[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit
[ https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111057#comment-16111057 ] Wes McKinney commented on ARROW-786: OK, sweet, that would be awesome. > [Format] In-memory format for 128-bit Decimals, handling of sign bit > > > Key: ARROW-786 > URL: https://issues.apache.org/jira/browse/ARROW-786 > Project: Apache Arrow > Issue Type: Improvement > Components: Format >Reporter: Wes McKinney > Fix For: 0.7.0 > > > cc [~cpcloud] > We found in ARROW-655 that we needed to add an extra bit for signedness for > decimals stored as 128-bit values to be able to use the Boost multiprecision > libraries. This makes Decimal128 not fit completely neatly as a 16-byte fixed > size binary value, and more of a {{struct fixed_size_binary(16)>}}. What is the current format in the Java > implementation? We will need to document the memory layout for decimals that > maximizes compatibility across languages and eventually implement integration > tests for IPC.
[jira] [Commented] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111048#comment-16111048 ] Wes McKinney commented on ARROW-1314: - Note that the {{pyarrow.hdfs}} namespace is new in 0.6.0 (releasing in the next couple of weeks); to connect with <= 0.5.0, use {{pyarrow.HdfsClient}}. > [C++] Provide installation guidance for macOS users who wish to use JNI-based > HDFS interface > > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation > Environment: mac 10.12.6 >Reporter: Martin Durant > > Having set > HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1 (straight download, does > contain libhdfs.so in native) > java openjdk version "1.8.0_121" in anaconda install directory > and CLASSPATH as in the docs (too long to show) > ``` > In [3]: pa.hdfs > --- > AttributeError Traceback (most recent call last) > in () > ----> 1 pa.hdfs > AttributeError: module 'pyarrow' has no attribute 'hdfs' > In [4]: pa.have_libhdfs() > Out[4]: False > In [5]: pa.have_libhdfs3() > Out[5]: False > ``` > (I also have libhdfs3.so - not .dylib - but it is not found even if included > in DYLD_FALLBACK_LIBRARY_PATH)
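The version split described in this comment could be handled by probing for whichever entry point is present. The sketch below is only an illustration: the helper name `pick_hdfs_entry_point` is hypothetical, and it is run against stand-in objects rather than a real pyarrow import, so it makes no claim about how either entry point is called.

```python
from types import SimpleNamespace


def pick_hdfs_entry_point(pa_module):
    """Return the name of the HDFS entry point the module exposes, or None."""
    if hasattr(pa_module, "hdfs"):
        return "hdfs"        # namespace described as new in 0.6.0
    if hasattr(pa_module, "HdfsClient"):
        return "HdfsClient"  # entry point for 0.5.0 and earlier
    return None


# Stand-ins for the two generations of the package; no real pyarrow is imported.
new_style = SimpleNamespace(hdfs=object())
old_style = SimpleNamespace(HdfsClient=object())
print(pick_hdfs_entry_point(new_style))  # hdfs
print(pick_hdfs_entry_point(old_style))  # HdfsClient
```

Checking the attribute rather than parsing a version string means the probe works as long as either name is present on the installed package.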
[jira] [Updated] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1314: Summary: [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface (was: libhdfs installation didn't work - mac) > [C++] Provide installation guidance for macOS users who wish to use JNI-based > HDFS interface > > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Environment: mac 10.12.6 >Reporter: Martin Durant
[jira] [Updated] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1314: Component/s: Documentation > [C++] Provide installation guidance for macOS users who wish to use JNI-based > HDFS interface > > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation > Environment: mac 10.12.6 >Reporter: Martin Durant
[jira] [Created] (ARROW-1315) [GLib] Status check of arrow::ArrayBuilder::Finish() is missing
Kouhei Sutou created ARROW-1315: --- Summary: [GLib] Status check of arrow::ArrayBuilder::Finish() is missing Key: ARROW-1315 URL: https://issues.apache.org/jira/browse/ARROW-1315 Project: Apache Arrow Issue Type: Improvement Components: GLib Reporter: Kouhei Sutou Assignee: Kouhei Sutou Priority: Minor Fix For: 0.6.0
[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111043#comment-16111043 ] Wes McKinney commented on ARROW-1314: - I don't think Linux shared libraries (like libhdfs.so, libhdfs3.so) can be loaded on Mac. So libhdfs needs to be compiled for the macOS architecture. It looks like some other projects have documented this; we could go through the exercise and add it to the project documentation: https://github.com/forward/node-hdfs#mac-osx > libhdfs installation didn't work - mac > -- > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Environment: mac 10.12.6 >Reporter: Martin Durant
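One way to check the suspicion in this comment is to look at the first four bytes of the library file: Linux shared objects are ELF files, while macOS libraries are Mach-O files. The sketch below is a minimal illustration under that assumption; the function names are hypothetical, and it only classifies the container format without attempting to load anything.

```python
ELF_MAGIC = b"\x7fELF"

# Mach-O magic numbers in both byte orders (32-bit, 64-bit) plus the fat/universal
# header. Note that Java class files also begin with ca fe ba be.
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",  # 32-bit
    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe",  # 64-bit
    b"\xca\xfe\xba\xbe",                        # fat binary
}


def classify_binary(header: bytes) -> str:
    """Classify a file by its leading bytes as 'elf', 'mach-o' or 'unknown'."""
    magic = header[:4]
    if magic == ELF_MAGIC:
        return "elf"
    if magic in MACHO_MAGICS:
        return "mach-o"
    return "unknown"


def classify_file(path: str) -> str:
    """Read the first four bytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return classify_binary(handle.read(4))
```

A `libhdfs.so` from a stock Hadoop download that classifies as `elf` would be consistent with the explanation that it was built for Linux and cannot be loaded on macOS.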
[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111041#comment-16111041 ] Martin Durant commented on ARROW-1314: -- It is the general distribution, e.g., http://mirror.csclub.uwaterloo.ca/apache/hadoop/common/ (which is, of course, just java). If the answer is "you shouldn't run hadoop on mac", I understand; however, I did get hdfs3 working with this distro. > libhdfs installation didn't work - mac > -- > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Environment: mac 10.12.6 >Reporter: Martin Durant
[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac
[ https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111021#comment-16111021 ] Wes McKinney commented on ARROW-1314: - Where did you obtain the Hadoop distribution for Mac? > libhdfs installation didn't work - mac > -- > > Key: ARROW-1314 > URL: https://issues.apache.org/jira/browse/ARROW-1314 > Project: Apache Arrow > Issue Type: Improvement > Environment: mac 10.12.6 >Reporter: Martin Durant
[jira] [Commented] (ARROW-1313) [C++/Python] Add troubleshooting section for setting up HDFS JNI interface
[ https://issues.apache.org/jira/browse/ARROW-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111018#comment-16111018 ] Martin Durant commented on ARROW-1313: -- That would install the whole of Hadoop as system packages, so there would be two separate Hadoop installations, counting the CDH install from before. libhdfs.so is only 200kB; can it not be distributed? > [C++/Python] Add troubleshooting section for setting up HDFS JNI interface > -- > > Key: ARROW-1313 > URL: https://issues.apache.org/jira/browse/ARROW-1313 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation > Environment: linux trusty-cdh5 >Reporter: Martin Durant > Fix For: 0.6.0
[jira] [Assigned] (ARROW-1296) [Java] templates/FixValueVectors reset() method doesn't set allocationSizeInBytes correctly
[ https://issues.apache.org/jira/browse/ARROW-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin reassigned ARROW-1296: - Assignee: Li Jin > [Java] templates/FixValueVectors reset() method doesn't set > allocationSizeInBytes correctly > --- > > Key: ARROW-1296 > URL: https://issues.apache.org/jira/browse/ARROW-1296 > Project: Apache Arrow > Issue Type: Bug > Components: Java - Vectors >Affects Versions: 0.5.0 >Reporter: Li Jin >Assignee: Li Jin > Fix For: 0.6.0 > > > [~siddteotia] pointed out reset() in templates/FixValueVectors.java should > set: > {code} > allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type.width} > {code} > instead of: > {code} > allocationSizeInBytes = INITIAL_VALUE_ALLOCATION > {code}
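The effect of the missing multiplication in ARROW-1296 can be shown with a little arithmetic. The sketch below assumes that `INITIAL_VALUE_ALLOCATION` is a count of values, that `type.width` is the per-value width in bytes, and that the constant is 4096; all three are assumptions for illustration and should be checked against the Arrow Java source.

```python
INITIAL_VALUE_ALLOCATION = 4096  # assumed number of values, not taken from the issue


def allocation_before_fix(type_width):
    """Byte size as reset() computed it: the width is ignored."""
    return INITIAL_VALUE_ALLOCATION


def allocation_after_fix(type_width):
    """Byte size with the proposed fix: value count times width in bytes."""
    return INITIAL_VALUE_ALLOCATION * type_width


def values_that_fit(allocation_size_in_bytes, type_width):
    """How many fixed-width values fit in the given number of bytes."""
    return allocation_size_in_bytes // type_width


for width in (1, 4, 8):
    before = values_that_fit(allocation_before_fix(width), width)
    after = values_that_fit(allocation_after_fix(width), width)
    print(f"width={width}: before={before} values, after={after} values")
```

Under these assumptions, an 8-byte type would get room for only 512 values after reset() instead of the intended 4096, while 1-byte types would be unaffected.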