[jira] [Updated] (ARROW-1237) [JAVA] Expose the ability to set lastSet

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SIDDHARTH TEOTIA updated ARROW-1237:

Summary: [JAVA] Expose the ability to set lastSet   (was: Expose the 
ability to set lastSet )

> [JAVA] Expose the ability to set lastSet 
> -
>
> Key: ARROW-1237
> URL: https://issues.apache.org/jira/browse/ARROW-1237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: SIDDHARTH TEOTIA
>Assignee: SIDDHARTH TEOTIA
>Priority: Minor
> Fix For: 0.6.0
>
>
> Expose the ability to set lastSet on vectors such that 
> Mutator.setValueCount() doesn't blow away the vector if we have previously 
> loaded it.
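
A toy Python model may help illustrate the failure mode described above. It is a sketch under assumptions, not Arrow's Java implementation: the class ToyVarWidthVector and the names load, set_last_set and set_value_count are hypothetical. It assumes that setting the value count back-fills offsets for every slot after lastSet, so a vector whose buffers were loaded directly (leaving lastSet at -1) has its offsets overwritten.

```python
class ToyVarWidthVector:
    """Toy stand-in for a nullable variable-width vector (hypothetical, not Arrow's class)."""

    def __init__(self, capacity):
        self.offsets = [0] * (capacity + 1)
        self.last_set = -1  # index of the last slot written through the mutator

    def load(self, offsets):
        # Simulates loading already-populated buffers; last_set is NOT updated.
        self.offsets[: len(offsets)] = offsets

    def set_last_set(self, index):
        # The hook the issue asks to expose (name is an assumption).
        self.last_set = index

    def set_value_count(self, count):
        # Back-fill offsets for slots that were never written, starting after last_set.
        for i in range(self.last_set + 1, count):
            self.offsets[i + 1] = self.offsets[i]
        self.last_set = count - 1


def demo(fix_last_set):
    """Load three values, optionally record last_set, then set the value count."""
    v = ToyVarWidthVector(capacity=3)
    v.load([0, 2, 5, 9])
    if fix_last_set:
        v.set_last_set(2)
    v.set_value_count(3)
    return v.offsets
```

In this model, demo(False) wipes the loaded offsets because the back-fill starts at slot 0, while demo(True) leaves them intact because the caller first records that slots 0 through 2 are already populated.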



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-276) [JAVA] Nullable Value Vectors should extend BaseValueVector instead of BaseDataValueVector

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SIDDHARTH TEOTIA updated ARROW-276:
---
Summary: [JAVA] Nullable Value Vectors should extend BaseValueVector 
instead of BaseDataValueVector  (was: [JAVA] )

> [JAVA] Nullable Value Vectors should extend BaseValueVector instead of 
> BaseDataValueVector
> --
>
> Key: ARROW-276
> URL: https://issues.apache.org/jira/browse/ARROW-276
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Julien Le Dem
>Assignee: SIDDHARTH TEOTIA
> Fix For: 0.6.0
>
>
> Currently Nullable Vectors have an unused data vector because of this.





[jira] [Comment Edited] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112039#comment-16112039
 ] 

SIDDHARTH TEOTIA edited comment on ARROW-886 at 8/3/17 1:39 AM:


[~elahrvivaz],

As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are 
you fine with undoing the changes?

(1) Don't explicitly reallocate the offsetVector in realloc() function of 
Variable Length Vectors.
(2) Doing (1) will break the unit test added as part of PR 591 so we need to 
remove that as well.

I have created a PR for the above two items, 
https://github.com/apache/arrow/pull/937, which basically reverts your change for 
them.

Thanks,
Siddharth


was (Author: siddteotia):
[~elahrvivaz],

As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are 
you fine with undoing the changes:

(1) Don't explicitly reallocate the offsetVector in realloc() function of 
Variable Length Vectors.
(2) Doing (1) will break the unit test added as part of PR 591 so we need to 
remove that as well.

I have created a PR  for the above two items -- 
https://github.com/apache/arrow/pull/937 this basically reverts your change for 
the above items.

Thanks,
Siddharth

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>






[jira] [Updated] (ARROW-1249) [JAVA] Expose the fillEmpties function from NullableVector.mutator

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SIDDHARTH TEOTIA updated ARROW-1249:

Summary: [JAVA] Expose the fillEmpties function from 
NullableVector.mutator  (was: Expose the fillEmpties function from 
NullableVector.mutator)

> [JAVA] Expose the fillEmpties function from NullableVector.mutator
> -
>
> Key: ARROW-1249
> URL: https://issues.apache.org/jira/browse/ARROW-1249
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: SIDDHARTH TEOTIA
>Assignee: SIDDHARTH TEOTIA
>Priority: Minor
> Fix For: 0.6.0
>
>






[jira] [Updated] (ARROW-1310) [JAVA] Revert ARROW-886

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SIDDHARTH TEOTIA updated ARROW-1310:

Summary: [JAVA] Revert ARROW-886  (was: Revert ARROW-886)

> [JAVA] Revert ARROW-886
> ---
>
> Key: ARROW-1310
> URL: https://issues.apache.org/jira/browse/ARROW-1310
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: SIDDHARTH TEOTIA
>Assignee: SIDDHARTH TEOTIA
>
> We don't need to reallocate the underlying offsetVector every time a variable 
> length vector is reallocated. 
> Reallocation of offsetVector is taken care of by setSafe() function of the 
> offsetVector. 
> The setSafe() function of the Variable Length Vector will decide whether to 
> call realloc() or not. However, this should not decide whether offsetVector 
> needs reallocation or not. When setSafe() calls offsetVector.setSafe(), the 
> latter can decide whether to reallocate the offset vector or not.
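
The division of responsibility described above can be sketched with a small Python model. This is an illustration under assumptions, not the Arrow Java code: ToyOffsetVector and ToyVarLenVector are hypothetical classes, and the model only supports writing slots in order. The point it shows is that the outer vector's realloc() grows only the data buffer, while the offset vector's own set_safe() decides when the offsets need to grow.

```python
class ToyOffsetVector:
    """Hypothetical offset vector that grows itself inside set_safe()."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.realloc_count = 0

    def realloc(self):
        # Double the capacity, keeping existing offsets.
        self.data.extend([0] * len(self.data))
        self.realloc_count += 1

    def set_safe(self, index, value):
        # The offset vector alone decides whether it needs more room.
        while index >= len(self.data):
            self.realloc()
        self.data[index] = value


class ToyVarLenVector:
    """Hypothetical variable-length vector; slots must be written in order."""

    def __init__(self, data_capacity, slot_capacity):
        self.buf = bytearray(data_capacity)
        self.offsets = ToyOffsetVector(slot_capacity + 1)
        self.data_reallocs = 0

    def realloc(self):
        # Grows only the data buffer; the offsets are deliberately left alone.
        self.buf.extend(bytearray(len(self.buf)))
        self.data_reallocs += 1

    def set_safe(self, index, value):
        start = self.offsets.data[index]
        end = start + len(value)
        while end > len(self.buf):
            self.realloc()
        self.buf[start:end] = value
        # Delegating here lets the offset vector reallocate independently.
        self.offsets.set_safe(index + 1, end)


def run_demo():
    v = ToyVarLenVector(data_capacity=4, slot_capacity=1)
    v.set_safe(0, b"abcdef")  # data buffer too small, offsets fine
    v.set_safe(1, b"g")       # data buffer fine, offsets too small
    v.set_safe(2, b"h")       # both fine
    return v.data_reallocs, v.offsets.realloc_count, v.offsets.data[:4], bytes(v.buf[:8])
```

In run_demo() the data buffer and the offsets each reallocate exactly once, at different writes, which is the independence the issue argues for.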





[jira] [Updated] (ARROW-276) [JAVA]

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SIDDHARTH TEOTIA updated ARROW-276:
---
Summary: [JAVA]   (was: Nullable Vectors should extend BaseValueVector and 
not BaseDataValueVector)

> [JAVA] 
> ---
>
> Key: ARROW-276
> URL: https://issues.apache.org/jira/browse/ARROW-276
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Julien Le Dem
>Assignee: SIDDHARTH TEOTIA
> Fix For: 0.6.0
>
>
> Currently Nullable Vectors have an unused data vector because of this.





[jira] [Commented] (ARROW-886) VariableLengthVectors don't reAlloc offsets

2017-08-02 Thread SIDDHARTH TEOTIA (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112039#comment-16112039
 ] 

SIDDHARTH TEOTIA commented on ARROW-886:


[~elahrvivaz],

As far as the commit https://github.com/apache/arrow/pull/591 is concerned, are 
you fine with undoing the changes:

(1) Don't explicitly reallocate the offsetVector in realloc() function of 
Variable Length Vectors.
(2) Doing (1) will break the unit test added as part of PR 591 so we need to 
remove that as well.

I have created a PR for the above two items, 
https://github.com/apache/arrow/pull/937, which basically reverts your change for 
them.

Thanks,
Siddharth

> VariableLengthVectors don't reAlloc offsets
> ---
>
> Key: ARROW-886
> URL: https://issues.apache.org/jira/browse/ARROW-886
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.3.0
>Reporter: Emilio Lahr-Vivaz
>Assignee: Emilio Lahr-Vivaz
> Fix For: 0.3.0
>
>






[jira] [Commented] (ARROW-1326) [Python] Fix Sphinx build in Travis CI

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111964#comment-16111964
 ] 

Wes McKinney commented on ARROW-1326:
-

PR: https://github.com/apache/arrow/pull/936

> [Python] Fix Sphinx build in Travis CI
> --
>
> Key: ARROW-1326
> URL: https://issues.apache.org/jira/browse/ARROW-1326
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> This started happening at some point, but isn't failing the build:
> https://travis-ci.org/apache/arrow/jobs/260259569
> {code}
> running build_sphinx
> creating /home/travis/build/apache/arrow/python/doc/_build
> creating /home/travis/build/apache/arrow/python/doc/_build/doctrees
> creating /home/travis/build/apache/arrow/python/doc/_build/html
> Running Sphinx v1.6.3
> loading pickled environment... not yet created
> [autosummary] generating autosummary for: api.rst, data.rst, development.rst, 
> filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, 
> memory.rst, pandas.rst, parquet.rst, plasma.rst
> /home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56:
>  UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause 
> errors
> /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40:
>  RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and 
> will be removed in Sphinx 1.7, please use docutils' instead.
>   RemovedInSphinx17Warning)
> WARNING: [autosummary] failed to import 'pyarrow.Array': no module named 
> pyarrow.Array
> WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named 
> pyarrow.ArrayValue
> WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module 
> named pyarrow.BinaryArray
> WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module 
> named pyarrow.BinaryValue
> WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module 
> named pyarrow.BooleanArray
> WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module 
> named pyarrow.BooleanValue
> WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named 
> pyarrow.Buffer
> 
> {code}





[jira] [Updated] (ARROW-1295) [Plasma] Investigate test_plasma.py test failures in docker

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1295:

Fix Version/s: (was: 0.6.0)
   0.7.0

> [Plasma] Investigate test_plasma.py test failures in docker
> ---
>
> Key: ARROW-1295
> URL: https://issues.apache.org/jira/browse/ARROW-1295
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
> Fix For: 0.7.0
>
>
> This happens in the manylinux build, see:
> https://github.com/apache/arrow/pull/912





[jira] [Assigned] (ARROW-1326) [Python] Fix Sphinx build in Travis CI

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1326:
---

Assignee: Wes McKinney

> [Python] Fix Sphinx build in Travis CI
> --
>
> Key: ARROW-1326
> URL: https://issues.apache.org/jira/browse/ARROW-1326
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> This started happening at some point, but isn't failing the build:
> https://travis-ci.org/apache/arrow/jobs/260259569
> {code}
> running build_sphinx
> creating /home/travis/build/apache/arrow/python/doc/_build
> creating /home/travis/build/apache/arrow/python/doc/_build/doctrees
> creating /home/travis/build/apache/arrow/python/doc/_build/html
> Running Sphinx v1.6.3
> loading pickled environment... not yet created
> [autosummary] generating autosummary for: api.rst, data.rst, development.rst, 
> filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, 
> memory.rst, pandas.rst, parquet.rst, plasma.rst
> /home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56:
>  UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause 
> errors
> /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40:
>  RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and 
> will be removed in Sphinx 1.7, please use docutils' instead.
>   RemovedInSphinx17Warning)
> WARNING: [autosummary] failed to import 'pyarrow.Array': no module named 
> pyarrow.Array
> WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named 
> pyarrow.ArrayValue
> WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module 
> named pyarrow.BinaryArray
> WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module 
> named pyarrow.BinaryValue
> WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module 
> named pyarrow.BooleanArray
> WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module 
> named pyarrow.BooleanValue
> WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named 
> pyarrow.Buffer
> 
> {code}





[jira] [Updated] (ARROW-507) [C++/Python] Construct List container from offsets and values subarrays

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-507:
---
Fix Version/s: (was: 0.6.0)
   0.7.0

> [C++/Python] Construct List container from offsets and values subarrays
> ---
>
> Key: ARROW-507
> URL: https://issues.apache.org/jira/browse/ARROW-507
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> This is the inverse operation from flattening a list type into its child 
> values (dropping the offsets)
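
The relationship between the two operations can be sketched in plain Python. This is an illustration only: flatten and list_from_arrays are hypothetical helper names, they work on ordinary Python lists rather than Arrow arrays, and nulls are ignored for simplicity.

```python
def flatten(lists):
    """Split a list of lists into an offsets list and a flat values list."""
    offsets, values = [0], []
    for item in lists:
        values.extend(item)
        offsets.append(len(values))
    return offsets, values


def list_from_arrays(offsets, values):
    """Inverse of flatten: rebuild the nested lists from offsets and values."""
    if not offsets:
        raise ValueError("offsets must contain at least one entry")
    if offsets[0] < 0 or offsets[-1] > len(values):
        raise ValueError("offsets fall outside the values array")
    for start, end in zip(offsets, offsets[1:]):
        if start > end:
            raise ValueError("offsets must be non-decreasing")
    # Slot i covers values[offsets[i]:offsets[i + 1]].
    return [values[start:end] for start, end in zip(offsets, offsets[1:])]
```

Round-tripping a nested list through flatten and then list_from_arrays should return the original, including empty inner lists, which is what makes the second function the inverse of the first.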





[jira] [Comment Edited] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953
 ] 

Wes McKinney edited comment on ARROW-1302 at 8/3/17 12:06 AM:
--

Would changing MAKE to CMAKE_MAKE_PROGRAM in our CMake scripts solve the 
problem?


was (Author: wesmckinn):
Would changing ${MAKE} to ${CMAKE_MAKE_PROGRAM} in our CMake scripts solve the 
problem?

> C++: ${MAKE} variable not set sometimes on older MacOS installations
> 
>
> Key: ARROW-1302
> URL: https://issues.apache.org/jira/browse/ARROW-1302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.5.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: 0.6.0
>
>
> If the variable is not set, we may need to use `find_program` to detect make: 
> https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110





[jira] [Commented] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953
 ] 

Wes McKinney commented on ARROW-1302:
-

Would changing {{${MAKE}}} to {{${CMAKE_MAKE_PROGRAM}}} in our CMake scripts 
solve the problem?

> C++: ${MAKE} variable not set sometimes on older MacOS installations
> 
>
> Key: ARROW-1302
> URL: https://issues.apache.org/jira/browse/ARROW-1302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.5.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: 0.6.0
>
>
> If the variable is not set, we may need to use `find_program` to detect make: 
> https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110





[jira] [Comment Edited] (ARROW-1302) C++: ${MAKE} variable not set sometimes on older MacOS installations

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111953#comment-16111953
 ] 

Wes McKinney edited comment on ARROW-1302 at 8/3/17 12:06 AM:
--

Would changing ${MAKE} to ${CMAKE_MAKE_PROGRAM} in our CMake scripts solve the 
problem?


was (Author: wesmckinn):
Would changing {{${MAKE}}} to {{${CMAKE_MAKE_PROGRAM}}} in our CMake scripts 
solve the problem?

> C++: ${MAKE} variable not set sometimes on older MacOS installations
> 
>
> Key: ARROW-1302
> URL: https://issues.apache.org/jira/browse/ARROW-1302
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.5.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
> Fix For: 0.6.0
>
>
> If the variable is not set, we may need to use `find_program` to detect make: 
> https://travis-ci.org/xhochy/pyarrow-macos-wheels/builds/259750110





[jira] [Assigned] (ARROW-1116) [Python] Create single external GitHub repo building for building wheels for all platforms in one shot

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1116:
---

Assignee: Wes McKinney

> [Python] Create single external GitHub repo building for building wheels for 
> all platforms in one shot
> --
>
> Key: ARROW-1116
> URL: https://issues.apache.org/jira/browse/ARROW-1116
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> * manylinux1 (for Linux)
> * macOS
> * Windows
> We have all the machinery to do this, but we need to set things up to upload 
> to a single BinTray location
> https://github.com/xhochy/pyarrow-macos-wheels
> https://github.com/wesm/pyarrow-windows-wheels





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111935#comment-16111935
 ] 

Wes McKinney commented on ARROW-1282:
-

Great, thank you. If you can provide a gdb backtrace from the hung process, that 
would be helpful.

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses jemalloc 
> version 3.6.0 according to `apt` metadata.*





[jira] [Assigned] (ARROW-1272) [Python] Add script to arrow-dist to generate and upload manylinux1 Python wheels

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1272:
---

Assignee: Wes McKinney

> [Python] Add script to arrow-dist to generate and upload manylinux1 Python 
> wheels
> -
>
> Key: ARROW-1272
> URL: https://issues.apache.org/jira/browse/ARROW-1272
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>






[jira] [Updated] (ARROW-1307) [Python] Add pandas serialization section + Feather API to Sphinx docs

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1307:

Fix Version/s: (was: 0.6.0)
   0.7.0

> [Python] Add pandas serialization section + Feather API to Sphinx docs
> --
>
> Key: ARROW-1307
> URL: https://issues.apache.org/jira/browse/ARROW-1307
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>






[jira] [Updated] (ARROW-1021) [Python] Add documentation about using pyarrow from other Cython and C++ projects

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1021:

Fix Version/s: (was: 0.6.0)
   0.7.0

> [Python] Add documentation about using pyarrow from other Cython and C++ 
> projects
> -
>
> Key: ARROW-1021
> URL: https://issues.apache.org/jira/browse/ARROW-1021
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>
> Follow up work to ARROW-819, ARROW-714





[jira] [Resolved] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1312.
-
Resolution: Fixed

Issue resolved by pull request 935
[https://github.com/apache/arrow/pull/935]

> [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
> -
>
> Key: ARROW-1312
> URL: https://issues.apache.org/jira/browse/ARROW-1312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is 
> not good for users





[jira] [Assigned] (ARROW-1270) [Packaging] Add Python wheel build scripts for macOS to arrow-dist

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1270:
---

Assignee: Wes McKinney  (was: Uwe L. Korn)

> [Packaging] Add Python wheel build scripts for macOS to arrow-dist
> --
>
> Key: ARROW-1270
> URL: https://issues.apache.org/jira/browse/ARROW-1270
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>






[jira] [Commented] (ARROW-1270) [Packaging] Add Python wheel build scripts for macOS to arrow-dist

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111914#comment-16111914
 ] 

Wes McKinney commented on ARROW-1270:
-

PR: https://github.com/apache/arrow-dist/pull/2

> [Packaging] Add Python wheel build scripts for macOS to arrow-dist
> --
>
> Key: ARROW-1270
> URL: https://issues.apache.org/jira/browse/ARROW-1270
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>






[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Chris Bartak (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111903#comment-16111903
 ] 

Chris Bartak commented on ARROW-1282:
-

I'm still seeing my issue with `pyarrow==0.5.0.post2`, so it must be something 
else.  It's certainly possible that it's unrelated to pyarrow, though I thought 
I had it pretty well isolated.  I'll open a new issue if I can make it 
reproducible.  Thanks for the quick upload!

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses jemalloc 
> version 3.6.0 according to `apt` metadata.*





[jira] [Created] (ARROW-1326) [Python] Fix Sphinx build in Travis CI

2017-08-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1326:
---

 Summary: [Python] Fix Sphinx build in Travis CI
 Key: ARROW-1326
 URL: https://issues.apache.org/jira/browse/ARROW-1326
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.6.0


This started happening at some point, but isn't failing the build:

https://travis-ci.org/apache/arrow/jobs/260259569

{code}
running build_sphinx
creating /home/travis/build/apache/arrow/python/doc/_build
creating /home/travis/build/apache/arrow/python/doc/_build/doctrees
creating /home/travis/build/apache/arrow/python/doc/_build/html
Running Sphinx v1.6.3
loading pickled environment... not yet created
[autosummary] generating autosummary for: api.rst, data.rst, development.rst, 
filesystems.rst, getting_involved.rst, index.rst, install.rst, ipc.rst, 
memory.rst, pandas.rst, parquet.rst, plasma.rst
/home/travis/build/apache/arrow/python/.eggs/setuptools_scm-1.15.6-py3.6.egg/setuptools_scm/git.py:56:
 UserWarning: "/home/travis/build/apache/arrow" is shallow and may cause errors
/home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/sphinx/util/compat.py:40:
 RemovedInSphinx17Warning: sphinx.util.compat.Directive is deprecated and will 
be removed in Sphinx 1.7, please use docutils' instead.
  RemovedInSphinx17Warning)
WARNING: [autosummary] failed to import 'pyarrow.Array': no module named 
pyarrow.Array
WARNING: [autosummary] failed to import 'pyarrow.ArrayValue': no module named 
pyarrow.ArrayValue
WARNING: [autosummary] failed to import 'pyarrow.BinaryArray': no module named 
pyarrow.BinaryArray
WARNING: [autosummary] failed to import 'pyarrow.BinaryValue': no module named 
pyarrow.BinaryValue
WARNING: [autosummary] failed to import 'pyarrow.BooleanArray': no module named 
pyarrow.BooleanArray
WARNING: [autosummary] failed to import 'pyarrow.BooleanValue': no module named 
pyarrow.BooleanValue
WARNING: [autosummary] failed to import 'pyarrow.Buffer': no module named 
pyarrow.Buffer

{code}





[jira] [Commented] (ARROW-1309) pyarrow.lib.ArrowNotImplementedError: NotImplemented: null

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111899#comment-16111899
 ] 

Wes McKinney commented on ARROW-1309:
-

Thanks [~virtualluke]. Any chance you can show the input data that triggered 
this error? There should be a single column in the data frame that is causing 
the problem (it's getting passed to {{pyarrow.Array.from_pandas}}).

If it's not possible to fix this immediately, we would definitely want to make 
the error message more informative than that.

> pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
> --
>
> Key: ARROW-1309
> URL: https://issues.apache.org/jira/browse/ARROW-1309
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: centos 7.3
>Reporter: Luke Higgins
>Priority: Minor
> Fix For: 0.6.0
>
>
> I have an avro file in hdfs that I am reading in using fastavro, converting 
> to a pandas dataframe and then trying to create an arrow table, but I get an 
> error:
> >>> table=pyarrow.Table.from_pandas(my_dataframe)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 746, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:34089)
>   File "pyarrow/table.pxi", line 346, in pyarrow.lib._dataframe_to_arrays 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:30476)
>   File "pyarrow/array.pxi", line 182, in pyarrow.lib.Array.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:22110)
>   File "pyarrow/error.pxi", line 66, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:7702)
> pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
> The avro schema does allow null fields.  Is this not implemented?  I 
> am using pyarrow 0.5.0.  Also, for what I am doing I am not using pandas at 
> all: I just read in the avro, I have a list of dicts, and I really want to 
> write them to disk in parquet format. I am utilizing these steps (which 
> aren't optimal but may be necessary without writing more code of my own).
> thanks,
> Luke





[jira] [Updated] (ARROW-1309) pyarrow.lib.ArrowNotImplementedError: NotImplemented: null

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1309:

Fix Version/s: 0.6.0

> pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
> --
>
> Key: ARROW-1309
> URL: https://issues.apache.org/jira/browse/ARROW-1309
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: centos 7.3
>Reporter: Luke Higgins
>Priority: Minor
> Fix For: 0.6.0
>
>
> I have an avro file in hdfs that I am reading in using fastavro, converting 
> to a pandas dataframe and then trying to create an arrow table, but I get an 
> error:
> >>> table=pyarrow.Table.from_pandas(my_dataframe)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 746, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:34089)
>   File "pyarrow/table.pxi", line 346, in pyarrow.lib._dataframe_to_arrays 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:30476)
>   File "pyarrow/array.pxi", line 182, in pyarrow.lib.Array.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:22110)
>   File "pyarrow/error.pxi", line 66, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:7702)
> pyarrow.lib.ArrowNotImplementedError: NotImplemented: null
> The avro schema does allow null fields.  Is this not implemented?  I 
> am using pyarrow 0.5.0.  Also, for what I am doing I am not using pandas at 
> all: I just read in the avro, I have a list of dicts, and I really want to 
> write them to disk in parquet format. I am utilizing these steps (which 
> aren't optimal but may be necessary without writing more code of my own).
> thanks,
> Luke
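
The report above mentions starting from a list of dicts rather than a pandas
dataframe. As a rough sketch of the first step such a conversion needs, the
records can be pivoted into per-column lists, with missing keys filled in as
None. The helper name records_to_columns and the sample records are
hypothetical and not part of pyarrow; whether pyarrow 0.5.0 accepts the
resulting columns (in particular ones holding only nulls) is exactly what
this issue asks, so no pyarrow call is shown.

```python
def records_to_columns(records):
    """Pivot a list of dicts into a dict of equal-length column lists.

    Keys that are missing from a record are filled with None, so nullable
    fields survive the pivot. (Hypothetical helper, for illustration only.)
    """
    # Collect field names in first-seen order.
    names = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                names.append(key)
    # Build one list per field, in record order.
    return {name: [record.get(name) for record in records] for name in names}


# Made-up sample records with an explicit null and a missing field.
records = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3},
]
columns = records_to_columns(records)
print(columns)
```

Each column list has one entry per record, which is the shape a columnar
writer generally expects.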





[jira] [Updated] (ARROW-1319) [Python] Add additional HDFS filesystem methods

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1319:

Fix Version/s: 0.7.0

> [Python] Add additional HDFS filesystem methods
> ---
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
> Fix For: 0.7.0
>
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
> contains a wider set of file-system methods than arrow's python bindings. 
> These are probably simple to implement for arrow-hdfs.





[jira] [Assigned] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Clark Fitzgerald (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Clark Fitzgerald reassigned ARROW-1325:
---

Assignee: Clark Fitzgerald

> R language bindings for Arrow
> -
>
> Key: ARROW-1325
> URL: https://issues.apache.org/jira/browse/ARROW-1325
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Clark Fitzgerald
>Assignee: Clark Fitzgerald
>
> The R language was designed to perform "Columnar in memory analytics". The 
> Arrow format could provide better compatibility between R and other big data 
> systems, as well as portable and efficient IO via Parquet.
> Feather provides a starting point: 
> [https://github.com/wesm/feather/tree/master/R].
> This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Commented] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111881#comment-16111881
 ] 

Clark Fitzgerald commented on ARROW-1325:
-

Thanks!

> R language bindings for Arrow
> -
>
> Key: ARROW-1325
> URL: https://issues.apache.org/jira/browse/ARROW-1325
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Clark Fitzgerald
>
> The R language was designed to perform "Columnar in memory analytics". The 
> Arrow format could provide better compatibility between R and other big data 
> systems, as well as portable and efficient IO via Parquet.
> Feather provides a starting point: 
> [https://github.com/wesm/feather/tree/master/R].
> This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Updated] (ARROW-835) [Format] Add Timedelta type to describe time intervals

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-835:
---
Fix Version/s: (was: 0.6.0)
   0.7.0

> [Format] Add Timedelta type to describe time intervals
> --
>
> Key: ARROW-835
> URL: https://issues.apache.org/jira/browse/ARROW-835
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jeff Reback
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.7.0
>
>
> xref https://github.com/apache/arrow/pull/551 and 
> https://github.com/apache/arrow/pull/551#issuecomment-294325969
> this will allow round-tripping of pandas ``Timedelta`` and numpy 
> ``timedelta64[ns]`` types. This will have a similar TimeUnit to TimestampType 
> (s, us, ms, ns). Possible implementations include making this pure 64-bit.
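
To make the "pure 64-bit" idea above concrete, here is a minimal sketch that
encodes a Python timedelta as a signed 64-bit tick count in one of the four
units. It is not the Arrow implementation; the function names and the
overflow check are assumptions chosen for illustration.

```python
from datetime import timedelta

# Ticks per second for each unit named in the issue.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def timedelta_to_int64(delta, unit):
    """Encode a timedelta as a signed 64-bit tick count in the given unit."""
    # Work in whole microseconds (timedelta's own resolution) to avoid floats.
    total_us = (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds
    ticks_per_second = TICKS_PER_SECOND[unit]
    if ticks_per_second >= 10**6:
        ticks = total_us * (ticks_per_second // 10**6)
    else:
        # Coarser units drop the remainder (floor division).
        ticks = total_us // (10**6 // ticks_per_second)
    if not INT64_MIN <= ticks <= INT64_MAX:
        raise OverflowError("value does not fit in 64 bits for unit %r" % unit)
    return ticks


def int64_to_timedelta(ticks, unit):
    """Decode a tick count back to a timedelta (sub-microsecond ticks are floored)."""
    ticks_per_second = TICKS_PER_SECOND[unit]
    if ticks_per_second >= 10**6:
        total_us = ticks // (ticks_per_second // 10**6)
    else:
        total_us = ticks * (10**6 // ticks_per_second)
    return timedelta(microseconds=total_us)


delta = timedelta(days=1, microseconds=5)
print(timedelta_to_int64(delta, "ns"))
print(int64_to_timedelta(timedelta_to_int64(delta, "us"), "us") == delta)
```

One consequence of a fixed 64-bit width is that the representable range
shrinks as the unit gets finer, which is why the sketch raises on overflow
instead of wrapping.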





[jira] [Resolved] (ARROW-1323) [GLib] Add garrow_boolean_array_get_values()

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1323.
-
Resolution: Fixed

Issue resolved by pull request 934
[https://github.com/apache/arrow/pull/934]

> [GLib] Add garrow_boolean_array_get_values()
> 
>
> Key: ARROW-1323
> URL: https://issues.apache.org/jira/browse/ARROW-1323
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
> Fix For: 0.6.0
>
>






[jira] [Resolved] (ARROW-1315) [GLib] Status check of arrow::ArrayBuilder::Finish() is missing

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1315.
-
Resolution: Fixed

Issue resolved by pull request 933
[https://github.com/apache/arrow/pull/933]

> [GLib] Status check of arrow::ArrayBuilder::Finish() is missing
> ---
>
> Key: ARROW-1315
> URL: https://issues.apache.org/jira/browse/ARROW-1315
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
> Fix For: 0.6.0
>
>






[jira] [Commented] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111837#comment-16111837
 ] 

Wes McKinney commented on ARROW-1312:
-

PR: https://github.com/apache/arrow/pull/935

> [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
> -
>
> Key: ARROW-1312
> URL: https://issues.apache.org/jira/browse/ARROW-1312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is 
> not good for users





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111836#comment-16111836
 ] 

Wes McKinney commented on ARROW-1311:
-

Cool, thank you! And very sorry about the trouble. We would have learned about 
these problems with jemalloc earlier, but we only made it the default allocator 
in 0.5.0. It's good to know, so we can work with the jemalloc developers to 
figure out what's wrong.

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111832#comment-16111832
 ] 

Wes McKinney commented on ARROW-1325:
-

Done. I also made you a Contributor so you can assign yourself issues in JIRA

> R language bindings for Arrow
> -
>
> Key: ARROW-1325
> URL: https://issues.apache.org/jira/browse/ARROW-1325
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Clark Fitzgerald
>
> The R language was designed to perform "Columnar in memory analytics". The 
> Arrow format could provide better compatibility between R and other big data 
> systems, as well as portable and efficient IO via Parquet.
> Feather provides a starting point: 
> [https://github.com/wesm/feather/tree/master/R].
> This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Updated] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1325:

Component/s: R

> R language bindings for Arrow
> -
>
> Key: ARROW-1325
> URL: https://issues.apache.org/jira/browse/ARROW-1325
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Clark Fitzgerald
>
> The R language was designed to perform "Columnar in memory analytics". The 
> Arrow format could provide better compatibility between R and other big data 
> systems, as well as portable and efficient IO via Parquet.
> Feather provides a starting point: 
> [https://github.com/wesm/feather/tree/master/R].
> This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111831#comment-16111831
 ] 

Keith Curtis commented on ARROW-1311:
-

I re-ran my script with pyarrow-0.5.0.post2; that seems to have fixed it. My script 
ran smoothly, converting 22 csv files to parquet format. Thanks!

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Clark Fitzgerald (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111819#comment-16111819
 ] 

Clark Fitzgerald commented on ARROW-1325:
-

It would be nice to have an "R" component to categorize these issues.

> R language bindings for Arrow
> -
>
> Key: ARROW-1325
> URL: https://issues.apache.org/jira/browse/ARROW-1325
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Clark Fitzgerald
>
> The R language was designed to perform "Columnar in memory analytics". The 
> Arrow format could provide better compatibility between R and other big data 
> systems, as well as portable and efficient IO via Parquet.
> Feather provides a starting point: 
> [https://github.com/wesm/feather/tree/master/R].
> This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111815#comment-16111815
 ] 

Keith Curtis commented on ARROW-1311:
-

Ok, I'll re-try with post2

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Updated] (ARROW-989) [Python] Write pyarrow.Table to FileWriter or StreamWriter

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-989:
---
Fix Version/s: (was: 0.6.0)
   0.7.0

> [Python] Write pyarrow.Table to FileWriter or StreamWriter
> --
>
> Key: ARROW-989
> URL: https://issues.apache.org/jira/browse/ARROW-989
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> As part of this, we need to be able to get an iterator of record batches from 
> a table. We may want to write this iteration logic in C++ as it will be 
> generally useful. The chunking between columns may be different, so there is 
> some amount of complexity there
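
The complexity mentioned above comes from columns whose chunk boundaries do
not line up. A minimal pure-Python sketch of one way to handle it: cut a
batch at every chunk boundary of any column, so each batch lies inside a
single chunk of every column. Plain lists stand in for Arrow chunks here, and
the name iter_batches is hypothetical, not a pyarrow or Arrow C++ API.

```python
def iter_batches(chunked_columns):
    """Yield dicts of equal-length slices, cutting at every chunk boundary.

    chunked_columns maps a column name to a list of chunks (plain lists).
    """
    names = list(chunked_columns)
    totals = {n: sum(len(c) for c in chunked_columns[n]) for n in names}
    if len(set(totals.values())) > 1:
        raise ValueError("columns have different total lengths")
    remaining = next(iter(totals.values()), 0)
    # Per-column cursor: [index of current chunk, offset inside that chunk].
    cursors = {n: [0, 0] for n in names}
    while remaining > 0:
        # Move past chunks that are exhausted (or empty).
        for n in names:
            cur = cursors[n]
            while cur[1] == len(chunked_columns[n][cur[0]]):
                cur[0] += 1
                cur[1] = 0
        # The batch is as long as the shortest run left in any current chunk.
        length = min(
            len(chunked_columns[n][cursors[n][0]]) - cursors[n][1] for n in names
        )
        batch = {}
        for n in names:
            idx, off = cursors[n]
            batch[n] = chunked_columns[n][idx][off:off + length]
            cursors[n][1] = off + length
        yield batch
        remaining -= length


# Two columns of five values each, chunked differently.
table = {
    "a": [[1, 2, 3], [4, 5]],
    "b": [[10], [20, 30, 40, 50]],
}
batches = list(iter_batches(table))
for batch in batches:
    print(batch)
```

Cutting at every boundary avoids copying values across chunks, at the cost
of producing more, smaller batches when the chunkings disagree a lot.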





[jira] [Assigned] (ARROW-1312) [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1312:
---

Assignee: Wes McKinney

> [C++] Set default value to ARROW_JEMALLOC to OFF until ARROW-1282 is resolved
> -
>
> Key: ARROW-1312
> URL: https://issues.apache.org/jira/browse/ARROW-1312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> The current failure mode when there is an issue (ARROW-1282, ARROW-1311) is 
> not good for users





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111813#comment-16111813
 ] 

Keith Curtis commented on ARROW-1311:
-

Hi,  I think I have the updated one:
  $ pip install --upgrade  pyarrow==0.5.*
Collecting pyarrow==0.5.*
  Downloading pyarrow-0.5.0.post1-cp35-cp35m-manylinux1_x86_64.whl (8.9MB)
  ...

I re-ran my script, but python appeared to hang, and the stack trace looks 
similar:

#0  je_spin_adaptive (spin=) at 
include/jemalloc/internal/spin.h:40
#1  chunk_dss_max_update (new_addr=) at src/chunk_dss.c:83
#2  je_chunk_alloc_dss (tsdn=tsdn@entry=0x7f6d609ab620, 
arena=arena@entry=0x7f6ca8800140, new_addr=new_addr@entry=0x7f6c3300, 
size=size@entry=8388608, 
alignment=alignment@entry=2097152, zero=zero@entry=0x7fff45db9850, 
commit=commit@entry=0x7fff45db97a0) at src/chunk_dss.c:122
#3  0x7f6ca92bb02f in chunk_alloc_core (dss_prec=dss_prec_secondary, 
commit=0x7fff45db97a0, zero=0x7fff45db9850, alignment=2097152, size=8388608, 
new_addr=0x7f6c3300, 
arena=0x7f6ca8800140, tsdn=0x7f6d609ab620) at src/chunk.c:357
#4  chunk_alloc_default_impl (commit=0x7fff45db97a0, zero=0x7fff45db9850, 
alignment=2097152, size=8388608, new_addr=0x7f6c3300, arena=0x7f6ca8800140, 
tsdn=0x7f6d609ab620)
at src/chunk.c:430
#5  je_chunk_alloc_wrapper (tsdn=tsdn@entry=0x7f6d609ab620, 
arena=arena@entry=0x7f6ca8800140, chunk_hooks=chunk_hooks@entry=0x7fff45db97c0, 
new_addr=new_addr@entry=0x7f6c3300, 
size=size@entry=8388608, alignment=2097152, sn=sn@entry=0x7fff45db97b0, 
zero=zero@entry=0x7fff45db9850, commit=commit@entry=0x7fff45db97a0) at 
src/chunk.c:490
 ...


> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111811#comment-16111811
 ] 

Wes McKinney commented on ARROW-1311:
-

Should be all set now with 0.5.0.post2

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111798#comment-16111798
 ] 

Wes McKinney commented on ARROW-1282:
-

I made a mistake in the build settings, will post a new set of binaries within 
a half hour or so. 

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111796#comment-16111796
 ] 

Wes McKinney commented on ARROW-1311:
-

Actually, I made a mistake in the build, and need to post another one, hang on 
for a few minutes.

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Comment Edited] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111793#comment-16111793
 ] 

Wes McKinney edited comment on ARROW-1282 at 8/2/17 10:01 PM:
--

[~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because 
PyPI does not allow new builds with the same version number). If you install 
with

{{pip install pyarrow==0.5.*}}

then this issue should go away. Please let me know if not


was (Author: wesmckinn):
[~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because 
PyPI does not allow new builds with the same version number). If you install with

{{pip install pyarrow==0.5.*}}

then this issue should go away. Please let me know if not

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111793#comment-16111793
 ] 

Wes McKinney commented on ARROW-1282:
-

[~chrisb1] I have just posted a patched build on PyPI (0.5.0.post1, because 
PyPI does not allow new builds with the same version number). If you install with

{{pip install pyarrow==0.5.*}}

then this issue should go away. Please let me know if not

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Comment Edited] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111787#comment-16111787
 ] 

Keith Curtis edited comment on ARROW-1311 at 8/2/17 10:00 PM:
--

I re-ran my code, and have a revised function, where I added a line to update 
the column, which seems to matter.

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

When Python seemed hung (after 3 minutes with no progress), I captured a stack 
trace with gdb, and attached the file

I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment 
using pip.



was (Author: k94):
I re-ran my code, and have a revised function

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

When Python seemed hung (after 3 minutes with no progress), I captured a stack 
trace with gdb, and attached the file

I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment 
using pip.


> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Resolved] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1311.
-
Resolution: Duplicate
  Assignee: Wes McKinney

Same issue as ARROW-1282

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111790#comment-16111790
 ] 

Wes McKinney commented on ARROW-1311:
-

Thanks, indeed this is ARROW-1282. I'm in the process of updating 0.5.0 
binaries to disable the jemalloc allocator. 

If you are using pip, can you try {{pip install pyarrow==0.5.*}} which should 
pull the {{0.5.0.post1}} updated build? If you are using conda, it will take me 
a little while to update the binaries on conda-forge.

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111787#comment-16111787
 ] 

Keith Curtis commented on ARROW-1311:
-

I re-ran my code, and have a revised function

def to_parquet(output_file, csv_file):
    df = pd.read_csv(csv_file)
    df['gecco_variant'] = [v.lstrip('0') for v in df['gecco_variant']]
    table = pyarrow.Table.from_pandas(df)
    pq.write_table(table, output_file)

When Python seemed hung (after 3 minutes with no progress), I captured a stack 
trace with gdb, and attached the file

I'm running on Ubuntu 14.04.3. I installed into a conda virtual environment 
using pip.
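
As a complement to attaching gdb by hand, a hang like the one described above
can also be flagged from inside the script. The sketch below uses the
standard-library faulthandler module to dump the Python-level stacks of all
threads if a call is still running after a timeout. The wrapper name
run_with_watchdog is hypothetical, and this shows only Python frames: native
frames such as those inside jemalloc would still need a tool like gdb.

```python
import faulthandler
import tempfile


def run_with_watchdog(func, timeout_seconds, log_file):
    """Run func(); if it is still running after timeout_seconds, dump the
    Python stacks of all threads to log_file. The process keeps running."""
    # log_file must be a real file object, since faulthandler writes to its
    # file descriptor.
    faulthandler.dump_traceback_later(timeout_seconds, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Disarm the timer once func has returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Stand-in workload that finishes immediately; a real script would pass
# something like: lambda: to_parquet(output_file, csv_file)
with tempfile.TemporaryFile(mode="w+") as log:
    result = run_with_watchdog(lambda: 42, 180, log)
print(result)
```

The 180-second timeout mirrors the "3 minutes with no progress" mentioned in
the comment; any value longer than a normal conversion would do.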


> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Created] (ARROW-1325) R language bindings for Arrow

2017-08-02 Thread Clark Fitzgerald (JIRA)
Clark Fitzgerald created ARROW-1325:
---

 Summary: R language bindings for Arrow
 Key: ARROW-1325
 URL: https://issues.apache.org/jira/browse/ARROW-1325
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Clark Fitzgerald


The R language was designed to perform "Columnar in memory analytics". The 
Arrow format could provide better compatibility between R and other big data 
systems, as well as portable and efficient IO via Parquet.

Feather provides a starting point: 
[https://github.com/wesm/feather/tree/master/R].

This can serve as an umbrella JIRA for work on R related tasks.





[jira] [Updated] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Keith Curtis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Curtis updated ARROW-1311:

Attachment: backtrace.txt

Stack trace from gdb when Python appeared to be hung.  

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
> Fix For: 0.6.0
>
> Attachments: backtrace.txt
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111615#comment-16111615
 ] 

Wes McKinney commented on ARROW-1292:
-

This work will be ongoing over the next couple of releases.

> [C++/Python] Expand libhdfs feature coverage
> 
>
> Key: ARROW-1292
> URL: https://issues.apache.org/jira/browse/ARROW-1292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> Umbrella JIRA. Will create child issues for more granular tasks





[jira] [Updated] (ARROW-1292) [C++/Python] Expand libhdfs feature coverage

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1292:

Fix Version/s: (was: 0.6.0)
   0.7.0

> [C++/Python] Expand libhdfs feature coverage
> 
>
> Key: ARROW-1292
> URL: https://issues.apache.org/jira/browse/ARROW-1292
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> Umbrella JIRA. Will create child issues for more granular tasks





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111614#comment-16111614
 ] 

Wes McKinney commented on ARROW-1282:
-

Moving this issue to 0.7.0 as it doesn't seem likely the underlying cause will 
be resolved in time for 0.6.0. I created ARROW-1312 to switch off the allocator 
by default to triage the situation. 

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Updated] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1282:

Fix Version/s: (was: 0.6.0)
   0.7.0

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.7.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Assigned] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-08-02 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud reassigned ARROW-786:
---

Assignee: Phillip Cloud

> [Format] In-memory format for 128-bit Decimals, handling of sign bit
> 
>
> Key: ARROW-786
> URL: https://issues.apache.org/jira/browse/ARROW-786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
>Assignee: Phillip Cloud
> Fix For: 0.7.0
>
>
> cc [~cpcloud]
> We found in ARROW-655 that we needed to add an extra bit for signedness for 
> decimals stored as 128-bit values to be able to use the Boost multiprecision 
> libraries. This makes Decimal128 not fit completely neatly as a 16-byte fixed 
> size binary value, and more of a {{struct fixed_size_binary(16)>}}. What is 
> the current format in the Java 
> implementation? We will need to document the memory layout for decimals that 
> maximizes compatibility across languages and eventually implement integration 
> tests for IPC. 





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111541#comment-16111541
 ] 

Wes McKinney commented on ARROW-1282:
-

OK. I'm going to get to work putting up patched 0.5.0 builds on PyPI and 
conda-forge, since letting these issues persist is not acceptable. We should 
still figure out what is happening in jemalloc to cause this, but it may take a 
little while.

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.6.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Commented] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-08-02 Thread Chris Bartak (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111371#comment-16111371
 ] 

Chris Bartak commented on ARROW-1282:
-

I've run into a problem that I assume has to be this issue. Unfortunately, I 
can't quite get it down to a reproducible example, but I'll share the context 
in case it's helpful.

ec2 2017.03 Amazon Linux AMI  (Red Hat 4.8.3-9)
python3.4
pyarrow==0.5  (from pip + deps)

Reading a very small parquet file (3 kB, two text columns) interactively seems 
to always work. Serving a webapp with `httpd`/`mod_wsgi` and Flask that reads 
the same file almost always (but not always!) hangs completely, with no spike 
in CPU or memory.

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
> Fix For: 0.6.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Commented] (ARROW-1319) [Python] Add additional HDFS filesystem methods

2017-08-02 Thread Martin Durant (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111211#comment-16111211
 ] 

Martin Durant commented on ARROW-1319:
--

Methods that I don't think exist, and some that have different names (and may 
already be aliased). Not all are "basic filesystem" operations.
 'delegate_token'
 'disconnect' (maybe not required)
 'get' = download
 'get_block_locations'
 'getmerge' (many remote files to one local file)
 'glob'
 'head'
 'makedirs'
 'mv' = rename
 'put' = upload
 'read_block' (delimited read)
 'renew_token'
 'rm' = delete
 'set_replication'
 'tail'
 'touch'

On files: readlines/iteration (maybe better with io.TextIOWrapper); flush?; not 
sure if all standard file methods are there (readable, read1...).

Methods implemented in unreleased hdfs3:
 'cancel_token'
 'concat' (limited to whole blocks for hadoop 1.6)
 'create_encryption_zone'
 'list_encryption_zones'
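
To illustrate how small some of these are, 'head' and 'tail' can be written 
against any seekable binary file object. This is a sketch only; it assumes the 
file object supports seek(), tell() and read() like a standard Python file, and 
it is neither the hdfs3 implementation nor an existing arrow API.

```python
import io
import os


def head(fileobj, size=1024):
    """Return the first `size` bytes of a seekable binary file object."""
    fileobj.seek(0)
    return fileobj.read(size)


def tail(fileobj, size=1024):
    """Return the last `size` bytes of a seekable binary file object."""
    fileobj.seek(0, os.SEEK_END)
    end = fileobj.tell()  # total length in bytes
    # Clamp at 0 so files shorter than `size` are returned whole.
    fileobj.seek(max(end - size, 0))
    return fileobj.read(size)


# An in-memory buffer stands in for a remote file here.
buf = io.BytesIO(b"0123456789")
print(head(buf, 3))  # b'012'
print(tail(buf, 3))  # b'789'
```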


> [Python] Add additional HDFS filesystem methods
> ---
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
> contains a wider set of file-system methods than arrow's python bindings. 
> These are probably simple to implement for arrow-hdfs.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1690#comment-1690
 ] 

Wes McKinney commented on ARROW-1311:
-

We could release patched builds on PyPI, but there is the performance 
regression ARROW-1290. I may update 0.5.0 on conda-forge to include this patch 
and disable jemalloc for now.

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
> Fix For: 0.6.0
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Commented] (ARROW-1311) python hangs after write a few parquet tables

2017-08-02 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1656#comment-1656
 ] 

Uwe L. Korn commented on ARROW-1311:


[~wesmckinn] We should simply disable {{jemalloc}} by default until these 
problems have been resolved. I will try to reproduce locally and then talk to 
the jemalloc people to get it fixed upstream.

> python hangs after write a few parquet tables
> -
>
> Key: ARROW-1311
> URL: https://issues.apache.org/jira/browse/ARROW-1311
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.5.0
> Environment: Python 3.5.2, pyarrow 0.5.0
>Reporter: Keith Curtis
> Fix For: 0.6.0
>
>
> I had a program to read some csv files (a few million rows each, 9 columns), 
> and converted with:
> ```python
> import os
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow
> def to_parquet(output_file, csv_file):
>     df = pd.read_csv(csv_file)
>     table = pyarrow.Table.from_pandas(df)
>     pq.write_table(table, output_file)
> ```
> The first csv file would always complete, but python would hang on the second 
> or third file, and sometimes on a much later file.





[jira] [Resolved] (ARROW-1211) [C++] Consider making default_memory_pool() the default for builder classes

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1211.
-
Resolution: Fixed

Resolved by 
https://github.com/apache/arrow/commit/ee928d2233da89ebd1f567ffda4833f4f07e795c

> [C++] Consider making default_memory_pool() the default for builder classes
> ---
>
> Key: ARROW-1211
> URL: https://issues.apache.org/jira/browse/ARROW-1211
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: 0.6.0
>
>
> To make this work, we would also need to make {{MemoryPool*}} the last 
> argument in some of the builder constructors. @xhochy what do you think?
> see also ARROW-1210





[jira] [Resolved] (ARROW-1305) [GLib] Add GArrowIntArrayBuilder

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1305.
-
Resolution: Fixed

Issue resolved by pull request 928
[https://github.com/apache/arrow/pull/928]

> [GLib] Add GArrowIntArrayBuilder
> 
>
> Key: ARROW-1305
> URL: https://issues.apache.org/jira/browse/ARROW-1305
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
> Fix For: 0.6.0
>
>






[jira] [Created] (ARROW-1324) [C++] Support ExternalProject build of required Boost components on MSVC

2017-08-02 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1324:
---

 Summary: [C++] Support ExternalProject build of required Boost 
components on MSVC
 Key: ARROW-1324
 URL: https://issues.apache.org/jira/browse/ARROW-1324
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney


Follow up to ARROW-1303





[jira] [Resolved] (ARROW-1303) [C++] Support downloading Boost

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1303.
-
Resolution: Fixed

Issue resolved by pull request 927
[https://github.com/apache/arrow/pull/927]

> [C++] Support downloading Boost
> ---
>
> Key: ARROW-1303
> URL: https://issues.apache.org/jira/browse/ARROW-1303
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (ARROW-1262) [Packaging] Packaging automation in arrow-dist

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1617#comment-1617
 ] 

Wes McKinney commented on ARROW-1262:
-

Marked for 0.7.0. Don't think this will be completed in time for 0.6.0

> [Packaging] Packaging automation in arrow-dist
> --
>
> Key: ARROW-1262
> URL: https://issues.apache.org/jira/browse/ARROW-1262
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>
> This JIRA is an umbrella JIRA for tasks to streamline our binary builds at 
> release time as much as possible. We may also be able to set up nightly 
> builds for testing





[jira] [Updated] (ARROW-1262) [Packaging] Packaging automation in arrow-dist

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1262:

Fix Version/s: 0.7.0

> [Packaging] Packaging automation in arrow-dist
> --
>
> Key: ARROW-1262
> URL: https://issues.apache.org/jira/browse/ARROW-1262
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>
> This JIRA is an umbrella JIRA for tasks to streamline our binary builds at 
> release time as much as possible. We may also be able to set up nightly 
> builds for testing





[jira] [Commented] (ARROW-352) Interval(DAY_TIME) has no unit

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1615#comment-1615
 ] 

Wes McKinney commented on ARROW-352:


Moving off 0.6.0 as this will require some discussion

> Interval(DAY_TIME) has no unit
> --
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Reporter: Julien Le Dem
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> Interval(DAY_TIME) assumes milliseconds.
> We should have a time unit, as timestamp does.





[jira] [Updated] (ARROW-352) Interval(DAY_TIME) has no unit

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-352:
---
Fix Version/s: (was: 0.6.0)
   0.7.0

> Interval(DAY_TIME) has no unit
> --
>
> Key: ARROW-352
> URL: https://issues.apache.org/jira/browse/ARROW-352
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Format
>Reporter: Julien Le Dem
>Assignee: Wes McKinney
> Fix For: 0.7.0
>
>
> Interval(DAY_TIME) assumes milliseconds.
> We should have a time unit, as timestamp does.





[jira] [Commented] (ARROW-1234) [Java] publishing nightly snapshot java artifacts to maven repo

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1605#comment-1605
 ] 

Wes McKinney commented on ARROW-1234:
-

I believe you need to be a PMC member or committer to set this up. 

> [Java] publishing nightly snapshot java artifacts to maven repo
> ---
>
> Key: ARROW-1234
> URL: https://issues.apache.org/jira/browse/ARROW-1234
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Affects Versions: 0.5.0, 0.6.0, 1.0.0
> Environment: CI
>Reporter: Antony Mayi
> Attachments: arrow_development_deploy.xml
>
>
> The [Snapshot 
> repository|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/]
>  doesn't seem to be getting any recent snapshot builds. Could this be 
> established for the sake of easier integration?





[jira] [Commented] (ARROW-1234) [Java] publishing nightly snapshot java artifacts to maven repo

2017-08-02 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111099#comment-16111099
 ] 

Li Jin commented on ARROW-1234:
---

I was trying to figure out the permission issues, such as which account has 
permission to publish to the [ASF 
repo|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/] 
and which account has permission to access ASF Jenkins to set up the job. Maybe 
[~julienledem] can shed some light?

> [Java] publishing nightly snapshot java artifacts to maven repo
> ---
>
> Key: ARROW-1234
> URL: https://issues.apache.org/jira/browse/ARROW-1234
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Affects Versions: 0.5.0, 0.6.0, 1.0.0
> Environment: CI
>Reporter: Antony Mayi
> Attachments: arrow_development_deploy.xml
>
>
> The [Snapshot 
> repository|https://repository.apache.org/content/groups/snapshots/org/apache/arrow/]
>  doesn't seem to be getting any recent snapshot builds. Could this be 
> established for the sake of easier integration?





[jira] [Commented] (ARROW-1317) [Python] Add function to set Hadoop CLASSPATH

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111085#comment-16111085
 ] 

Wes McKinney commented on ARROW-1317:
-

My understanding is that you can set {{CLASSPATH}} in {{os.environ}} prior to 
JNI bootstrap. A patch would be welcome.
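
A rough sketch of what such a helper might look like is below. It assumes the 
`hadoop` launcher script is on the PATH and supports `hadoop classpath --glob`; 
the function names are hypothetical and nothing like this exists in pyarrow 
yet. It has to run before the first HDFS connection is made, since the JVM 
reads CLASSPATH when it starts.

```python
import os
import subprocess


def hadoop_classpath(run=subprocess.check_output):
    """Return the classpath reported by the `hadoop` launcher script."""
    # --glob expands wildcard entries into individual jar paths.
    out = run(["hadoop", "classpath", "--glob"])
    return out.decode("utf-8").strip()


def set_hadoop_classpath(environ=os.environ, run=subprocess.check_output):
    """Prepend the Hadoop classpath to CLASSPATH and return the new value."""
    entries = [hadoop_classpath(run)]
    existing = environ.get("CLASSPATH")
    if existing:
        # Keep whatever the user had already configured.
        entries.append(existing)
    environ["CLASSPATH"] = os.pathsep.join(entries)
    return environ["CLASSPATH"]
```

The `run` and `environ` parameters exist only so the helper can be exercised 
without a Hadoop install; in normal use both defaults would be left alone.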

> [Python] Add function to set Hadoop CLASSPATH 
> --
>
> Key: ARROW-1317
> URL: https://issues.apache.org/jira/browse/ARROW-1317
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> Getting access to hdfs via libhdfs requires the setting of several 
> environment variables. 
> Many of these paths should be auto-detectable requiring less or perhaps even 
> no information from the user. This would lower the access barrier to hdfs for 
> a non-dev user.





[jira] [Created] (ARROW-1323) [GLib] Add garrow_boolean_array_get_values()

2017-08-02 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-1323:
---

 Summary: [GLib] Add garrow_boolean_array_get_values()
 Key: ARROW-1323
 URL: https://issues.apache.org/jira/browse/ARROW-1323
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Priority: Minor
 Fix For: 0.6.0








[jira] [Updated] (ARROW-1317) [Python] Add function to set Hadoop CLASSPATH

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1317:

Summary: [Python] Add function to set Hadoop CLASSPATH   (was: hdfs 
environment variables)

> [Python] Add function to set Hadoop CLASSPATH 
> --
>
> Key: ARROW-1317
> URL: https://issues.apache.org/jira/browse/ARROW-1317
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> Getting access to hdfs via libhdfs requires the setting of several 
> environment variables. 
> Many of these paths should be auto-detectable requiring less or perhaps even 
> no information from the user. This would lower the access barrier to hdfs for 
> a non-dev user.





[jira] [Commented] (ARROW-1319) [Python] Add additional HDFS filesystem methods

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111084#comment-16111084
 ] 

Wes McKinney commented on ARROW-1319:
-

Quite a few of them were added in ARROW-1301. Can you make a list of which 
additional ones are needed (that are not accounted for by other JIRAs already)?

> [Python] Add additional HDFS filesystem methods
> ---
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
> contains a wider set of file-system methods than arrow's python bindings. 
> These are probably simple to implement for arrow-hdfs.





[jira] [Updated] (ARROW-1319) [Python] Add additional HDFS filesystem methods

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1319:

Summary: [Python] Add additional HDFS filesystem methods  (was: hdfs 
methods)

> [Python] Add additional HDFS filesystem methods
> ---
>
> Key: ARROW-1319
> URL: https://issues.apache.org/jira/browse/ARROW-1319
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
> contains a wider set of file-system methods than arrow's python bindings. 
> These are probably simple to implement for arrow-hdfs.





[jira] [Created] (ARROW-1322) hdfs: encryption-at-rest and secure transport

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1322:


 Summary: hdfs: encryption-at-rest and secure transport
 Key: ARROW-1322
 URL: https://issues.apache.org/jira/browse/ARROW-1322
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Martin Durant


HDFS provides for encrypted data transfer and encryption of data on-disc (e.g., 
via KMS records). It would be nice to see these available within arrow-hdfs.





[jira] [Updated] (ARROW-1318) [C++] hdfs access with auth

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1318:

Summary: [C++] hdfs access with auth  (was: hdfs access with auth)

> [C++] hdfs access with auth
> ---
>
> Key: ARROW-1318
> URL: https://issues.apache.org/jira/browse/ARROW-1318
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++
>Reporter: Martin Durant
>
> A wide variety of authentication schemes are available in hadoop.
> This issue is to track whether libhdfs can successfully operate with them. 
> The list includes:
> - user/password
> - basic kerberos (via kinit and via keytabs)
> - kerberos with active directory and single-sign-on
> - "privacy" and "integrity" modes
> - access with hdfs delegation token
> - probably others...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1318) [C++] hdfs access with auth

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1318:

Component/s: C++

> [C++] hdfs access with auth
> ---
>
> Key: ARROW-1318
> URL: https://issues.apache.org/jira/browse/ARROW-1318
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++
>Reporter: Martin Durant
>
> A wide variety of authentication schemes are available in hadoop.
> This issue is to track whether libhdfs can successfully operate with them. 
> The list includes:
> - user/password
> - basic kerberos (via kinit and via keytabs)
> - kerberos with active directory and single-sign-on
> - "privacy" and "integrity" modes
> - access with hdfs delegation token
> - probably others...





[jira] [Created] (ARROW-1321) hdfs delegation token functions

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1321:


 Summary: hdfs delegation token functions
 Key: ARROW-1321
 URL: https://issues.apache.org/jira/browse/ARROW-1321
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Martin Durant


HDFS can create delegation tokens for an authenticated user, so that access to 
the file-system from other processes/machines can authenticate as that same 
user without having to use third-party identity systems (kerberos, etc.).

arrow-hdfs should provide the ability to accept, create, renew and cancel 
delegation tokens.





[jira] [Closed] (ARROW-1320) hdfs block locations

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-1320.
---
Resolution: Duplicate

Duplicate of ARROW-473

> hdfs block locations
> 
>
> Key: ARROW-1320
> URL: https://issues.apache.org/jira/browse/ARROW-1320
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Martin Durant
>
> To provide a function which can return the set of machines on which the data 
> blocks of a given hdfs file are stored. This is best for scheduling systems 
> (e.g., dask) which can move the computation to the machine which has the 
> data, and so cut out network data traffic.





[jira] [Commented] (ARROW-1316) hdfs connector stand-alone

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111075#comment-16111075
 ] 

Wes McKinney commented on ARROW-1316:
-

I am not sure this is possible. To use libhdfs to access an HDFS cluster, you 
need:

* A JVM installation
* The Hadoop client libraries in your classpath
* File system-like API for the libhdfs library

These are provided respectively by the JDK install, the Hadoop install, and the 
Arrow libraries. The Arrow interface to HDFS provides an API consistent with 
that of other files (https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/hdfs.h). 
This is the same approach used in TensorFlow 
(https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/hadoop/hadoop_file_system.h)
 and other projects. 
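
As a rough illustration of the three requirements listed above, the sketch below reports which prerequisites look unset in the environment. The variable names JAVA_HOME, HADOOP_HOME and CLASSPATH are an assumption taken from common libhdfs setups; the comment itself names only the requirements, not the variables, and the helper name is hypothetical.

```python
import os

# Assumption: these are the variable names commonly used for libhdfs setups;
# the comment above lists the requirements, not the variables themselves.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")


def missing_prerequisites(env=None):
    """Return the names in REQUIRED_VARS that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_prerequisites() with no argument inspects the current process environment; an empty list only means the variables are set, not that their values are correct.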

> hdfs connector stand-alone
> --
>
> Key: ARROW-1316
> URL: https://issues.apache.org/jira/browse/ARROW-1316
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Martin Durant
>
> Currently, access to hdfs via libhdfs requires the whole of arrow, a java 
> installation and a hadoop installation. This setup is indeed common, such as 
> on "cluster edge-nodes".
> This issue is posted with the wish that hdfs file-system access could be done 
> without needing the whole set of installations, above.





[jira] [Created] (ARROW-1320) hdfs block locations

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1320:


 Summary: hdfs block locations
 Key: ARROW-1320
 URL: https://issues.apache.org/jira/browse/ARROW-1320
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Martin Durant


To provide a function which can return the set of machines on which the data 
blocks of a given hdfs file are stored. This is best for scheduling systems 
(e.g., dask) which can move the computation to the machine which has the data, 
and so cut out network data traffic.
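
To make the scheduling idea concrete, here is a minimal sketch of how a scheduler might use block locations once they are available. The input shape (one list of host names per block) and the function name are hypothetical; nothing here reflects an existing Arrow or dask API.

```python
from collections import Counter


def most_local_host(block_hosts):
    """Pick the host that stores the most blocks of a file.

    block_hosts: one entry per block, each a list of host names holding a
    replica of that block (hypothetical input shape).
    """
    # Count each host at most once per block, even if listed twice.
    counts = Counter(host for hosts in block_hosts for host in set(hosts))
    if not counts:
        return None
    # Break ties by host name so the choice is deterministic.
    return min(counts, key=lambda host: (-counts[host], host))
```

A real scheduler would also weigh load and block sizes; this only shows why the host list per block is the piece of information the function needs to return.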





[jira] [Created] (ARROW-1319) hdfs methods

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1319:


 Summary: hdfs methods
 Key: ARROW-1319
 URL: https://issues.apache.org/jira/browse/ARROW-1319
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Martin Durant


The python library hdfs3 http://hdfs3.readthedocs.io/en/latest/api.html 
contains a wider set of file-system methods than arrow's python bindings. These 
are probably simple to implement for arrow-hdfs.





[jira] [Created] (ARROW-1318) hdfs access with auth

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1318:


 Summary: hdfs access with auth
 Key: ARROW-1318
 URL: https://issues.apache.org/jira/browse/ARROW-1318
 Project: Apache Arrow
  Issue Type: Test
Reporter: Martin Durant


A wide variety of authentication schemes are available in hadoop.

This issue is to track whether libhdfs can successfully operate with them. The 
list includes:
- user/password
- basic kerberos (via kinit and via keytabs)
- kerberos with active directory and single-sign-on
- "privacy" and "integrity" modes
- access with hdfs delegation token
- probably others...





[jira] [Created] (ARROW-1317) hdfs environment variables

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1317:


 Summary: hdfs environment variables
 Key: ARROW-1317
 URL: https://issues.apache.org/jira/browse/ARROW-1317
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Martin Durant


Getting access to hdfs via libhdfs requires the setting of several environment 
variables. 
Many of these paths should be auto-detectable requiring less or perhaps even no 
information from the user. This would lower the access barrier to hdfs for a 
non-dev user.
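
One possible form of such auto-detection, sketched under assumptions: if HADOOP_HOME is not set, guess it from the location of the `hadoop` launcher on the PATH, assuming the conventional `<home>/bin/hadoop` layout. The function name and the layout assumption are illustrative, not something Arrow is known to do.

```python
import os
import shutil


def guess_hadoop_home(env=None, which=shutil.which):
    """Return HADOOP_HOME if set, else a guess derived from the PATH."""
    env = os.environ if env is None else env
    if env.get("HADOOP_HOME"):
        return env["HADOOP_HOME"]
    launcher = which("hadoop")
    if launcher is None:
        return None
    # Assumed layout: <home>/bin/hadoop, so strip the last two components.
    return os.path.dirname(os.path.dirname(launcher))
```

The `which` parameter exists only so the lookup can be replaced in tests; distributions that install the launcher as a symlink or wrapper script would need extra handling.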





[jira] [Commented] (ARROW-1313) [C++/Python] Add troubleshooting section for setting up HDFS JNI interface

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111067#comment-16111067
 ] 

Wes McKinney commented on ARROW-1313:
-

My understanding is that the safest thing to do in production is use the 
libhdfs.so that is shipped with a particular Hadoop distribution (since there 
may be internal details that are particular to that version of Hadoop); while 
the public C API is the same between versions, in theory there could be 
internal details in the JNI implementation that break the Java "ABI". The 
Hadoop community would be able to give better advice.
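
A small sketch for checking whether a given Hadoop distribution ships the shared library at all, which is the question raised in this issue. It simply walks the installation directory; the function name is hypothetical and no particular subdirectory layout is assumed.

```python
import os


def find_libhdfs(hadoop_home, filename="libhdfs.so"):
    """Return sorted paths to `filename` found anywhere under hadoop_home."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(hadoop_home):
        if filename in filenames:
            matches.append(os.path.join(dirpath, filename))
    return sorted(matches)
```

An empty result for `libhdfs.so` alongside a hit for `libhdfs.a` would match the situation described in the issue body.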

> [C++/Python] Add troubleshooting section for setting up HDFS JNI interface
> --
>
> Key: ARROW-1313
> URL: https://issues.apache.org/jira/browse/ARROW-1313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
> Environment: linux trusty-cdh5
>Reporter: Martin Durant
> Fix For: 0.6.0
>
>
> The hadoop library directory contains a libhdfs.a and a libhadoop.so but no 
> libhdfs.so.





[jira] [Created] (ARROW-1316) hdfs connector stand-alone

2017-08-02 Thread Martin Durant (JIRA)
Martin Durant created ARROW-1316:


 Summary: hdfs connector stand-alone
 Key: ARROW-1316
 URL: https://issues.apache.org/jira/browse/ARROW-1316
 Project: Apache Arrow
  Issue Type: Wish
Reporter: Martin Durant


Currently, access to hdfs via libhdfs requires the whole of arrow, a java 
installation and a hadoop installation. This setup is indeed common, such as on 
"cluster edge-nodes".

This issue is posted with the wish that hdfs file-system access could be done 
without needing the whole set of installations, above.





[jira] [Commented] (ARROW-786) [Format] In-memory format for 128-bit Decimals, handling of sign bit

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111057#comment-16111057
 ] 

Wes McKinney commented on ARROW-786:


OK, sweet, that would be awesome. 

> [Format] In-memory format for 128-bit Decimals, handling of sign bit
> 
>
> Key: ARROW-786
> URL: https://issues.apache.org/jira/browse/ARROW-786
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format
>Reporter: Wes McKinney
> Fix For: 0.7.0
>
>
> cc [~cpcloud]
> We found in ARROW-655 that we needed to add an extra bit for signedness for 
> decimals stored as 128-bit values to be able to use the Boost multiprecision 
> libraries. This makes Decimal128 not fit completely neatly as a 16-byte fixed 
> size binary value, and more of a {{struct fixed_size_binary(16)>}}. What is the current format in the Java 
> implementation? We will need to document the memory layout for decimals that 
> maximizes compatibility across languages and eventually implement integration 
> tests for IPC. 





[jira] [Commented] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111048#comment-16111048
 ] 

Wes McKinney commented on ARROW-1314:
-

Note that the {{pyarrow.hdfs}} namespace is new in 0.6.0 (releasing in the next 
couple of weeks). To connect with <= 0.5.0, use {{pyarrow.HdfsClient}}.
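
For code that has to run against both sides of that version boundary, a lookup along these lines may help. It only checks which of the two names mentioned above exists on the module object it is given; the helper name is hypothetical, and how the returned object is then called differs between versions and is not assumed here.

```python
def hdfs_entry_point(pa):
    """Return (name, object) for the HDFS entry point exposed by module pa.

    Prefers the newer `hdfs` namespace and falls back to `HdfsClient`.
    """
    for name in ("hdfs", "HdfsClient"):
        obj = getattr(pa, name, None)
        if obj is not None:
            return name, obj
    raise AttributeError("no HDFS entry point found on this module")
```

Typical use would be `import pyarrow as pa` followed by `hdfs_entry_point(pa)`; with 0.5.0 this would be expected to return the `HdfsClient` name, consistent with the AttributeError quoted below.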

> [C++] Provide installation guidance for macOS users who wish to use JNI-based 
> HDFS interface
> 
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeError                            Traceback (most recent call last)
>  in ()
> ----> 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Updated] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1314:

Summary: [C++] Provide installation guidance for macOS users who wish to 
use JNI-based HDFS interface  (was: libhdfs installation didn't work - mac)

> [C++] Provide installation guidance for macOS users who wish to use JNI-based 
> HDFS interface
> 
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeError                            Traceback (most recent call last)
>  in ()
> ----> 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Updated] (ARROW-1314) [C++] Provide installation guidance for macOS users who wish to use JNI-based HDFS interface

2017-08-02 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1314:

Component/s: Documentation

> [C++] Provide installation guidance for macOS users who wish to use JNI-based 
> HDFS interface
> 
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeError                            Traceback (most recent call last)
>  in ()
> ----> 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Created] (ARROW-1315) [GLib] Status check of arrow::ArrayBuilder::Finish() is missing

2017-08-02 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-1315:
---

 Summary: [GLib] Status check of arrow::ArrayBuilder::Finish() is 
missing
 Key: ARROW-1315
 URL: https://issues.apache.org/jira/browse/ARROW-1315
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
Priority: Minor
 Fix For: 0.6.0








[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111043#comment-16111043
 ] 

Wes McKinney commented on ARROW-1314:
-

I don't think Linux shared libraries (like libhdfs.so, libhdfs3.so) can be 
loaded on Mac. So libhdfs needs to be compiled for the macOS architecture. It 
looks like some other projects have documented this; we could go through the 
exercise and add it to the project documentation: 
https://github.com/forward/node-hdfs#mac-osx
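
One way to confirm that a given library file is a Linux build rather than a macOS one is to look at its first four bytes. The sketch below classifies a file header by the well-known ELF and Mach-O magic numbers; the function name is hypothetical, and it is only a diagnostic aid, not part of Arrow.

```python
def binary_format(header):
    """Classify a shared library by the first four bytes of the file."""
    magic = bytes(header[:4])
    if magic == b"\x7fELF":
        return "ELF"  # Linux and most other Unix systems
    if magic in (b"\xcf\xfa\xed\xfe", b"\xce\xfa\xed\xfe",
                 b"\xfe\xed\xfa\xcf", b"\xfe\xed\xfa\xce"):
        return "Mach-O"  # macOS, single architecture (either byte order)
    if magic == b"\xca\xfe\xba\xbe":
        # Java class files share this magic number, so treat with care.
        return "Mach-O (universal)"
    return "unknown"
```

For example, `binary_format(open(path, "rb").read(4))` on a `libhdfs.so` taken from a Linux Hadoop download would be expected to report "ELF", which the macOS loader cannot use.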

> libhdfs installation didn't work - mac
> --
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeError                            Traceback (most recent call last)
>  in ()
> ----> 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac

2017-08-02 Thread Martin Durant (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111041#comment-16111041
 ] 

Martin Durant commented on ARROW-1314:
--

It is the general distribution, e.g., 
http://mirror.csclub.uwaterloo.ca/apache/hadoop/common/ (which is, of course, 
just java).

If the answer is "you shouldn't run hadoop on mac", I understand; however, I 
did get hdfs3 working with this distro.

> libhdfs installation didn't work - mac
> --
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeError                            Traceback (most recent call last)
>  in ()
> ----> 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Commented] (ARROW-1314) libhdfs installation didn't work - mac

2017-08-02 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111021#comment-16111021
 ] 

Wes McKinney commented on ARROW-1314:
-

Where did you obtain the Hadoop distribution for Mac? 

> libhdfs installation didn't work - mac
> --
>
> Key: ARROW-1314
> URL: https://issues.apache.org/jira/browse/ARROW-1314
> Project: Apache Arrow
>  Issue Type: Improvement
> Environment: mac 10.12.6 
>Reporter: Martin Durant
>
> Having set 
> HADOOP_HOME /Users/mdurant/Downloads/hadoop-2.8.1  (straight download, does 
> contain libhdfs.so in native)
> java openjdk version "1.8.0_121" in anaconda install directory
> and CLASSPATH as in the docs (too long to show)
> ```
> In [3]: pa.hdfs
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
> > 1 pa.hdfs
> AttributeError: module 'pyarrow' has no attribute 'hdfs'
> In [4]: pa.have_libhdfs()
> Out[4]: False
> In [5]: pa.have_libhdfs3()
> Out[5]: False
> ```
> (I also have libhdfs3.so - not .dylib - but it is not found even if included 
> in DYLD_FALLBACK_LIBRARY_PATH)





[jira] [Commented] (ARROW-1313) [C++/Python] Add troubleshooting section for setting up HDFS JNI interface

2017-08-02 Thread Martin Durant (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111018#comment-16111018
 ] 

Martin Durant commented on ARROW-1313:
--

That would install the whole of hadoop as system packages, so there would be 
two separate installations, counting the CDH install from before. 
libhdfs.so is only 200kB, can it not be distributed?

> [C++/Python] Add troubleshooting section for setting up HDFS JNI interface
> --
>
> Key: ARROW-1313
> URL: https://issues.apache.org/jira/browse/ARROW-1313
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
> Environment: linux trusty-cdh5
>Reporter: Martin Durant
> Fix For: 0.6.0
>
>
> The hadoop library directory contains a libhdfs.a and a libhadoop.so but no 
> libhdfs.so.





[jira] [Assigned] (ARROW-1296) [Java] templates/FixValueVectors reset() method doesn't set allocationSizeInBytes correctly

2017-08-02 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin reassigned ARROW-1296:
-

Assignee: Li Jin

> [Java] templates/FixValueVectors reset() method doesn't set 
> allocationSizeInBytes correctly
> ---
>
> Key: ARROW-1296
> URL: https://issues.apache.org/jira/browse/ARROW-1296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Affects Versions: 0.5.0
>Reporter: Li Jin
>Assignee: Li Jin
> Fix For: 0.6.0
>
>
> [~siddteotia] pointed out reset() in templates/FixValueVectors.java should 
> set:
> {code}
> allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type.width}
> {code}
> instead of:
> {code}
> allocationSizeInBytes = INITIAL_VALUE_ALLOCATION
> {code}
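
The difference between the two assignments is one of units: judging by the names, INITIAL_VALUE_ALLOCATION is a count of values while allocationSizeInBytes is a byte count. A small worked example, written in Python for brevity and assuming a value of 4096 for the constant purely for illustration:

```python
# Assumed value count, for illustration only; the issue does not state it.
INITIAL_VALUE_ALLOCATION = 4096


def reset_allocation_size(type_width_bytes):
    """Bytes needed for INITIAL_VALUE_ALLOCATION values of the given width."""
    return INITIAL_VALUE_ALLOCATION * type_width_bytes
```

Under that assumption, an 8-byte type needs 32768 bytes after reset(), whereas the unmultiplied assignment would reserve room for only 512 such values.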




