[jira] [Commented] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371177#comment-16371177
 ] 

Antoine Pitrou commented on ARROW-2192:
---

I see... So we should be able to decide this based on the environment variables 
exported by Travis-CI:

https://docs.travis-ci.com/user/environment-variables/#Convenience-Variables
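As a sketch of how such a check could look (the environment variables below are the documented Travis-CI convenience variables; the decision logic itself is an assumption, not necessarily what the eventual patch does):

```python
import os

def should_run_all_builds(env=None):
    """Decide whether to run the full CI matrix.

    Travis-CI exports TRAVIS_BRANCH and TRAVIS_PULL_REQUEST for every job;
    on a direct push build, TRAVIS_PULL_REQUEST is the string "false",
    otherwise it is the pull request number.
    """
    env = os.environ if env is None else env
    is_pull_request = env.get("TRAVIS_PULL_REQUEST", "false") != "false"
    branch = env.get("TRAVIS_BRANCH", "")
    # Push builds on master run everything; PR builds stay selective.
    return branch == "master" and not is_pull_request
```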

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> After ARROW-2083, we only run the builds related to changed components 
> for each patch in Travis CI and AppVeyor. 
> The problem with this is that when we merge patches to master, our Travis CI 
> configuration (implemented by ASF Infra to help alleviate clogged build 
> queues) is set up to cancel in-progress builds whenever a new commit is 
> merged.
> So our timeline could basically contain:
> * Patch merged affecting C++, Python
> * Patch merged affecting Java
> * Patch merged affecting JS
> When the Java patch is merged, any in-progress C++/Python builds will be 
> cancelled; and if the JS patch comes in, the Java builds will be immediately 
> cancelled in turn.
> In light of this, I believe we should always run all of the builds 
> unconditionally on the master branch.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2192:
-

Assignee: Antoine Pitrou

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>





[jira] [Updated] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-2192:
--
Component/s: Continuous Integration

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>





[jira] [Commented] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371207#comment-16371207
 ] 

ASF GitHub Bot commented on ARROW-2192:
---

pitrou opened a new pull request #1634: ARROW-2192: [CI] Always build on master 
branch and repository
URL: https://github.com/apache/arrow/pull/1634
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>





[jira] [Updated] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2192:
--
Labels: pull-request-available  (was: )

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>





[jira] [Created] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2193:
-

 Summary: [Plasma] plasma_store forks endlessly
 Key: ARROW-2193
 URL: https://issues.apache.org/jira/browse/ARROW-2193
 Project: Apache Arrow
  Issue Type: Bug
  Components: Plasma (C++)
Reporter: Antoine Pitrou


I'm not sure why, but when I run the pyarrow test suite (for example {{py.test 
pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:

{code:bash}
$ ps fuwww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
[...]
antoine  27869 12.0  0.4 863208 68976 pts/7    S    13:41   0:01 /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
antoine  27885 13.0  0.4 863076 68560 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
antoine  27901 12.1  0.4 863076 68320 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
antoine  27920 13.6  0.4 863208 68868 pts/7    S    13:41   0:01  \_ /home/antoine/miniconda3/envs/pyarrow/bin/python /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 -m 1
[etc.]
{code}
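A quick way to confirm the runaway forking is to count matching entries in the `ps` output over time and check that the number keeps growing. This is a throwaway diagnostic sketch, not part of Arrow:

```python
def count_matching(ps_output, needle):
    """Count lines in `ps` output whose command column contains `needle`.

    The first line of `ps` output is the USER/PID header, so it is skipped.
    """
    lines = ps_output.strip().splitlines()
    return sum(1 for line in lines[1:] if needle in line)
```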






[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371363#comment-16371363
 ] 

Antoine Pitrou commented on ARROW-2193:
---

After rebuilding from scratch, I get another issue:

{code:bash}
$ plasma_store -m 10
plasma_store: error while loading shared libraries: libboost_system.so.1.66.0: cannot open shared object file: No such file or directory
{code}

which is quite weird since I have the {{boost-cpp}} package installed from 
conda-forge:

{code:bash}
$ which plasma_store 
/home/antoine/miniconda3/envs/pyarrow/bin/plasma_store
$ locate libboost_system.so.1.66.0
[...]
/home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.66.0
{code}
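One way to see why the loader fails while locate succeeds: ld.so only consults rpath entries, LD_LIBRARY_PATH, and the ld.so cache, and does not know about the conda environment's lib directory unless it is on one of those paths. A rough illustrative model of the LD_LIBRARY_PATH part of that search (a simplification, not how the loader is actually implemented):

```python
import os

def find_shared_library(soname, search_path):
    """Return the first path containing `soname`, mimicking (very roughly)
    how the dynamic loader walks LD_LIBRARY_PATH entries in order."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, soname)
        if os.path.exists(candidate):
            return candidate
    return None  # loader would fall back to ld.so.cache / default dirs
```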

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371514#comment-16371514
 ] 

Wes McKinney commented on ARROW-2193:
-

I guess libboost_system is not in LD_LIBRARY_PATH. Boost ought to be 
statically linked if possible, though.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Commented] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371533#comment-16371533
 ] 

Wes McKinney commented on ARROW-2192:
-

I think the AppVeyor builds are OK as is (they will always run), but we can 
address them further in the future if need be.

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>





[jira] [Commented] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371534#comment-16371534
 ] 

ASF GitHub Bot commented on ARROW-2176:
---

alendit commented on issue #1629: ARROW-2176: [C++] Extend DictionaryBuilder to 
support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629#issuecomment-367360826
 
 
   Added a comment to the `DictionaryBuilder` and amended the comments on 
`Finish` and `FinishInternal`. Also rebased on the newest `master`.




> [C++] Extend DictionaryBuilder to support delta dictionaries
> 
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies the 
> possibility of sending additional dictionary batches with a previously seen 
> id and an isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionary support. The usage API can be seen in the dictionary 
> tests (i.e. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user simply reuses the builder object after 
> calling Finish(Array*) for the first time. Subsequent calls to Append will 
> create new entries only for unseen elements and reuse ids from previous 
> dictionaries for the seen ones.
> Some considerations:
>  # The API is pretty implicit; an additional flag for Finish which 
> explicitly indicates the desire to use the builder for delta dictionary 
> generation might be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup on each GetItem or Append call. I assume we might get away with 
> returning Array slices at Finish, which would remove the need for the 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 
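The delta mechanism described above can be illustrated with a small language-neutral model. This Python sketch is not the C++ API (the class name is invented for illustration), but it mirrors the described behaviour: ids of already-seen values are reused, and each Finish emits only the dictionary entries added since the previous one:

```python
class DeltaDictionaryBuilder:
    """Toy model of a dictionary builder that emits delta dictionaries."""

    def __init__(self):
        self._index = {}      # value -> id, persistent across finish() calls
        self._delta = []      # values first seen since the last finish()
        self._indices = []    # encoded indices of the current batch

    def append(self, value):
        # Unseen values get the next id and land in the delta dictionary;
        # seen values reuse their id from a previous dictionary.
        if value not in self._index:
            self._index[value] = len(self._index)
            self._delta.append(value)
        self._indices.append(self._index[value])

    def finish(self):
        """Return (indices, delta_dictionary) and reset per-batch state."""
        out = (self._indices, self._delta)
        self._indices, self._delta = [], []
        return out
```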





[jira] [Assigned] (ARROW-1780) JDBC Adapter for Apache Arrow

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1780:
---

Assignee: Atul Dambalkar

> JDBC Adapter for Apache Arrow
> -
>
> Key: ARROW-1780
> URL: https://issues.apache.org/jira/browse/ARROW-1780
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Atul Dambalkar
>Assignee: Atul Dambalkar
>Priority: Major
> Fix For: 0.10.0
>
>
> At a high level the JDBC Adapter will allow upstream apps to query RDBMS data 
> over JDBC and get the JDBC objects converted to Arrow objects/structures. The 
> upstream utility can then work with Arrow objects/structures with usual 
> performance benefits. The utility will be very much similar to C++ 
> implementation of "Convert a vector of row-wise data into an Arrow table" as 
> described here - 
> https://arrow.apache.org/docs/cpp/md_tutorials_row_wise_conversion.html
> The utility will read data from the RDBMS and convert it into Arrow 
> objects/structures; so from that perspective, this utility reads data from 
> the RDBMS. Whether it can also push Arrow objects back to the RDBMS still 
> needs to be discussed, and is out of scope for this utility for now. 
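The core of such an adapter is a row-to-column transposition. A minimal Python sketch of that step (plain lists standing in for Arrow builders; the function name is hypothetical):

```python
def rows_to_columns(rows, column_names):
    """Transpose row-wise records (e.g. rows from a JDBC ResultSet)
    into a columnar layout.

    Columnar layout is what Arrow builders consume: one contiguous
    sequence of values per field instead of one tuple per record.
    """
    columns = {name: [] for name in column_names}
    for row in rows:
        for name, value in zip(column_names, row):
            columns[name].append(value)
    return columns
```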





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371565#comment-16371565
 ] 

Antoine Pitrou commented on ARROW-2193:
---

What's weird is that it used to work correctly less than one week ago. Did 
plasma_store start relying on boost very recently?

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371589#comment-16371589
 ] 

Wes McKinney commented on ARROW-2193:
-

Sounds like a regression related to the build toolchain. Are you able to git 
bisect to find the commit that broke things?

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>





[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-799:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] Provide guidance in documentation for using Arrow in an uberjar 
> setting 
> ---
>
> Key: ARROW-799
> URL: https://issues.apache.org/jira/browse/ARROW-799
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Jingyuan Wang
>Assignee: Li Jin
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently, ArrowBuf class directly access the package-private fields of 
> AbstractByteBuf class which makes shading Apache Arrow problematic. If we 
> relocate io.netty namespace excluding io.netty.buffer.ArrowBuf, it would 
> throw out IllegalAccessException.





[jira] [Updated] (ARROW-1913) [Java] Fix Javadoc generation bugs with JDK8

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1913:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] Fix Javadoc generation bugs with JDK8
> 
>
> Key: ARROW-1913
> URL: https://issues.apache.org/jira/browse/ARROW-1913
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Priority: Minor
> Fix For: 0.10.0
>
>
> While trying to cut the release candidate, the source release script fails 
> due to various new Javadoc issues





[jira] [Updated] (ARROW-2077) [Python] Document on how to use Storefact & Arrow to read Parquet from S3/Azure/...

2018-02-21 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-2077:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Document on how to use Storefact & Arrow to read Parquet from 
> S3/Azure/...
> ---
>
> Key: ARROW-2077
> URL: https://issues.apache.org/jira/browse/ARROW-2077
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We're using this happily in production, also with column projection down to 
> the storage layer. Others should also benefit from this.





[jira] [Updated] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1731:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Provide for selecting a subset of columns to convert in 
> RecordBatch/Table.from_pandas
> --
>
> Key: ARROW-1731
> URL: https://issues.apache.org/jira/browse/ARROW-1731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently it's all-or-nothing, and to do the subsetting in pandas incurs a 
> data copy. This would enable columns (by name or index) to be selected out 
> without additional data copying
> cc [~cpcloud] [~jreback]





[jira] [Updated] (ARROW-1858) [Python] Add documentation about parquet.write_to_dataset and related methods

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1858:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add documentation about parquet.write_to_dataset and related methods
> -
>
> Key: ARROW-1858
> URL: https://issues.apache.org/jira/browse/ARROW-1858
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See 
> https://stackoverflow.com/questions/47482434/can-pyarrow-write-multiple-parquet-files-to-a-folder-like-fastparquets-file-sch





[jira] [Updated] (ARROW-1949) [Python] Add option to Array.from_pandas and pyarrow.array to perform unsafe casts

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1949:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add option to Array.from_pandas and pyarrow.array to perform unsafe 
> casts
> --
>
> Key: ARROW-1949
> URL: https://issues.apache.org/jira/browse/ARROW-1949
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Per mailing list thread





[jira] [Commented] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371691#comment-16371691
 ] 

Wes McKinney commented on ARROW-2038:
-

[~jim.crist] [~cpcloud] [~xhochy] can we see if something needs to be fixed 
here for 0.9.0?

> [Python] Follow-up bug fixes for s3fs Parquet support
> -
>
> Key: ARROW-2038
> URL: https://issues.apache.org/jira/browse/ARROW-2038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see discussion in 
> https://github.com/apache/arrow/pull/916#issuecomment-360558248





[jira] [Updated] (ARROW-2057) [Python] Configure size of data pages in pyarrow.parquet.write_table

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2057:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Configure size of data pages in pyarrow.parquet.write_table
> 
>
> Key: ARROW-2057
> URL: https://issues.apache.org/jira/browse/ARROW-2057
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> It would be useful to be able to set the size of data pages (within Parquet 
> column chunks) from Python





[jira] [Updated] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1983:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add ability to write parquet `_metadata` file
> --
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently `pyarrow.parquet` can only write the `_common_metadata` file 
> (mostly just schema information). It would be useful to add the ability to 
> write a `_metadata` file as well. This should include information about each 
> row group in the dataset, including summary statistics. Having this summary 
> file would allow filtering of row groups without needing to access each file 
> beforehand.
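The kind of pruning this enables can be sketched as follows; the stats layout (a (min, max) pair per column per row group) is an assumption for illustration, not the actual `_metadata` format:

```python
def prune_row_groups(row_group_stats, column, value):
    """Return indices of row groups that may contain `value` in `column`.

    `row_group_stats` is a list of dicts mapping column name to a
    (min, max) pair, the kind of summary a `_metadata` file would hold.
    Row groups whose range excludes `value` can be skipped entirely,
    without opening the underlying files.
    """
    keep = []
    for i, stats in enumerate(row_group_stats):
        lo, hi = stats[column]
        if lo <= value <= hi:
            keep.append(i)
    return keep
```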





[jira] [Updated] (ARROW-2113) [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the classpath setting HDFS logic

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2113:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Incomplete CLASSPATH with "hadoop" contained in it can fool the 
> classpath setting HDFS logic
> -
>
> Key: ARROW-2113
> URL: https://issues.apache.org/jira/browse/ARROW-2113
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Linux Redhat 7.4, Anaconda 4.4.7, Python 2.7.12, CDH 
> 5.13.1
>Reporter: Michal Danko
>Priority: Major
> Fix For: 0.10.0
>
>
> Steps to replicate the issue:
> mkdir /tmp/test
>  cd /tmp/test
>  mkdir jars
>  cd jars
>  touch test1.jar
>  mkdir -p ../lib/zookeeper
>  cd ../lib/zookeeper
>  ln -s ../../jars/test1.jar ./test1.jar
>  ln -s test1.jar test.jar
>  mkdir -p ../hadoop/lib
>  cd ../hadoop/lib
>  ln -s ../../../lib/zookeeper/test.jar ./test.jar
> (This part depends on your configuration; you need these values for 
> pyarrow.hdfs to work:)
> (path to libjvm: )
> (export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera)
> (path to libhdfs: )
> (export 
> LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib64/)
> export CLASSPATH="/tmp/test/lib/hadoop/lib/test.jar"
> python
>  import pyarrow.hdfs as hdfs;
>  fs = hdfs.connect(user="hdfs")
>  
> Ends with error:
> 
>  loadFileSystems error:
>  (unable to get root cause for java.lang.NoClassDefFoundError)
>  (unable to get stack trace for java.lang.NoClassDefFoundError)
>  hdfsBuilderConnect(forceNewInstance=0, nn=default, port=0, 
> kerbTicketCachePath=(NULL), userName=pa) error:
>  (unable to get root cause for java.lang.NoClassDefFoundError)
>  (unable to get stack trace for java.lang.NoClassDefFoundError)
>  Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 
> 170, in connect
>  kerb_ticket=kerb_ticket, driver=driver)
>  File "/opt/pa/anaconda2/lib/python2.7/site-packages/pyarrow/hdfs.py", line 
> 37, in __init__
>  self._connect(host, port, user, kerb_ticket, driver)
>  File "pyarrow/io-hdfs.pxi", line 87, in 
> pyarrow.lib.HadoopFileSystem._connect 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:61673)
>  File "pyarrow/error.pxi", line 79, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8345)
>  pyarrow.lib.ArrowIOError: HDFS connection failed
>  -
>  
> export CLASSPATH="/tmp/test/lib/zookeeper/test.jar"
>  python
>  import pyarrow.hdfs as hdfs;
>  fs = hdfs.connect(user="hdfs")
>  
> Works properly.
>  
> I can't find a reason why the first CLASSPATH doesn't work while the second one 
> does, because it is a path to the same .jar, just with an extra symlink in it. To 
> me, it looks like pyarrow.lib.check has a problem with symlinks defined with 
> many ../.../.. segments.
> I would expect pyarrow to work with any definition of the path to the .jar.
> Please note that the paths are not generated at random; they are copied from the 
> Cloudera distribution of Hadoop (the original file was zookeeper.jar).
> Because of this issue, our customer currently can't use the pyarrow library for 
> Oozie workflows.
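The symlink chain from the report can be reproduced with the standard library to confirm that both CLASSPATH entries resolve to the very same jar, which is why identical behavior would be expected:

```python
import os
import tempfile

# Rebuild the report's directory layout and symlink chain in a temp dir,
# then resolve both CLASSPATH entries to their real targets.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "jars"))
open(os.path.join(base, "jars", "test1.jar"), "w").close()
os.makedirs(os.path.join(base, "lib", "zookeeper"))
os.symlink("../../jars/test1.jar", os.path.join(base, "lib", "zookeeper", "test1.jar"))
os.symlink("test1.jar", os.path.join(base, "lib", "zookeeper", "test.jar"))
os.makedirs(os.path.join(base, "lib", "hadoop", "lib"))
os.symlink("../../../lib/zookeeper/test.jar",
           os.path.join(base, "lib", "hadoop", "lib", "test.jar"))

failing = os.path.realpath(os.path.join(base, "lib", "hadoop", "lib", "test.jar"))
working = os.path.realpath(os.path.join(base, "lib", "zookeeper", "test.jar"))
print(failing == working)  # True: the same jar sits behind both paths
```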



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2059) [Python] Possible performance regression in Feather read/write path

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2059:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Possible performance regression in Feather read/write path
> ---
>
> Key: ARROW-2059
> URL: https://issues.apache.org/jira/browse/ARROW-2059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://github.com/wesm/feather/issues/329. Needs to be 
> investigated



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2060) [Python] Documentation for creating StructArray using from_arrays or a sequence of dicts

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2060:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Documentation for creating StructArray using from_arrays or a 
> sequence of dicts
> 
>
> Key: ARROW-2060
> URL: https://issues.apache.org/jira/browse/ARROW-2060
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Follow-up work to ARROW-1705 and ARROW-1706
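The two construction paths such documentation would cover can be sketched without pyarrow: a `from_arrays`-style constructor takes one array per field, while a sequence of dicts first has to be pivoted into that column-wise form. Plain lists stand in for Arrow arrays here:

```python
# A sequence of dicts (row-wise records) pivoted into the column-wise
# arrays a from_arrays-style constructor expects.
records = [{"x": 1, "y": "a"}, {"x": 2, "y": "b"}, {"x": 3, "y": "c"}]

field_names = list(records[0])
columns = {name: [rec[name] for rec in records] for name in field_names}
print(columns)  # {'x': [1, 2, 3], 'y': ['a', 'b', 'c']}
```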



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371699#comment-16371699
 ] 

Wes McKinney commented on ARROW-2131:
-

This can be possibly solved by modifying the environment when opening the 
subprocess: 
https://stackoverflow.com/questions/2231227/python-subprocess-popen-with-a-modified-environment
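The suggested fix amounts to copying the parent environment and extending it before spawning the child. A minimal sketch, where the directory added to `PYTHONPATH` is a placeholder rather than the real build layout:

```python
import os
import subprocess
import sys

# Copy the parent environment and extend it before spawning the child;
# in the failing test, PYTHONPATH would point at the in-place pyarrow build
# (the path below is a placeholder, not the actual layout).
env = os.environ.copy()
env["PYTHONPATH"] = os.pathsep.join(
    filter(None, ["/path/to/arrow/python", env.get("PYTHONPATH")]))

out = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['PYTHONPATH'])"],
    env=env)
print(out.decode().strip().split(os.pathsep)[0])  # /path/to/arrow/python
```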

> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> =================================== FAILURES ===================================
> ________________ test_deserialize_buffer_in_different_process _________________
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']' returned non-zero 
> exit status 1.
> C:\Miniconda3\envs\pyarrow-dev\lib\subprocess.py:291: CalledProcessError
> ---------------------------- Captured stderr call ----------------------------
> Traceback (most recent call last):
>   File "C:\Users\wesm\code\arrow\python\pyarrow\tests\deserialize_buffer.py", 
> line 22, in <module>
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> =============== 1 failed, 15 passed, 4 skipped in 0.40 seconds ================
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2182) [Python] ASV benchmark setup does not account for C++ library changing

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2182:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] ASV benchmark setup does not account for C++ library changing
> --
>
> Key: ARROW-2182
> URL: https://issues.apache.org/jira/browse/ARROW-2182
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See https://github.com/apache/arrow/blob/master/python/README-benchmarks.md
> Perhaps we could create a helper script that will run all the 
> currently-defined benchmarks for a specific commit, and ensure that we are 
> running against pristine, up-to-date release builds of Arrow (and any other 
> dependencies, like parquet-cpp) at that commit? 
> cc [~pitrou]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1643:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
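What the ticket asks for amounts to recognizing the `hdfs://` scheme and splitting out connection parameters before dispatching to the HDFS client; the standard library already handles the parsing:

```python
from urllib.parse import urlparse

# Split an hdfs:// URI into the pieces a reader needs to open a connection
# (host, port) and to locate the file (path).
uri = "hdfs://namenode:8020/data/table.parquet"
parsed = urlparse(uri)
print(parsed.scheme, parsed.hostname, parsed.port, parsed.path)
# hdfs namenode 8020 /data/table.parquet
```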




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1989) [Python] Better UX on timestamp conversion to Pandas

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1989:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Better UX on timestamp conversion to Pandas
> 
>
> Key: ARROW-1989
> URL: https://issues.apache.org/jira/browse/ARROW-1989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> When converting timestamp columns to Pandas, users often have dates that fall 
> outside the range Pandas can represent with its nanosecond representation. 
> Currently they simply see an Arrow exception and think the problem is caused 
> by Arrow. We should try to change the error from
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: XX
> {code}
> to something along the lines of 
> {code}
> ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> XX. This conversion is needed as Pandas only supports nanosecond 
> timestamps. Your data is likely out of the range that can be represented with 
> nanosecond resolution.
> {code}
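The range limit behind these errors follows from pandas storing timestamps as signed 64-bit nanosecond counts since the Unix epoch; a quick stdlib calculation shows where that range ends:

```python
import datetime

# A signed 64-bit nanosecond counter from the Unix epoch tops out in the
# year 2262, which is why later dates cannot survive a cast to timestamp[ns].
EPOCH = datetime.datetime(1970, 1, 1)
max_ns = 2**63 - 1
latest = EPOCH + datetime.timedelta(microseconds=max_ns // 1000)
print(latest.year)  # 2262
```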



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1682:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add documentation / example for reading a directory of Parquet files 
> on S3
> ---
>
> Key: ARROW-1682
> URL: https://issues.apache.org/jira/browse/ARROW-1682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Opened based on comment 
> https://github.com/apache/arrow/pull/916#issuecomment-337563492



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2061) [C++] Run ASAN builds in Travis CI

2018-02-21 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-2061:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Run ASAN builds in Travis CI
> --
>
> Key: ARROW-2061
> URL: https://issues.apache.org/jira/browse/ARROW-2061
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> ASAN might be a better alternative to valgrind in builds where we have clang 
> available. As part of this, we should also document how users can run their 
> own local ASAN builds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2041) [Python] pyarrow.serialize has high overhead for list of NumPy arrays

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2041:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] pyarrow.serialize has high overhead for list of NumPy arrays
> -
>
> Key: ARROW-2041
> URL: https://issues.apache.org/jira/browse/ARROW-2041
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Richard Shin
>Priority: Major
> Fix For: 0.10.0
>
>
> {code}
> Python 2.7.12 (default, Nov 20 2017, 18:23:56)
> [GCC 5.4.0 20160609] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pyarrow as pa, numpy as np
> >>> arrays = [np.arange(100, dtype=np.int32) for _ in range(1)]
> >>> with open('test.pyarrow', 'w') as f:
> ...     f.write(pa.serialize(arrays).to_buffer().to_pybytes())
> ...
> >>> import cPickle as pickle
> >>> pickle.dump(arrays, open('test.pkl', 'w'), pickle.HIGHEST_PROTOCOL)
> {code}
> test.pyarrow is 6.2 MB, while test.pkl is only 4.2 MB.
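The gap can be bounded with back-of-the-envelope arithmetic. The quoted snippet's `range(1)` is inconsistent with the reported file sizes, so 10,000 arrays of 100 int32 values (4 MB of raw payload, matching the ~4.2 MB pickle) is assumed here:

```python
# Raw payload under the stated assumption: 10,000 arrays x 100 int32 x 4 bytes.
n_arrays = 10_000
payload = n_arrays * 100 * 4          # 4,000,000 bytes of actual data
arrow_size = 6.2e6                    # reported pa.serialize output size
pickle_size = 4.2e6                   # reported cPickle output size

# Per-array framing overhead implied by each format.
arrow_overhead = (arrow_size - payload) / n_arrays
pickle_overhead = (pickle_size - payload) / n_arrays
print(round(arrow_overhead), round(pickle_overhead))  # 220 20
```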



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2176:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Extend DictionaryBuilder to support delta dictionaries
> 
>
> Key: ARROW-2176
> URL: https://issues.apache.org/jira/browse/ARROW-2176
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Dimitri Vorona
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] specifies a 
> possibility of sending additional dictionary batches with a previously seen 
> id and a isDelta flag to extend the existing dictionaries with new entries. 
> Right now, the DictionaryBuilder (as well as the IPC writer and reader) does 
> not support generation of delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder 
> with delta dictionaries support. The use API can be seen in the dictionary 
> tests (i.e. 
> [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]).
>  The basic idea is that the user just reuses the builder object after calling 
> Finish(Array*) for the first time. Subsequent calls to Append will create new 
> entries only for unseen elements and reuse ids from previous dictionaries 
> for the seen ones.
> Some considerations:
> # The API is pretty implicit; an additional flag for Finish that explicitly 
> indicates a desire to use the builder for delta dictionary generation might 
> be expedient from an error-avoidance point of view.
>  # Right now the implementation uses an additional "overflow dictionary" to 
> store the seen items. This adds a copy on each Finish call and an additional 
> lookup at each GetItem or Append call. I assume we might get away with 
> returning Array slices at Finish, which would remove the need for an 
> additional overflow dictionary. If the gist of the PR is approved, I can look 
> into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the 
> DictionaryBuilder API remains basically the same. 
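The reuse-after-Finish contract described above can be sketched in Python. This toy builder only models the id bookkeeping, not the actual Arrow arrays:

```python
class DeltaDictionaryBuilder:
    """Toy model of a delta-aware dictionary builder: ids persist across
    finish() calls, and each finish() emits only the newly seen values."""

    def __init__(self):
        self._ids = {}       # value -> dictionary id, kept across batches
        self._delta = []     # values first seen since the last finish()
        self._indices = []   # encoded indices for the current batch

    def append(self, value):
        if value not in self._ids:
            self._ids[value] = len(self._ids)
            self._delta.append(value)
        self._indices.append(self._ids[value])

    def finish(self):
        delta, indices = self._delta, self._indices
        self._delta, self._indices = [], []
        return delta, indices

b = DeltaDictionaryBuilder()
for v in ["a", "b", "a", "c"]:
    b.append(v)
print(b.finish())  # (['a', 'b', 'c'], [0, 1, 0, 2])
for v in ["c", "d"]:
    b.append(v)
print(b.finish())  # (['d'], [2, 3]) -- only 'd' is new, 'c' reuses id 2
```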



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2130) [Python] Support converting pandas.Timestamp in pyarrow.array

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2130:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Support converting pandas.Timestamp in pyarrow.array
> -
>
> Key: ARROW-2130
> URL: https://issues.apache.org/jira/browse/ARROW-2130
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> This is follow up work to ARROW-2106; since pandas.Timestamp supports 
> nanoseconds, this will require a slightly different code path
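The different code path is needed because `datetime.datetime` bottoms out at microseconds, while `pandas.Timestamp` carries sub-microsecond digits on top. A stdlib sketch of the exact-integer arithmetic such a conversion involves (the `extra_ns` remainder stands in for the nanosecond part a Timestamp would supply):

```python
import datetime

# A microsecond-resolution datetime plus the sub-microsecond remainder that
# a nanosecond timestamp (e.g. pandas.Timestamp) carries on top of it.
dt = datetime.datetime(2018, 2, 21, 12, 0, 0, 123456)
extra_ns = 789  # the digits datetime.datetime itself cannot represent

delta = dt - datetime.datetime(1970, 1, 1)
ns_since_epoch = ((delta.days * 86400 + delta.seconds) * 10**9
                  + delta.microseconds * 1000 + extra_ns)
print(ns_since_epoch % 1000)  # 789: the nanosecond remainder survives
```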



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas

2018-02-21 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-1731:
--

Assignee: Uwe L. Korn

> [Python] Provide for selecting a subset of columns to convert in 
> RecordBatch/Table.from_pandas
> --
>
> Key: ARROW-1731
> URL: https://issues.apache.org/jira/browse/ARROW-1731
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently it's all-or-nothing, and doing the subsetting in pandas incurs a 
> data copy. This would enable columns (by name or index) to be selected out 
> without additional data copying.
> cc [~cpcloud] [~jreback]
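The copy-avoidance argument can be seen with a dict of columns standing in for a DataFrame: selecting a subset of columns by name only needs new references, never a copy of the column data:

```python
# A dict of columns stands in for a DataFrame; a conversion-time column
# selection would pick like this, without copying the underlying data.
data = {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]}
wanted = ["a", "c"]

subset = {name: data[name] for name in wanted}
print(sorted(subset))            # ['a', 'c']
print(subset["a"] is data["a"])  # True: same list object, nothing was copied
```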



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2142:
--
Labels: pull-request-available  (was: )

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}
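The missing conversion decomposes naturally by field: each struct field becomes one flat column that can be converted on its own. A pure-Python sketch of that decomposition (no pyarrow or NumPy involved):

```python
# Rows of single-field struct values decomposed into one flat column per
# field -- the shape a struct-array converter has to produce.
dtype_fields = ["x"]
rows = [(1.5,), (2.5,), (3.5,)]

columns = {name: [row[i] for row in rows] for i, name in enumerate(dtype_fields)}
print(columns["x"])  # [1.5, 2.5, 3.5]
```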



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371721#comment-16371721
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou opened a new pull request #1635: ARROW-2142: [Python] Allow conversion 
from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2142:
-

Assignee: Antoine Pitrou

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371723#comment-16371723
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r169718158
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) {
   return PushArray(result->data());
 }
 
+Status NumPyConverter::Visit(const StructType& type) {
+  std::vector<NumPyConverter> sub_converters;
+  std::vector<OwnedRefNoGIL> sub_arrays;
+
+  {
+PyAcquireGIL gil_lock;
+
+// Create converters for each struct type field
+if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) {
+  return Status::TypeError("Expected struct array");
+}
+
+for (auto field : type.children()) {
+  PyObject* tup = PyDict_GetItemString(dtype_->fields, 
field->name().c_str());
+  if (tup == NULL) {
+std::stringstream ss;
+ss << "Missing field '" << field->name() << "' in struct array";
+return Status::TypeError(ss.str());
+  }
+  PyArray_Descr* sub_dtype = 
reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0));
+  DCHECK(PyArray_DescrCheck(sub_dtype));
+  int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1)));
+  RETURN_IF_PYERROR();
+  Py_INCREF(sub_dtype);  /* PyArray_GetField() steals ref */
+  PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset);
+  RETURN_IF_PYERROR();
+  sub_arrays.emplace_back(sub_array);
+  sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */,
+  field->type(), use_pandas_null_sentinels_);
+}
+  }
+
+  std::vector<ArrayVector> groups;
+
+  // Compute null bitmap and store it as a Null Array to include it
+  // in the rechunking below
+  {
+int64_t null_count = 0;
+if (mask_ != nullptr) {
+  RETURN_NOT_OK(InitNullBitmap());
+  null_count = MaskToBitmap(mask_, length_, null_bitmap_data_);
+}
+auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_,
+ {null_bitmap_}, null_count, 0);
 
 Review comment:
   Note this is a bit of hack, since typically null arrays don't have an 
underlying buffer at all.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371726#comment-16371726
 ] 

Antoine Pitrou commented on ARROW-2142:
---

I ended up applying your suggestion using array vectors rather than chunked 
arrays (see attached PR).

> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement <struct<x: float>> conversion.
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1963) [Python] Create Array from sequence of numpy.datetime64

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1963:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Create Array from sequence of numpy.datetime64
> ---
>
> Key: ARROW-1963
> URL: https://issues.apache.org/jira/browse/ARROW-1963
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> Currently we only support {{datetime.datetime}} and {{datetime.date}} but 
> {{numpy.datetime64}} also occurs quite often in the numpy/pandas-related 
> world.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1848:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Add documentation examples for reading single Parquet files and 
> datasets from HDFS
> ---
>
> Key: ARROW-1848
> URL: https://issues.apache.org/jira/browse/ARROW-1848
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> see 
> https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1380) [C++] Fix "still reachable" valgrind warnings in Plasma Python unit tests

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1380:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Fix "still reachable" valgrind warnings in Plasma Python unit tests
> -
>
> Key: ARROW-1380
> URL: https://issues.apache.org/jira/browse/ARROW-1380
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> I thought I fixed this, but they seem to have recurred:
> https://travis-ci.org/apache/arrow/jobs/266421430#L5220



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2132:
---

Assignee: Wes McKinney

> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2194) Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2194:
-

 Summary: Pandas columns metadata incorrect for empty string columns
 Key: ARROW-2194
 URL: https://issues.apache.org/jira/browse/ARROW-2194
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Florian Jetter


The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
DataFrame is unexpectedly {{float64}}

 
{code}
import numpy as np
import pandas as pd
import pyarrow as pa
import json

empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
np.array([], dtype=np.bytes_)})
empty_table = pa.Table.from_pandas(empty_df)
json.loads(empty_table.schema.metadata[b'pandas'])['columns']

# Same behavior for input dtype np.unicode_
[{u'field_name': u'bytes',
  u'metadata': None,
  u'name': u'bytes',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'unicode',
  u'metadata': None,
  u'name': u'unicode',
  u'numpy_type': u'object',
  u'pandas_type': u'float64'},
 {u'field_name': u'__index_level_0__',
  u'metadata': None,
  u'name': None,
  u'numpy_type': u'int64',
  u'pandas_type': u'int64'}]{code}
 

Tested on Debian 8 with python2.7 and python 3.6.4





[jira] [Updated] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2132:
--
Labels: pull-request-available  (was: )

> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371740#comment-16371740
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

wesm opened a new pull request #1636: ARROW-2132: Add link to Plasma in main 
README
URL: https://github.com/apache/arrow/pull/1636
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Assigned] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2093:
---

Assignee: Wes McKinney

> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Updated] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2093:
--
Labels: pull-request-available  (was: )

> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Assigned] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2180:
---

Assignee: Wes McKinney

> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2093) [Python] Possibly do not test pytorch serialization in Travis CI

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371748#comment-16371748
 ] 

ASF GitHub Bot commented on ARROW-2093:
---

wesm opened a new pull request #1637: ARROW-2093: [Python] Do not install 
PyTorch in Travis CI
URL: https://github.com/apache/arrow/pull/1637
 
 
   




> [Python] Possibly do not test pytorch serialization in Travis CI
> 
>
> Key: ARROW-2093
> URL: https://issues.apache.org/jira/browse/ARROW-2093
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I am not sure it is worth downloading ~400MB in binaries
> {code}
> The following packages will be downloaded:
> package|build
> ---|-
> libgcc-5.2.0   |0 1.1 MB  defaults
> pillow-5.0.0   |   py27_0 958 KB  conda-forge
> libtiff-4.0.9  |0 511 KB  conda-forge
> libtorch-0.1.12|  nomkl_0 1.7 MB  defaults
> olefile-0.44   |   py27_0  50 KB  conda-forge
> torchvision-0.1.9  |   py27hdb88a65_1  86 KB  soumith
> openblas-0.2.19|214.1 MB  conda-forge
> numpy-1.13.1   |py27_blas_openblas_200 8.4 MB  
> conda-forge
> pytorch-0.2.0  |py27ha262b23_4cu75   312.2 MB  soumith
> mkl-2017.0.3   |0   129.5 MB  defaults
> 
>Total:   468.6 MB
> {code}
> Follow up from ARROW-2071 https://github.com/apache/arrow/pull/1561





[jira] [Updated] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2180:
--
Labels: pull-request-available  (was: )

> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371758#comment-16371758
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm opened a new pull request #1638: ARROW-2180: [C++] Remove deprecated APIs 
from 0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638
 
 
   




> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371763#comment-16371763
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367414689
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.101




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Updated] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2185:
--
Labels: pull-request-available  (was: )

> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way
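The proposed stripping can be sketched in a few lines of Python. This is illustrative only: the directive list, the regex, and the function name {{strip_ci_directives}} are assumptions, not the merge tool's actual code.

```python
import re

# Matches CI control directives such as "[skip ci]", "[skip appveyor]"
# or "[force travis]" (directive list assumed for illustration).
CI_DIRECTIVE = re.compile(r'\[(?:skip|force)\s+(?:ci|travis|appveyor)\]',
                          re.IGNORECASE)

def strip_ci_directives(message: str) -> str:
    """Remove CI directives from a squashed commit message."""
    cleaned = CI_DIRECTIVE.sub('', message)
    # Collapse doubled spaces left behind by the removal.
    return re.sub(r'  +', ' ', cleaned).strip()

print(strip_ci_directives("ARROW-2185: fix build [skip appveyor]"))
# -> ARROW-2185: fix build
```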





[jira] [Commented] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371768#comment-16371768
 ] 

ASF GitHub Bot commented on ARROW-2185:
---

wesm opened a new pull request #1639: ARROW-2185: Strip CI directives from 
commit messages
URL: https://github.com/apache/arrow/pull/1639
 
 
   




> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way





[jira] [Assigned] (ARROW-2185) Remove CI directives from squashed commit messages

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2185:
---

Assignee: Wes McKinney

> Remove CI directives from squashed commit messages
> --
>
> Key: ARROW-2185
> URL: https://issues.apache.org/jira/browse/ARROW-2185
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In our PR squash tool, we are potentially picking up CI directives like 
> {{[skip appveyor]}} from intermediate commits. We should regex these away and 
> instead use directives in the PR title if we wish the commit to master to 
> behave in a certain way





[jira] [Updated] (ARROW-1994) [Python] Test against Pandas master

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1994:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Test against Pandas master
> ---
>
> Key: ARROW-1994
> URL: https://issues.apache.org/jira/browse/ARROW-1994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We have recently seen a lot of breakage with Pandas master. This is an 
> annoyance to our users: it should break in our builds instead of in their 
> pipelines. There is no need to add another entry to the matrix; one of the 
> existing entries can simply re-run the tests against Pandas master after 
> they pass.





[jira] [Commented] (ARROW-1994) [Python] Test against Pandas master

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371769#comment-16371769
 ] 

Wes McKinney commented on ARROW-1994:
-

This would be nice to have. Are there nightly pandas conda builds we could use? 
Otherwise this will increase our build times too much
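One way the proposed re-run could be wired into an existing matrix entry, sketched as a {{.travis.yml}} fragment. This is hypothetical: the hook and the test path are assumptions, and installing pandas from source would itself add build time, which is exactly the concern raised above.

```yaml
# Hypothetical fragment: in one existing matrix entry, once the regular
# suite has passed, reinstall pandas from its development branch and
# re-run the pandas-specific pyarrow tests.
after_success:
  - pip install --upgrade git+https://github.com/pandas-dev/pandas.git
  - python -m pytest python/pyarrow/tests/test_pandas.py
```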

> [Python] Test against Pandas master
> ---
>
> Key: ARROW-1994
> URL: https://issues.apache.org/jira/browse/ARROW-1994
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
> Fix For: 0.10.0
>
>
> We have recently seen a lot of breakage with Pandas master. This is an 
> annoyance to our users: it should break in our builds instead of in their 
> pipelines. There is no need to add another entry to the matrix; one of the 
> existing entries can simply re-run the tests against Pandas master after 
> they pass.





[jira] [Updated] (ARROW-1887) [Python] More efficient serialization of pandas Index types in custom serialization from ARROW-1784

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1887:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] More efficient serialization of pandas Index types in custom 
> serialization from ARROW-1784
> ---
>
> Key: ARROW-1887
> URL: https://issues.apache.org/jira/browse/ARROW-1887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Resolved] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2192.
-
Resolution: Fixed

Issue resolved by pull request 1634
[https://github.com/apache/arrow/pull/1634]

> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> After ARROW-2083, we are only running builds related to changed components 
> with each patch in Travis CI and Appveyor. 
> The problem with this is that when we merge patches to master, our Travis CI 
> configuration (implemented by ASF infra to help alleviate clogged up build 
> queues) is set up to cancel in-progress builds whenever a new commit is 
> merged.
> So basically we could have in our timeline:
> * Patch merged affecting C++, Python
> * Patch merged affecting Java
> * Patch merged affecting JS
> So when the Java patch is merged, any in-progress C++/Python builds will be 
> cancelled. And if the JS patch comes in, the Java builds would be immediately 
> cancelled.
> In light of this I believe on master branch we should always run all of the 
> builds unconditionally





[jira] [Commented] (ARROW-2192) Commits to master should run all builds in CI matrix

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371771#comment-16371771
 ] 

ASF GitHub Bot commented on ARROW-2192:
---

wesm closed pull request #1634: ARROW-2192: [CI] Always build on master branch 
and repository
URL: https://github.com/apache/arrow/pull/1634
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ci/travis_detect_changes.py b/ci/travis_detect_changes.py
index 2aeb34fa0..d60b13227 100644
--- a/ci/travis_detect_changes.py
+++ b/ci/travis_detect_changes.py
@@ -147,19 +147,25 @@ def get_unix_shell_eval(env):
 
 
 def run_from_travis():
-    desc = get_travis_commit_description()
-    if '[skip travis]' in desc:
-        # Skip everything
-        affected = dict.fromkeys(ALL_TOPICS, False)
-    elif '[force ci]' in desc or '[force travis]' in desc:
-        # Test everything
+    if (os.environ['TRAVIS_REPO_SLUG'] == 'apache/arrow' and
+            os.environ['TRAVIS_BRANCH'] == 'master' and
+            os.environ['TRAVIS_EVENT_TYPE'] != 'pull_request'):
+        # Never skip anything on master builds in the official repository
         affected = dict.fromkeys(ALL_TOPICS, True)
     else:
-        # Test affected topics
-        affected_files = list_travis_affected_files()
-        perr("Affected files:", affected_files)
-        affected = get_affected_topics(affected_files)
-        assert set(affected) <= set(ALL_TOPICS), affected
+        desc = get_travis_commit_description()
+        if '[skip travis]' in desc:
+            # Skip everything
+            affected = dict.fromkeys(ALL_TOPICS, False)
+        elif '[force ci]' in desc or '[force travis]' in desc:
+            # Test everything
+            affected = dict.fromkeys(ALL_TOPICS, True)
+        else:
+            # Test affected topics
+            affected_files = list_travis_affected_files()
+            perr("Affected files:", affected_files)
+            affected = get_affected_topics(affected_files)
+            assert set(affected) <= set(ALL_TOPICS), affected
 
     perr("Affected topics:")
     perr(pprint.pformat(affected))


 




> Commits to master should run all builds in CI matrix
> 
>
> Key: ARROW-2192
> URL: https://issues.apache.org/jira/browse/ARROW-2192
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> After ARROW-2083, we are only running builds related to changed components 
> with each patch in Travis CI and Appveyor. 
> The problem with this is that when we merge patches to master, our Travis CI 
> configuration (implemented by ASF infra to help alleviate clogged up build 
> queues) is set up to cancel in-progress builds whenever a new commit is 
> merged.
> So basically we could have in our timeline:
> * Patch merged affecting C++, Python
> * Patch merged affecting Java
> * Patch merged affecting JS
> So when the Java patch is merged, any in-progress C++/Python builds will be 
> cancelled. And if the JS patch comes in, the Java builds would be immediately 
> cancelled.
> In light of this I believe on master branch we should always run all of the 
> builds unconditionally





[jira] [Updated] (ARROW-2194) Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2194:

Fix Version/s: 0.9.0

> Pandas columns metadata incorrect for empty string columns
> --
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4





[jira] [Commented] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371838#comment-16371838
 ] 

ASF GitHub Bot commented on ARROW-2162:
---

cpcloud closed pull request #1619: ARROW-2162: [Python/C++] Decimal Values with 
too-high precision are multiplied by 100
URL: https://github.com/apache/arrow/pull/1619
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc
index a2b832bdb..b76caaece 100644
--- a/cpp/src/arrow/python/python-test.cc
+++ b/cpp/src/arrow/python/python-test.cc
@@ -201,5 +201,45 @@ TEST(BuiltinConversionTest, TestMixedTypeFails) {
   ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, &arr));
 }
 
+TEST_F(DecimalTest, FromPythonDecimalRescaleNotTruncateable) {
+  // We fail when truncating values that would lose data if cast to a
+  // decimal type with lower scale
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.001"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, internal::DecimalFromPythonDecimal(python_decimal.obj(),
+                                                            decimal_type, &value));
+}
+
+TEST_F(DecimalTest, FromPythonDecimalRescaleTruncateable) {
+  // We allow truncation of values that do not lose precision when dividing by 
10 * the
+  // difference between the scales, e.g., 1.000 -> 1.00
+  Decimal128 value;
+  OwnedRef python_decimal(this->CreatePythonDecimal("1.000"));
+  auto type = ::arrow::decimal(10, 2);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_OK(internal::DecimalFromPythonDecimal(python_decimal.obj(),
+                                               decimal_type, &value));
+  ASSERT_EQ(100, value.low_bits());
+}
+
+TEST_F(DecimalTest, TestOverflowFails) {
+  Decimal128 value;
+  int32_t precision;
+  int32_t scale;
+  OwnedRef python_decimal(
+  this->CreatePythonDecimal("9.9"));
+  ASSERT_OK(internal::InferDecimalPrecisionAndScale(python_decimal.obj(),
+                                                    &precision, &scale));
+  ASSERT_EQ(38, precision);
+  ASSERT_EQ(1, scale);
+
+  auto type = ::arrow::decimal(38, 38);
+  const auto& decimal_type = static_cast<const DecimalType&>(*type);
+  ASSERT_RAISES(Invalid, internal::DecimalFromPythonDecimal(python_decimal.obj(),
+                                                            decimal_type, &value));
+}
+
 }  // namespace py
 }  // namespace arrow
diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc
index e999854b1..a3c8cda76 100644
--- a/cpp/src/arrow/util/decimal.cc
+++ b/cpp/src/arrow/util/decimal.cc
@@ -854,26 +854,46 @@ static const Decimal128 ScaleMultipliers[] = {
 Decimal128("10"),
 Decimal128("100")};
 
+static bool RescaleWouldCauseDataLoss(const Decimal128& value, int32_t delta_scale,
+                                      int32_t abs_delta_scale, Decimal128* result) {
+  Decimal128 multiplier(ScaleMultipliers[abs_delta_scale]);
+
+  if (delta_scale < 0) {
+    DCHECK_NE(multiplier, 0);
+    Decimal128 remainder;
+    Status status = value.Divide(multiplier, result, &remainder);
+    DCHECK(status.ok()) << status.message();
+    return remainder != 0;
+  }
+
+  *result = value * multiplier;
+  return *result < value;
+}
+
 Status Decimal128::Rescale(int32_t original_scale, int32_t new_scale,
Decimal128* out) const {
-  DCHECK_NE(out, NULLPTR);
-  DCHECK_NE(original_scale, new_scale);
-  const int32_t delta_scale = original_scale - new_scale;
+  DCHECK_NE(out, NULLPTR) << "out is nullptr";
+  DCHECK_NE(original_scale, new_scale) << "original_scale != new_scale";
+
+  const int32_t delta_scale = new_scale - original_scale;
   const int32_t abs_delta_scale = std::abs(delta_scale);
+
   DCHECK_GE(abs_delta_scale, 1);
   DCHECK_LE(abs_delta_scale, 38);
 
-  const Decimal128 scale_multiplier = ScaleMultipliers[abs_delta_scale];
-  const Decimal128 result = *this * scale_multiplier;
+  Decimal128 result(*this);
+  const bool rescale_would_cause_data_loss =
+      RescaleWouldCauseDataLoss(result, delta_scale, abs_delta_scale, out);
 
-  if (ARROW_PREDICT_FALSE(result < *this)) {
+  // Fail if we overflow or truncate
+  if (ARROW_PREDICT_FALSE(rescale_would_cause_data_loss)) {
 std::stringstream buf;
-buf << "Rescaling decimal value from original scale " << original_scale
-<< " to new scale " << new_scale << " would cause overflow";
+buf << "Rescaling decimal value " << ToString(original_scale)
+<< " from original scale of " << original_scale << " to new scale o

[jira] [Resolved] (ARROW-2162) [Python/C++] Decimal Values with too-high precision are multiplied by 100

2018-02-21 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud resolved ARROW-2162.
--
Resolution: Fixed

Issue resolved by pull request 1619
[https://github.com/apache/arrow/pull/1619]

> [Python/C++] Decimal Values with too-high precision are multiplied by 100
> -
>
> Key: ARROW-2162
> URL: https://issues.apache.org/jira/browse/ARROW-2162
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> From GitHub:
> This works as expected:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.23')], pyarrow.decimal128(10,2))[0]
> Decimal('1.23')
> {code}
> Storing an extra digit of precision multiplies the stored value by a factor 
> of 100:
> {code}
> >>> pyarrow.array([decimal.Decimal('1.234')], pyarrow.decimal128(10,2))[0]
> Decimal('123.40')
> {code}
> Ideally I would get an exception since the value I'm trying to store doesn't 
> fit in the declared type of the array. It would be less good, but still ok, 
> if the stored value were 1.23 (truncating the extra digit). I didn't expect 
> pyarrow to silently store a value that differs from the original value by a 
> factor of 100.
> I originally thought that the code was incorrectly multiplying through by an 
> extra factor of 10**scale, but that doesn't seem to be the case. If I change 
> the scale, it always seems to be a factor of 100
> {code}
> >>> pyarrow.array([decimal.Decimal('1.2345')], pyarrow.decimal128(10,3))[0]
> Decimal('123.450')
> {code}
> I see the same behavior if I use floating point to initialize the array 
> rather than Python's decimal type.
> I searched for open github and JIRA for open issues but didn't find anything 
> related to this. I am using pyarrow 0.8.0 on OS X 10.12.6 using python 2.7.14 
> installed via Homebrew
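The rule the fix enforces can be stated compactly: rescaling down divides the unscaled integer by a power of ten and must leave no remainder; rescaling up multiplies it and, for a fixed-width type like Decimal128, must not overflow. A minimal Python sketch of the truncation side (illustrative only, not the actual C++; Python integers cannot overflow, so that half of the check is omitted):

```python
def rescale(value: int, original_scale: int, new_scale: int) -> int:
    """Rescale an unscaled decimal integer, refusing to drop nonzero digits."""
    delta = new_scale - original_scale
    multiplier = 10 ** abs(delta)
    if delta < 0:
        quotient, remainder = divmod(value, multiplier)
        if remainder != 0:
            raise ValueError("rescaling would truncate nonzero digits")
        return quotient
    # Scaling up: a real 128-bit implementation must also check for overflow.
    return value * multiplier

print(rescale(1000, 3, 2))   # 1.000 -> 1.00, prints 100
```

With this rule, storing Decimal('1.234') into decimal128(10, 2) raises instead of silently producing a value off by a factor of 100.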





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371820#comment-16371820
 ] 

Antoine Pitrou commented on ARROW-2193:
---

Ok, this is because I recently switched from gcc-4.9 to clang-5.0. With gcc, 
plasma_store doesn't have a runtime dependency on boost:
{code:bash}
$ ldd `which plasma_store`
linux-vdso.so.1 =>  (0x7ffc8b318000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7fdc79bbe000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7fdc7983c000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7fdc79533000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7fdc7931d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7fdc78f53000)
/lib64/ld-linux-x86-64.so.2 (0x7fdc79ddb000)
{code}

But with clang I get:
{code:bash}
$ ldd `which plasma_store`
linux-vdso.so.1 =>  (0x7fff21ba4000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f0d04d5d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f0d04b4)
libboost_system.so.1.66.0 => not found
libboost_filesystem.so.1.66.0 => not found
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 
(0x7f0d047be000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f0d044b5000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 
(0x7f0d0429f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f0d03ed5000)
/lib64/ld-linux-x86-64.so.2 (0x7f0d04f65000)
{code}
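
The difference between the two outputs boils down to the lines ldd marks as "not found"; those can be picked out mechanically, e.g. (helper name is ours):

```python
def missing_deps(ldd_output):
    """Return the sonames that ldd reported as 'not found'."""
    return [line.split("=>")[0].strip()
            for line in ldd_output.splitlines()
            if "not found" in line]
```

Applied to the clang build's output above, this yields the two missing boost libraries.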

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Updated] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2193:

Fix Version/s: 0.9.0

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371938#comment-16371938
 ] 

Wes McKinney commented on ARROW-2193:
-

OK, this seems buggy. I marked it for 0.9.0.

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Commented] (ARROW-1937) [Python] Add documentation for different forms of constructing nested arrays from Python data structures

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371960#comment-16371960
 ] 

Wes McKinney commented on ARROW-1937:
-

Since we have done a bunch of work on this for 0.9.0, it would be a real shame 
to not have documentation showcasing the results. I'm leaving this on 0.9.0

> [Python] Add documentation for different forms of constructing nested arrays 
> from Python data structures 
> -
>
> Key: ARROW-1937
> URL: https://issues.apache.org/jira/browse/ARROW-1937
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-971) [C++/Python] Implement Array.isvalid/notnull/isnull

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-971:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++/Python] Implement Array.isvalid/notnull/isnull
> ---
>
> Key: ARROW-971
> URL: https://issues.apache.org/jira/browse/ARROW-971
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Licht Takeuchi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> For arrays with nulls, this amounts to returning the validity bitmap. Without 
> nulls, an array of all 1 bits must be constructed. For isnull, the bits must 
> be flipped (in this case, the un-set part of the new bitmap must stay 0, 
> though).
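
The flip described above can be sketched over Arrow's LSB-ordered validity bytes in plain Python (an illustration of the bitmap logic, not the C++ implementation):

```python
def isnull_bitmap(validity, length):
    """Invert an LSB-ordered validity bitmap for `length` slots,
    keeping the unused trailing bits of the last byte set to 0."""
    out = bytearray((~b) & 0xFF for b in validity)
    extra = len(validity) * 8 - length  # padding bits past `length`
    if extra:
        out[-1] &= (1 << (8 - extra)) - 1  # zero the padding bits again
    return bytes(out)
```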





[jira] [Updated] (ARROW-2121) [Python] Consider special casing object arrays in pandas serializers.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2121:

Summary: [Python] Consider special casing object arrays in pandas 
serializers.  (was: Consider special casing object arrays in pandas 
serializers.)

> [Python] Consider special casing object arrays in pandas serializers.
> -
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Assigned] (ARROW-2024) Remove global SerializationContext variables.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2024:
---

Assignee: Robert Nishihara

> Remove global SerializationContext variables.
> -
>
> Key: ARROW-2024
> URL: https://issues.apache.org/jira/browse/ARROW-2024
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We should get rid of the global variables 
> _default_serialization_context and 
> pandas_serialization_context and replace them with functions 
> default_serialization_context() and 
> pandas_serialization_context().
> This will also make it faster to do import pyarrow.
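
The replacement the issue describes is the usual lazy-initialization pattern; a minimal sketch (the function name mirrors the issue text, and the stand-in object is a placeholder for a real SerializationContext):

```python
_default_context = None

def default_serialization_context():
    """Build the default context lazily on first call instead of at
    import time, so `import pyarrow` stays cheap (illustrative sketch)."""
    global _default_context
    if _default_context is None:
        _default_context = object()  # stand-in for a real SerializationContext
    return _default_context
```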





[jira] [Updated] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2195:

Fix Version/s: 0.9.0

> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> Key: ARROW-2195
> URL: https://issues.apache.org/jira/browse/ARROW-2195
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
> Fix For: 0.9.0
>
>
> It can be reproduced with the following script:
> {code:python}
> import pyarrow as pa
> import pyarrow.plasma as plasma
> def retrieve1():
>     client = plasma.connect('test', "", 0)
>     key = "keynumber1keynumber1"
>     pid = plasma.ObjectID(bytearray(key, 'UTF-8'))
>     [buff] = client.get_buffers([pid])
>     batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>     print(batch)
>     print(batch.schema)
>     print(batch[0])
>     return batch
> client = plasma.connect('test', "", 0)
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
> bff = client.create(pid, sink.size())
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
> {code}
>  
> Preliminary backtrace:
>  
> {code}
> EXC_BAD_ACCESS (code=1, address=0x38158)
>     frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:
> ->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
>     0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
> PyInt_FromLong
>     0x10e645805 <+37>: testq  %rax, %rax
>     0x10e645808 <+40>: je     0x10e64580c               ; <+44>
> (lldb) bt
>  * thread #1: tid = 0xf1378e, 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
> queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
> address=0x38158)
>   * frame #0: 0x00010e6457fc 
> lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
>     frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 
> 133
>     frame #2: 0x00010e613b25 
> lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
>     frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
>     frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305
> {code}





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371875#comment-16371875
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367435997
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.102


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Deleted] (ARROW-1645) Access HDFS with read_table() automatically

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney deleted ARROW-1645:



> Access HDFS with read_table() automatically
> ---
>
> Key: ARROW-1645
> URL: https://issues.apache.org/jira/browse/ARROW-1645
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ehsan Totoni
>Priority: Major
>
> It'd be great to support accessing HDFS automatically, like 
> `pq.read_table('hdfs://example.parquet')`





[jira] [Commented] (ARROW-2132) [Doc] Add links / mentions of Plasma store to main README

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371953#comment-16371953
 ] 

ASF GitHub Bot commented on ARROW-2132:
---

wesm commented on issue #1636: ARROW-2132: Add link to Plasma in main README
URL: https://github.com/apache/arrow/pull/1636#issuecomment-367459086
 
 
   @robertnishihara @pcmoritz could you review language and tweak as desired? 
(feel free to push to this branch)




> [Doc] Add links / mentions of Plasma store to main README
> -
>
> Key: ARROW-2132
> URL: https://issues.apache.org/jira/browse/ARROW-2132
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This should be listed as separate from, but noted as a part of, the C++ 
> implementation





[jira] [Updated] (ARROW-1744) [Plasma] Provide TensorFlow operator to read tensors from plasma

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1744:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Plasma] Provide TensorFlow operator to read tensors from plasma
> 
>
> Key: ARROW-1744
> URL: https://issues.apache.org/jira/browse/ARROW-1744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> see https://www.tensorflow.org/extend/adding_an_op





[jira] [Updated] (ARROW-2024) [Python] Remove global SerializationContext variables

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2024:

Summary: [Python] Remove global SerializationContext variables  (was: 
Remove global SerializationContext variables.)

> [Python] Remove global SerializationContext variables
> -
>
> Key: ARROW-2024
> URL: https://issues.apache.org/jira/browse/ARROW-2024
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We should get rid of the global variables 
> _default_serialization_context and 
> pandas_serialization_context and replace them with functions 
> default_serialization_context() and 
> pandas_serialization_context().
> This will also make it faster to do import pyarrow.





[jira] [Updated] (ARROW-1171) C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and self-compiled turbodbc

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1171:

Fix Version/s: (was: 0.9.0)
   0.10.0

> C++: Segmentation faults on Fedora 24 with pyarrow-manylinux1 and 
> self-compiled turbodbc
> 
>
> Key: ARROW-1171
> URL: https://issues.apache.org/jira/browse/ARROW-1171
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.4.1
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Original issue: https://github.com/blue-yonder/turbodbc/issues/102
> When using the {{pyarrow}} {{manylinux1}} Wheels to build Turbodbc on Fedora 
> 24, the {{turbodbc_arrow}} unittests segfault. The main environment attribute 
> here is that the compiler version used for building Turbodbc is newer than 
> the one used for Arrow.





[jira] [Commented] (ARROW-2193) [Plasma] plasma_store forks endlessly

2018-02-21 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371973#comment-16371973
 ] 

Robert Nishihara commented on ARROW-2193:
-

Do you know that {{fork}} is being called? Another way this could happen is if 
the tests fail to kill the plasma store and leave a bunch of them running.
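
If the cause is the latter, the usual fix is to guarantee cleanup in the tests themselves; a hypothetical helper (not pyarrow API) that terminates the store process even when the test body raises:

```python
import contextlib
import subprocess

@contextlib.contextmanager
def managed_store(cmd):
    """Start a store process and guarantee it is terminated on exit,
    so failed tests cannot leave orphan plasma_store processes behind."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        if proc.poll() is None:  # still running: stop it
            proc.terminate()
        proc.wait()
```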

> [Plasma] plasma_store forks endlessly
> -
>
> Key: ARROW-2193
> URL: https://issues.apache.org/jira/browse/ARROW-2193
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++)
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> I'm not sure why, but when I run the pyarrow test suite (for example 
> {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly:
> {code:bash}
>  $ ps fuwww
> USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
> [...]
> antoine  27869 12.0  0.4 863208 68976 pts/7S13:41   0:01 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27885 13.0  0.4 863076 68560 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27901 12.1  0.4 863076 68320 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> antoine  27920 13.6  0.4 863208 68868 pts/7S13:41   0:01  \_ 
> /home/antoine/miniconda3/envs/pyarrow/bin/python 
> /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 
> -m 1
> [etc.]
> {code}





[jira] [Assigned] (ARROW-2029) [Python] Program crash on `HdfsFile.tell` if file is closed

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2029:
---

Assignee: Jim Crist

> [Python] Program crash on `HdfsFile.tell` if file is closed
> ---
>
> Key: ARROW-2029
> URL: https://issues.apache.org/jira/browse/ARROW-2029
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Jim Crist
>Assignee: Jim Crist
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Of all the `NativeFile` methods, `tell` is the only one that doesn't check if 
> the file is still open before running. This can lead to crashes when using 
> hdfs:
>  
> {code:java}
> >>> import pyarrow as pa
> >>> h = pa.hdfs.connect()
> 18/01/24 22:31:35 WARN util.NativeCodeLoader: Unable to load native-hadoop 
> library for your platform... using builtin-java classes where applicable
> 18/01/24 22:31:36 WARN shortcircuit.DomainSocketFactory: The short-circuit 
> local reads feature cannot be used because libhadoop cannot be loaded.
> >>> with h.open("/tmp/test.txt", mode='wb') as f:
> ...     pass
> ...
> >>> f.tell()
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f52ccb6733d, pid=14868, tid=0x7f52de2b9700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 
> 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x67c33d]
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /working/python/hs_err_pid14868.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> Aborted
> {code}
> In python, most file-like objects raise a `ValueError` if the file is closed:
> {code:java}
> >>> f = open("test.py", mode='wb')
> >>> f.close()
> >>> f.tell()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: I/O operation on closed file
> >>> import io
> >>> buf = io.BytesIO()
> >>> buf.close()
> >>> buf.tell()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> ValueError: I/O operation on closed file.{code}
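
The missing guard amounts to a few lines; a minimal sketch of the pattern (class and method names are illustrative, not the real NativeFile):

```python
class FileSketch:
    """Minimal sketch of the missing check: like built-in Python file
    objects, every method (including tell) should fail cleanly once closed."""
    def __init__(self):
        self.closed = False
        self._pos = 0

    def close(self):
        self.closed = True

    def _assert_open(self):
        if self.closed:
            raise ValueError("I/O operation on closed file")

    def tell(self):
        self._assert_open()  # the guard HdfsFile.tell lacks
        return self._pos
```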





[jira] [Created] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-2195:
-

 Summary: [Plasma] Segfault when retrieving RecordBatch from plasma 
store
 Key: ARROW-2195
 URL: https://issues.apache.org/jira/browse/ARROW-2195
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


It can be reproduced with the following script:

```
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

```

 

Preliminary backtrace:

 

```

EXC_BAD_ACCESS (code=1, address=0x38158)

    frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
PyInt_FromLong

    0x10e645805 <+37>: testq  %rax, %rax

    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt

* thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

  * frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133

    frame #2: 0x00010e613b25 
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

```





[jira] [Assigned] (ARROW-2168) [C++] Build toolchain builds with jemalloc

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2168:
---

Assignee: Uwe L. Korn

> [C++] Build toolchain builds with jemalloc
> --
>
> Key: ARROW-2168
> URL: https://issues.apache.org/jira/browse/ARROW-2168
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> We have fixed all known problems in the jemalloc 4.x branch and should be 
> able to gradually reactivate it in our builds to get its performance boost.





[jira] [Assigned] (ARROW-2121) [Python] Consider special casing object arrays in pandas serializers.

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2121:
---

Assignee: Robert Nishihara

> [Python] Consider special casing object arrays in pandas serializers.
> -
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Updated] (ARROW-1621) [JAVA] Reduce Heap Usage per Vector

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1621:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [JAVA] Reduce Heap Usage per Vector
> ---
>
> Key: ARROW-1621
> URL: https://issues.apache.org/jira/browse/ARROW-1621
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Memory, Java - Vectors
>Reporter: Siddharth Teotia
>Assignee: Siddharth Teotia
>Priority: Major
> Fix For: 0.10.0
>
>
> https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit





[jira] [Assigned] (ARROW-1394) [Plasma] Add optional extension for allocating memory on GPUs

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-1394:
---

Assignee: William Paul

> [Plasma] Add optional extension for allocating memory on GPUs
> -
>
> Key: ARROW-1394
> URL: https://issues.apache.org/jira/browse/ARROW-1394
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Wes McKinney
>Assignee: William Paul
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> It would be useful to be able to allocate memory to be shared between 
> processes via Plasma using the CUDA IPC API





[jira] [Commented] (ARROW-1463) [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated code

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371949#comment-16371949
 ] 

Wes McKinney commented on ARROW-1463:
-

Where does this work stand?

> [JAVA] Restructure ValueVector hierarchy to minimize compile-time generated 
> code
> 
>
> Key: ARROW-1463
> URL: https://issues.apache.org/jira/browse/ARROW-1463
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Jacques Nadeau
>Assignee: Siddharth Teotia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The templates used in the java package are very high maintenance and the if 
> conditions are hard to track. As started in the discussion here: 
> https://github.com/apache/arrow/pull/1012, I'd like to propose that we modify 
> the structure of the internal value vectors and code generation dynamics.
> Create new abstract base vectors:
> BaseFixedVector
> BaseVariableVector
> BaseNullableVector
> For each of these, implement all the basic functionality of a vector without 
> using templating.
> Evaluate whether to use code generation to generate specific specializations 
> of this functionality for each type where needed for performance purposes 
> (probably constrained to mutator and accessor set/get methods). Giant and 
> complex if conditions in the templates are actually worse from my perspective 
> than a small amount of hand written duplicated code since templates are much 
> harder to work with. 





[jira] [Updated] (ARROW-549) [C++] Add function to concatenate like-typed arrays

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-549:
---
Fix Version/s: (was: 0.9.0)
   0.10.0

> [C++] Add function to concatenate like-typed arrays
> ---
>
> Key: ARROW-549
> URL: https://issues.apache.org/jira/browse/ARROW-549
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> A la 
> {{Status arrow::Concatenate(const std::vector<std::shared_ptr<Array>>& 
> arrays, MemoryPool* pool, std::shared_ptr<Array>* out)}}





[jira] [Assigned] (ARROW-2111) [C++] Linting could be faster

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2111:
---

Assignee: Antoine Pitrou

> [C++] Linting could be faster
> -
>
> Key: ARROW-2111
> URL: https://issues.apache.org/jira/browse/ARROW-2111
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Currently {{make lint}} style-checks C++ files sequentially (by calling 
> {{cpplint}}). We could instead style-check those files in parallel.
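The parallel approach could look like the minimal sketch below (hypothetical helper names, not the actual build-system change; it assumes a `cpplint` executable on PATH and uses a thread pool since each lint call is subprocess-bound):

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def lint_file(path):
    # Exit code of cpplint for a single file (0 means clean).
    return subprocess.call(["cpplint", path])

def lint_all(paths, lint=lint_file, jobs=8):
    # Style-check all files concurrently; report the worst exit code
    # so the overall lint target still fails if any file fails.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return max(pool.map(lint, paths), default=0)
```

The `lint` parameter is injectable purely so the fan-out logic can be exercised without cpplint installed.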





[jira] [Assigned] (ARROW-2128) [Python] Cannot serialize array of empty lists

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2128:
---

Assignee: Uwe L. Korn

> [Python] Cannot serialize array of empty lists
> --
>
> Key: ARROW-2128
> URL: https://issues.apache.org/jira/browse/ARROW-2128
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This currently failing:
> {code:java}
> data = pd.Series([[], [], []])
> arr = pa.array(data)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> array.pxi:181: in pyarrow.lib.array
> ???
> array.pxi:26: in pyarrow.lib._sequence_to_array
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> > ???
> E pyarrow.lib.ArrowTypeError: Unable to determine data type
> {code}
> The code in {{SeqVisitor::GetType}} suggests that we don't want to support 
> this, but I would have expected the above to result in {{List}}.





[jira] [Assigned] (ARROW-2036) NativeFile should support standard IOBase methods

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2036:
---

Assignee: Jim Crist

> NativeFile should support standard IOBase methods
> -
>
> Key: ARROW-2036
> URL: https://issues.apache.org/jira/browse/ARROW-2036
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Assignee: Jim Crist
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If `NativeFile` supported most/all of the standard IOBase methods 
> (https://docs.python.org/3/library/io.html#io.IOBase), it would be easier 
> to use Arrow files with other Python libraries. It would at least be nice to 
> support enough operations to use `io.TextIOWrapper`.





[jira] [Assigned] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread Panchen Xue (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Panchen Xue reassigned ARROW-2184:
--

Assignee: Panchen Xue

> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Panchen Xue
>Priority: Major
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will be decided on a case-by-case basis.





[jira] [Commented] (ARROW-2184) [C++] Add static ctor for FileOutputStream returning shared_ptr to base OutputStream

2018-02-21 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371963#comment-16371963
 ] 

Wes McKinney commented on ARROW-2184:
-

I think we should make a decision on whether to deprecate the existing ctors in 
Arrow 0.9.0 

> [C++] Add static ctor for FileOutputStream returning shared_ptr to base 
> OutputStream
> 
>
> Key: ARROW-2184
> URL: https://issues.apache.org/jira/browse/ARROW-2184
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> It would be useful for most IO ctors to return pointers to the base interface 
> that they implement rather than the subclass. Whether we deprecate the 
> current ones will be decided on a case-by-case basis.





[jira] [Assigned] (ARROW-2166) [GLib] Implement Slice for Column

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2166:
---

Assignee: yosuke shiro

> [GLib] Implement Slice for Column
> -
>
> Key: ARROW-2166
> URL: https://issues.apache.org/jira/browse/ARROW-2166
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: yosuke shiro
>Assignee: yosuke shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Add {{Slice}} api to Column.





[jira] [Updated] (ARROW-2195) [Plasma] Segfault when retrieving RecordBatch from plasma store

2018-02-21 Thread Philipp Moritz (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz updated ARROW-2195:
--
Description: 
It can be reproduced with the following script:

{code:python}
import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
    client = plasma.connect('test', "", 0)

    key = "keynumber1keynumber1"
    pid = plasma.ObjectID(bytearray(key, 'UTF-8'))

    [buff] = client.get_buffers([pid])
    batch = pa.RecordBatchStreamReader(buff).read_next_batch()

    print(batch)
    print(batch.schema)
    print(batch[0])

    return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])
{code}
 

Preliminary backtrace:

 

{code}

* thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

    frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi

    0x10e645800 <+32>: callq  0x10e698170               ; symbol stub for: 
PyInt_FromLong

    0x10e645805 <+37>: testq  %rax, %rax

    0x10e645808 <+40>: je     0x10e64580c               ; <+44>

(lldb) bt
 * thread #1: tid = 0xf1378e, 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, 
queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, 
address=0x38158)

  * frame #0: 0x00010e6457fc 
lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133

    frame #2: 0x00010e613b25 
lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933

    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60

    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

{code}



> [Plasma] Segfault when retrieving RecordBatch from plasma store
> ---
>
> 

[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371828#comment-16371828
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

pitrou commented on issue #1635: ARROW-2142: [Python] Allow conversion from 
Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#issuecomment-367414689
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.101


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Updated] (ARROW-1833) [Java] Add accessor methods for data buffers that skip null checking

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1833:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Java] Add accessor methods for data buffers that skip null checking
> 
>
> Key: ARROW-1833
> URL: https://issues.apache.org/jira/browse/ARROW-1833
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java - Vectors
>Reporter: Wes McKinney
>Assignee: Jingyuan Wang
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Commented] (ARROW-2180) [C++] Remove APIs deprecated in 0.8.0 release

2018-02-21 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371993#comment-16371993
 ] 

ASF GitHub Bot commented on ARROW-2180:
---

wesm commented on issue #1638: ARROW-2180: [C++] Remove deprecated APIs from 
0.8.0 cycle
URL: https://github.com/apache/arrow/pull/1638#issuecomment-367467535
 
 
   Appveyor build: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1706


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Remove APIs deprecated in 0.8.0 release
> -
>
> Key: ARROW-2180
> URL: https://issues.apache.org/jira/browse/ARROW-2180
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Assigned] (ARROW-2131) [Python] Serialization test fails on Windows when library has been built in place / not installed

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2131:
---

Assignee: Wes McKinney

> [Python] Serialization test fails on Windows when library has been built in 
> place / not installed
> -
>
> Key: ARROW-2131
> URL: https://issues.apache.org/jira/browse/ARROW-2131
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> I am not sure why this doesn't come up in Appveyor:
> {code}
> == FAILURES 
> ===
>  test_deserialize_buffer_in_different_process 
> _
> def test_deserialize_buffer_in_different_process():
> import tempfile
> import subprocess
> f = tempfile.NamedTemporaryFile(delete=False)
> b = pa.serialize(pa.frombuffer(b'hello')).to_buffer()
> f.write(b.to_pybytes())
> f.close()
> dir_path = os.path.dirname(os.path.realpath(__file__))
> python_file = os.path.join(dir_path, 'deserialize_buffer.py')
> >   subprocess.check_call([sys.executable, python_file, f.name])
> pyarrow\tests\test_serialization.py:596:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _
> popenargs = (['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att'],)
> kwargs = {}, retcode = 1
> cmd = ['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']
> def check_call(*popenargs, **kwargs):
> """Run command with arguments.  Wait for command to complete.  If
> the exit code was zero then return, otherwise raise
> CalledProcessError.  The CalledProcessError object will have the
> return code in the returncode attribute.
> The arguments are the same as for the call function.  Example:
> check_call(["ls", "-l"])
> """
> retcode = call(*popenargs, **kwargs)
> if retcode:
> cmd = kwargs.get("args")
> if cmd is None:
> cmd = popenargs[0]
> >   raise CalledProcessError(retcode, cmd)
> E   subprocess.CalledProcessError: Command 
> '['C:\\Miniconda3\\envs\\pyarrow-dev\\python.exe', 
> 'C:\\Users\\wesm\\code\\arrow\\python\\pyarrow\\tests\\deserialize_buffer.py',
>  'C:\\Users\\wesm\\AppData\\Local\\Temp\\tmp1gi__att']' returned non-zero 
> exit status 1.
> C:\Miniconda3\envs\pyarrow-dev\lib\subprocess.py:291: CalledProcessError
>  Captured stderr call 
> -
> Traceback (most recent call last):
>   File "C:\Users\wesm\code\arrow\python\pyarrow\tests\deserialize_buffer.py", 
> line 22, in 
> import pyarrow as pa
> ModuleNotFoundError: No module named 'pyarrow'
> === 1 failed, 15 passed, 4 skipped in 0.40 seconds 
> 
> {code}





[jira] [Updated] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns

2018-02-21 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2194:

Summary: [Python] Pandas columns metadata incorrect for empty string 
columns  (was: Pandas columns metadata incorrect for empty string columns)

> [Python] Pandas columns metadata incorrect for empty string columns
> ---
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4




