[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383311#comment-16383311
 ] 

Antoine Pitrou commented on ARROW-2237:
---

{{/mnt/hugepages}} exists by default here. Though there's something weird: it's 
{{/dev/hugepages}} that's mounted if I understand correctly:

{code:bash}
$ mount | \grep hugepages
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
{code}
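For scripted checks, the mounted hugetlbfs locations can be read from /proc/mounts instead of parsing `mount` output. A minimal sketch (a hypothetical helper, not part of pyarrow or Plasma):

{code:python}
def hugetlbfs_mounts(mounts_text):
    """Return mount points of type hugetlbfs from /proc/mounts-style text.

    Each line has the form:
    <device> <mountpoint> <fstype> <options> <dump> <pass>
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "hugetlbfs":
            points.append(fields[1])
    return points

# On Linux one would pass open("/proc/mounts").read(); here, sample text:
sample = "hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0"
print(hugetlbfs_mounts(sample))  # ['/dev/hugepages']
{code}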

> [Python] Huge tables test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> &type, buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383296#comment-16383296
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369847180
 
 
   @MaxRis running `clcache -s` gives you aggregate statistics for the cache, 
so you can see (by the number of hits and misses) if clcache was used at all.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-1391) [Python] Benchmarks for python serialization

2018-03-01 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383033#comment-16383033
 ] 

Alex Hagerman commented on ARROW-1391:
--

I see recent commits in the repo for the benchmarks. Is this still needed? If 
so, any guidance on where the nightly benchmarks might be located, or how to 
look into this?

> [Python] Benchmarks for python serialization
> 
>
> Key: ARROW-1391
> URL: https://issues.apache.org/jira/browse/ARROW-1391
> Project: Apache Arrow
>  Issue Type: Wish
>Reporter: Philipp Moritz
>Priority: Minor
>
> It would be great to have a suite of relevant benchmarks for the Python 
> serialization code in ARROW-759. These could be used to guide profiling and 
> performance improvements.
> Relevant use cases include:
> - dictionaries of large numpy arrays that are used to represent weights of a 
> neural network
> - long lists of primitive types like ints, floats or strings
> - lists of user defined python objects





[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382974#comment-16382974
 ] 

Philipp Moritz commented on ARROW-2237:
---

After creating /mnt/hugepages with

{code:bash}
sudo mkdir -p /mnt/hugepages
sudo mount -t hugetlbfs -o uid=`id -u` -o gid=`id -g` none /mnt/hugepages
sudo bash -c "echo `id -g` > /proc/sys/vm/hugetlb_shm_group"
sudo bash -c "echo 2 > /proc/sys/vm/nr_hugepages"
{code}

I can't reproduce the test failure on Ubuntu.
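The commands above configure three separate prerequisites (the mount point, the reserved page count, and the permitted group), and missing any one of them can produce a create_buffer failure rather than a skipped test. A rough preflight check, sketched in Python with hypothetical names:

{code:python}
import os

def hugepages_preflight(mount_point, nr_hugepages, shm_group, my_groups):
    """Return (ok, reasons): can hugepage-backed files likely be created?

    nr_hugepages and shm_group mirror /proc/sys/vm/nr_hugepages and
    /proc/sys/vm/hugetlb_shm_group; my_groups is e.g. os.getgroups().
    """
    reasons = []
    if not os.path.isdir(mount_point):
        reasons.append("mount point does not exist")
    if nr_hugepages <= 0:
        reasons.append("no huge pages reserved (vm.nr_hugepages)")
    if shm_group not in my_groups:
        reasons.append("caller's groups lack vm.hugetlb_shm_group")
    return (not reasons, reasons)
{code}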



[jira] [Commented] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type

2018-03-01 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382883#comment-16382883
 ] 

Wes McKinney commented on ARROW-488:


As currently scoped, yes. This functionality is not available in 
{{arrow::compute::Cast}}, though, so perhaps we can repurpose this JIRA to add 
it there, which may be a bit more complicated (since {{Cast}} is not yet able 
to deal with any null sentinels at all).
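The ~2^53 bound mentioned in the issue description is just the point where IEEE-754 doubles stop representing consecutive integers exactly, so an int -> float64 -> int round trip is only lossless below it. A quick pure-Python illustration (no Arrow involved):

{code:python}
def roundtrips_via_float(n):
    """True if the integer n survives an int -> float64 -> int round trip."""
    return int(float(n)) == n

assert roundtrips_via_float(2**53)          # 9007199254740992 is exactly representable
assert not roundtrips_via_float(2**53 + 1)  # rounds to 2**53, so the value is lost
{code}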

> [Python] Implement conversion between integer coded as floating points with 
> NaN to an Arrow integer type
> 
>
> Key: ARROW-488
> URL: https://issues.apache.org/jira/browse/ARROW-488
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> For example: if pandas has casted integer data to float, this would enable 
> the integer data to be recovered (so long as the values fall in the ~2^53 
> floating point range for exact integer representation)





[jira] [Assigned] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2244:
---

Assignee: Wes McKinney

> [C++] Slicing NullArray should not cause the null count on the internal data 
> to be unknown
> --
>
> Key: ARROW-2244
> URL: https://issues.apache.org/jira/browse/ARROW-2244
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101





[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382875#comment-16382875
 ] 

Philipp Moritz commented on ARROW-2237:
---

Which commands did you use to create /mnt/hugepages? (The test is skipped if 
it doesn't exist.)

I can try to reproduce this on a fresh image, but steps to reproduce on, say, 
an Ubuntu image would be appreciated!



[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382876#comment-16382876
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on issue #1682: ARROW-2232: [Python] pyarrow.Tensor 
constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#issuecomment-369769117
 
 
   This PR is ready to go, modulo any more review comments.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382874#comment-16382874
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171728016
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   For sure. Didn't mean to derail this conversation.






[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Robert Nishihara (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382868#comment-16382868
 ] 

Robert Nishihara commented on ARROW-2237:
-

Interesting, does {{/mnt/hugepages}} exist locally? If not, the test should be 
skipped. If yes, then maybe there is some permission error or something.
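The two cases above (directory missing vs. a permission problem) can be told apart with a few stdlib checks; a hypothetical debugging helper, not part of the test suite:

{code:python}
import os

def diagnose_store_dir(path):
    """Classify why a Plasma store directory might be unusable."""
    if not os.path.exists(path):
        return "missing (the test should be skipped)"
    if not os.path.isdir(path):
        return "not a directory"
    if not os.access(path, os.W_OK):
        return "no write permission"
    return "ok"

print(diagnose_store_dir("/mnt/hugepages"))
{code}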



[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382867#comment-16382867
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

wesm commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171727391
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   Let's take up some prototyping in a separate patch or repo to understand 
what a pybind11-based C++ API for pyarrow would look like or how it would work. 
This is already being used in turbodbc (which uses pybind11 for its bindings -- 
see 
https://github.com/blue-yonder/turbodbc/blob/master/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp#L252)






[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382860#comment-16382860
 ] 

Wes McKinney commented on ARROW-2237:
-

This looks like a local failure



[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382853#comment-16382853
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171726050
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1371,6 +1371,69 @@ def test_structarray(self):
         series = pd.Series(arr.to_pandas())
         tm.assert_series_equal(series, expected)
 
+    def test_from_numpy(self):
+        dt = np.dtype([('x', np.int32),
+                       (('y_title', 'y'), np.bool_)])
+        ty = pa.struct([pa.field('x', pa.int32()),
+                        pa.field('y', pa.bool_())])
+
+        data = np.array([], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == []
+
+        data = np.array([(42, True), (43, False)], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == [{'x': 42, 'y': True},
+                                   {'x': 43, 'y': False}]
+
+        # With mask
+        arr = pa.array(data, mask=np.bool_([False, True]), type=ty)
+        assert arr.to_pylist() == [{'x': 42, 'y': True}, None]
+
+        # Trivial struct type
+        dt = np.dtype([])
+        ty = pa.struct([])
+
+        data = np.array([], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == []
+
+        data = np.array([(), ()], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == [{}, {}]
+
+    def test_from_numpy_nested(self):
+        dt = np.dtype([('x', np.dtype([('xx', np.int8),
+                                       ('yy', np.bool_)])),
+                       ('y', np.int16)])
+        ty = pa.struct([pa.field('x', pa.struct([pa.field('xx', pa.int8()),
+                                                 pa.field('yy', pa.bool_())])),
+                        pa.field('y', pa.int16())])
+
+        data = np.array([], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == []
+
+        data = np.array([((1, True), 2), ((3, False), 4)], dtype=dt)
+        arr = pa.array(data, type=ty)
+        assert arr.to_pylist() == [{'x': {'xx': 1, 'yy': True}, 'y': 2},
+                                   {'x': {'xx': 3, 'yy': False}, 'y': 4}]
+
+    def test_from_numpy_bad_input(self):
+        ty = pa.struct([pa.field('x', pa.int32()),
+                        pa.field('y', pa.bool_())])
+        dt = np.dtype([('x', np.int32),
+                       ('z', np.bool_)])
+
+        data = np.array([], dtype=dt)
+        with pytest.raises(TypeError,
+                           match="Missing field 'y'"):
+            pa.array(data, type=ty)
+        data = np.int32([])
+        with pytest.raises(TypeError,
+                           match="Expected struct array"):
+            pa.array(data, type=ty)
 
 Review comment:
   Per above, it may be worth writing a "large memory" test with the 
`large_memory` pytest mark (which we can run locally, but not in Travis CI) 
where we have a field that overflows the 2G in a BinaryArray so we can test the 
rechunking / splitting of the null bitmap. I guess you'll have to pass a mask 
to get some nulls to make sure the logic is correct




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382851#comment-16382851
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171721758
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
   return out;
 }
 
+// --
+// Misc APIs
+
+namespace internal {
+
+std::vector<ArrayVector> RechunkArraysConsistently(
+    const std::vector<ArrayVector>& groups) {
+  if (groups.size() <= 1) {
+    return groups;
+  }
+  // Adjacent slices defining the desired rechunking
+  std::vector<std::pair<int64_t, int64_t>> slices;
+  // Total number of elements common to all array groups
+  int64_t total_length = -1;
+
+  {
+    // Compute a vector of slices such that each array spans
+    // one or more *entire* slices only
+    // e.g. if group #1 has bounds {0, 2, 4, 5, 10}
+    // and group #2 has bounds {0, 5, 7, 10}
+    // then the computed slices are
+    // {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)}
+    std::set<int64_t> bounds;
+    for (auto& group : groups) {
+      int64_t cur = 0;
+      bounds.insert(cur);
+      for (auto& array : group) {
+        cur += array->length();
+        bounds.insert(cur);
+      }
+      if (total_length == -1) {
+        total_length = cur;
+      } else {
+        // XXX Should we return an error code instead?
+        DCHECK_EQ(total_length, cur)
+            << "Array groups should have the same number of elements";
 
 Review comment:
   Since this API is internal, it's not necessary. Reaching this code path 
would indicate an internal programming error by the Arrow developer. Should 
this code path ever be exposed in some way to user input, then returning an 
error code would make more sense




[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382849#comment-16382849
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171722310
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
   return out;
 }
 
+// --
+// Misc APIs
+
+namespace internal {
+
+std::vector<ArrayVector> RechunkArraysConsistently(
+    const std::vector<ArrayVector>& groups) {
+  if (groups.size() <= 1) {
+    return groups;
+  }
+  // Adjacent slices defining the desired rechunking
+  std::vector<std::pair<int64_t, int64_t>> slices;
+  // Total number of elements common to all array groups
+  int64_t total_length = -1;
+
+  {
+    // Compute a vector of slices such that each array spans
+    // one or more *entire* slices only
+    // e.g. if group #1 has bounds {0, 2, 4, 5, 10}
+    // and group #2 has bounds {0, 5, 7, 10}
+    // then the computed slices are
+    // {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)}
+    std::set<int64_t> bounds;
+    for (auto& group : groups) {
+      int64_t cur = 0;
+      bounds.insert(cur);
+      for (auto& array : group) {
+        cur += array->length();
+        bounds.insert(cur);
 
 Review comment:
   The complexity of this code is roughly O(ncolumns * log(num_chunks)). The 
algorithm in `TableBatchReader::ReadNext` is linear-time -- whether it's more 
complex than what's below may be a matter of opinion.
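The bounds-union idea described in the quoted diff (collect every group's cumulative chunk boundaries, then slice at the union) can be sketched in a few lines of Python; this is an illustrative model of the quoted C++, not Arrow's actual implementation:

{code:python}
def consistent_slices(groups):
    """Given per-group chunk lengths, return the adjacent (start, stop)
    slices that every group's chunk boundaries align with."""
    bounds = {0}
    total = None
    for group in groups:
        cur = 0
        for length in group:
            cur += length
            bounds.add(cur)
        if total is None:
            total = cur
        else:
            assert total == cur, "groups must have the same total length"
    ordered = sorted(bounds)
    return list(zip(ordered, ordered[1:]))

# The example from the quoted comment: bounds {0,2,4,5,10} and {0,5,7,10}
print(consistent_slices([[2, 2, 1, 5], [5, 2, 3]]))
# [(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)]
{code}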




[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382856#comment-16382856
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171725444
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) {
   return PushArray(result->data());
 }
 
+Status NumPyConverter::Visit(const StructType& type) {
+  std::vector<NumPyConverter> sub_converters;
+  std::vector<OwnedRefNoGIL> sub_arrays;
+
+  {
+    PyAcquireGIL gil_lock;
+
+    // Create converters for each struct type field
+    if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) {
+      return Status::TypeError("Expected struct array");
+    }
+
+    for (auto field : type.children()) {
+      PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str());
+      if (tup == NULL) {
+        std::stringstream ss;
+        ss << "Missing field '" << field->name() << "' in struct array";
+        return Status::TypeError(ss.str());
+      }
+      PyArray_Descr* sub_dtype =
+          reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0));
+      DCHECK(PyArray_DescrCheck(sub_dtype));
+      int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1)));
+      RETURN_IF_PYERROR();
+      Py_INCREF(sub_dtype); /* PyArray_GetField() steals ref */
+      PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset);
+      RETURN_IF_PYERROR();
+      sub_arrays.emplace_back(sub_array);
+      sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */, field->type(),
+                                  use_pandas_null_sentinels_);
+    }
+  }
+
+  std::vector<ArrayVector> groups;
+
+  // Compute null bitmap and store it as a Null Array to include it
+  // in the rechunking below
+  {
+    int64_t null_count = 0;
+    if (mask_ != nullptr) {
+      RETURN_NOT_OK(InitNullBitmap());
+      null_count = MaskToBitmap(mask_, length_, null_bitmap_data_);
+    }
+    auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_,
+                                     {null_bitmap_}, null_count, 0);
+    DCHECK_EQ(null_data->buffers.size(), 1);
+    groups.push_back({std::make_shared<NullArray>(null_data)});
+  }
+
+  // Convert child data
+  for (auto& converter : sub_converters) {
+    RETURN_NOT_OK(converter.Convert());
+    groups.push_back(converter.result());
+  }
+  // Ensure the different array groups are chunked consistently
+  groups = ::arrow::internal::RechunkArraysConsistently(groups);
+
+  // Make struct array chunks by combining groups
+  size_t ngroups = groups.size();
+  size_t chunk, nchunks = groups[0].size();
+  for (chunk = 0; chunk < nchunks; chunk++) {
+    // Create struct array chunk and populate it
+    // First group has the null bitmaps as Null Arrays
+    auto null_data = groups[0][chunk]->data();
+    DCHECK_EQ(null_data->type->id(), Type::NA);
+    DCHECK_EQ(null_data->buffers.size(), 1);
+
+    auto arr_data = ArrayData::Make(type_, length_, null_data->null_count, 0);
 
 Review comment:
   Interacting with `data()->null_count` post-slicing can be hazardous, since 
it can be set to -1 as part of the slice operation. I just opened a bug 
https://issues.apache.org/jira/browse/ARROW-2244. 
   
   I think you also need to preserve the `offset` from each `null_data`, because 
it may be sliced. The ways in which these bugs would bite right now are pretty 
esoteric, but it will eventually happen -- I'm not sure offhand what the best 
way to write unit tests for this is. 
   
   Let me know if this is unclear; I can explain in more detail.
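As an illustration of the hazard described above, here is a toy Python model (illustrative names only, not Arrow's API) of how slicing turns a nonzero null count into the "unknown" sentinel while shifting the offset:

```python
class ToyArrayData:
    """Toy model of Arrow's ArrayData slicing; names are illustrative."""

    def __init__(self, length, null_count=0, offset=0):
        self.length = length
        self.null_count = null_count
        self.offset = offset

    def slice(self, off, length):
        # Recounting nulls in the sliced range is not free, so a nonzero
        # count becomes the "unknown" sentinel (-1); zero stays zero.
        nc = 0 if self.null_count == 0 else -1
        return ToyArrayData(length, nc, self.offset + off)


data = ToyArrayData(length=10, null_count=3)
sliced = data.slice(2, 5)
# sliced.null_count is now -1 and sliced.offset is 2: code that reads
# data()->null_count or ignores offset after a slice will misbehave
```

Consumers must therefore treat a post-slice null count as potentially unknown and always carry the offset forward.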


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line

[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382848#comment-16382848
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171721221
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
   return out;
 }
 
+// ----------------------------------------------------------------------
+// Misc APIs
+
+namespace internal {
+
+std::vector<ArrayVector> RechunkArraysConsistently(
+    const std::vector<ArrayVector>& groups) {
+  if (groups.size() <= 1) {
+    return groups;
+  }
+  // Adjacent slices defining the desired rechunking
+  std::vector<std::pair<int64_t, int64_t>> slices;
+  // Total number of elements common to all array groups
+  int64_t total_length = -1;
+
+  {
+    // Compute a vector of slices such that each array spans
+    // one or more *entire* slices only
+    // e.g. if group #1 has bounds {0, 2, 4, 5, 10}
+    // and group #2 has bounds {0, 5, 7, 10}
+    // then the computed slices are
+    // {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)}
+    std::set<int64_t> bounds;
+    for (auto& group : groups) {
 
 Review comment:
   `const auto&` would be a bit more idiomatic




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382852#comment-16382852
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171723407
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) {
   return PushArray(result->data());
 }
 
+Status NumPyConverter::Visit(const StructType& type) {
+  std::vector<NumPyConverter> sub_converters;
+  std::vector<OwnedRefNoGIL> sub_arrays;
+
+  {
+    PyAcquireGIL gil_lock;
+
+    // Create converters for each struct type field
+    if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) {
+      return Status::TypeError("Expected struct array");
+    }
+
+    for (auto field : type.children()) {
+      PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str());
 
 Review comment:
   Does this function presume UTF-8 for the 2nd argument when given unicode 
keys? The C API docs don't say: 
https://docs.python.org/3/c-api/dict.html#c.PyDict_GetItemString




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382854#comment-16382854
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171724263
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) {
   return PushArray(result->data());
 }
 
+Status NumPyConverter::Visit(const StructType& type) {
+  std::vector<NumPyConverter> sub_converters;
+  std::vector<OwnedRefNoGIL> sub_arrays;
+
+  {
+    PyAcquireGIL gil_lock;
+
+    // Create converters for each struct type field
+    if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) {
+      return Status::TypeError("Expected struct array");
+    }
+
+    for (auto field : type.children()) {
+      PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str());
+      if (tup == NULL) {
+        std::stringstream ss;
+        ss << "Missing field '" << field->name() << "' in struct array";
+        return Status::TypeError(ss.str());
+      }
+      PyArray_Descr* sub_dtype =
+          reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0));
+      DCHECK(PyArray_DescrCheck(sub_dtype));
+      int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1)));
+      RETURN_IF_PYERROR();
+      Py_INCREF(sub_dtype); /* PyArray_GetField() steals ref */
+      PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset);
+      RETURN_IF_PYERROR();
+      sub_arrays.emplace_back(sub_array);
+      sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */, field->type(),
+                                  use_pandas_null_sentinels_);
+    }
+  }
+
+  std::vector<ArrayVector> groups;
+
+  // Compute null bitmap and store it as a Null Array to include it
+  // in the rechunking below
+  {
+    int64_t null_count = 0;
+    if (mask_ != nullptr) {
+      RETURN_NOT_OK(InitNullBitmap());
+      null_count = MaskToBitmap(mask_, length_, null_bitmap_data_);
+    }
+    auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_,
+                                     {null_bitmap_}, null_count, 0);
+    DCHECK_EQ(null_data->buffers.size(), 1);
+    groups.push_back({std::make_shared<NullArray>(null_data)});
+  }
+
+  // Convert child data
+  for (auto& converter : sub_converters) {
+    RETURN_NOT_OK(converter.Convert());
+    groups.push_back(converter.result());
+  }
+  // Ensure the different array groups are chunked consistently
+  groups = ::arrow::internal::RechunkArraysConsistently(groups);
+
+  // Make struct array chunks by combining groups
+  size_t ngroups = groups.size();
+  size_t chunk, nchunks = groups[0].size();
+  for (chunk = 0; chunk < nchunks; chunk++) {
 
 Review comment:
   Maybe declare `size_t chunk` here and remove from previous line, for 
readability




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382855#comment-16382855
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171724042
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) {
   return PushArray(result->data());
 }
 
+Status NumPyConverter::Visit(const StructType& type) {
+  std::vector<NumPyConverter> sub_converters;
+  std::vector<OwnedRefNoGIL> sub_arrays;
+
+  {
+    PyAcquireGIL gil_lock;
+
+    // Create converters for each struct type field
+    if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) {
+      return Status::TypeError("Expected struct array");
+    }
+
+    for (auto field : type.children()) {
+      PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str());
+      if (tup == NULL) {
+        std::stringstream ss;
+        ss << "Missing field '" << field->name() << "' in struct array";
+        return Status::TypeError(ss.str());
+      }
+      PyArray_Descr* sub_dtype =
+          reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0));
+      DCHECK(PyArray_DescrCheck(sub_dtype));
+      int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1)));
+      RETURN_IF_PYERROR();
+      Py_INCREF(sub_dtype);  /* PyArray_GetField() steals ref */
+      PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset);
+      RETURN_IF_PYERROR();
+      sub_arrays.emplace_back(sub_array);
+      sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */,
+                                  field->type(), use_pandas_null_sentinels_);
+    }
+  }
+
+  std::vector<ArrayVector> groups;
+
+  // Compute null bitmap and store it as a Null Array to include it
+  // in the rechunking below
+  {
+    int64_t null_count = 0;
+    if (mask_ != nullptr) {
+      RETURN_NOT_OK(InitNullBitmap());
+      null_count = MaskToBitmap(mask_, length_, null_bitmap_data_);
+    }
+    auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_,
+                                     {null_bitmap_}, null_count, 0);
 
 Review comment:
   You could use a boolean array (which is bit-packed) to make it less hacky




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382850#comment-16382850
 ] 

ASF GitHub Bot commented on ARROW-2142:
---

wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow 
conversion from Numpy struct array
URL: https://github.com/apache/arrow/pull/1635#discussion_r171722648
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) {
   return out;
 }
 
+// ----------------------------------------------------------------------
+// Misc APIs
+
+namespace internal {
+
+std::vector<ArrayVector> RechunkArraysConsistently(
+    const std::vector<ArrayVector>& groups) {
+  if (groups.size() <= 1) {
+    return groups;
+  }
+  // Adjacent slices defining the desired rechunking
+  std::vector<std::pair<int64_t, int64_t>> slices;
+  // Total number of elements common to all array groups
+  int64_t total_length = -1;
+
+  {
+    // Compute a vector of slices such that each array spans
+    // one or more *entire* slices only
+    // e.g. if group #1 has bounds {0, 2, 4, 5, 10}
+    // and group #2 has bounds {0, 5, 7, 10}
+    // then the computed slices are
+    // {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)}
+    std::set<int64_t> bounds;
+    for (auto& group : groups) {
+      int64_t cur = 0;
+      bounds.insert(cur);
+      for (auto& array : group) {
+        cur += array->length();
+        bounds.insert(cur);
+      }
+      if (total_length == -1) {
+        total_length = cur;
+      } else {
+        // XXX Should we return an error code instead?
+        DCHECK_EQ(total_length, cur)
+            << "Array groups should have the same number of elements";
+      }
+    }
+    if (total_length == 0) {
+      return groups;
+    }
+    auto it = bounds.cbegin();
+    auto end = bounds.cend();
+    int64_t start = *it;
+    while (++it != end) {
+      int64_t stop = *it;
+      DCHECK_GE(stop, start);
+      slices.emplace_back(start, stop);
+      start = stop;
+    }
+    DCHECK_EQ(slices.front().first, 0);
+    DCHECK_EQ(slices.back().second, total_length);
+  }
+
+  // Rechunk each array group along the computed slices
+  std::vector<ArrayVector> rechunked_groups;
+  for (auto& group : groups) {
+    ArrayVector rechunked;
+    int64_t cur = 0;
+    auto slices_it = slices.cbegin();
+    auto slices_end = slices.cend();
+
+    for (auto& array : group) {
+      int64_t array_start = cur, array_stop = cur + array->length();
 
 Review comment:
   It's better for readability to put each assignment on its own line
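The bounds-merging step described in the quoted comments (group #1 bounds {0, 2, 4, 5, 10}, group #2 bounds {0, 5, 7, 10}) can be sketched in Python over plain chunk lengths; this is an illustration of the algorithm, not Arrow's code:

```python
def rechunk_lengths_consistently(groups):
    """Merge each group's chunk boundaries into one common chunk layout.

    groups: lists of chunk lengths, all summing to the same total.
    Returns slice lengths such that every original chunk spans one or
    more *entire* output slices (toy model of RechunkArraysConsistently).
    """
    bounds = {0}
    for lengths in groups:
        cur = 0
        for n in lengths:
            cur += n
            bounds.add(cur)
    ordered = sorted(bounds)
    return [stop - start for start, stop in zip(ordered, ordered[1:])]


# Chunk lengths [2, 2, 1, 5] give bounds {0, 2, 4, 5, 10}; [5, 2, 3] give
# bounds {0, 5, 7, 10}; the merged slices have lengths [2, 2, 1, 2, 3]
print(rechunk_lengths_consistently([[2, 2, 1, 5], [5, 2, 3]]))  # [2, 2, 1, 2, 3]
```

Each group is then re-sliced along this common layout, so corresponding chunks line up across all groups.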




> [Python] Conversion from Numpy struct array unimplemented
> -
>
> Key: ARROW-2142
> URL: https://issues.apache.org/jira/browse/ARROW-2142
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)]))
> >>> arr
> array([(1.5,)], dtype=[('x', '<f4')])
> >>> arr[0]
> (1.5,)
> >>> arr['x']
> array([1.5], dtype=float32)
> >>> arr['x'][0]
> 1.5
> >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     pa.array(arr, type=pa.struct([pa.field('x', pa.float32())]))
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "error.pxi", line 77, in pyarrow.lib.check_status
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: 
> converter.Convert()
> NumPyConverter doesn't implement struct<x: float> conversion.
> {code}





[jira] [Updated] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2244:

Issue Type: Bug  (was: Improvement)

> [C++] Slicing NullArray should not cause the null count on the internal data 
> to be unknown
> --
>
> Key: ARROW-2244
> URL: https://issues.apache.org/jira/browse/ARROW-2244
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101





[jira] [Created] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown

2018-03-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2244:
---

 Summary: [C++] Slicing NullArray should not cause the null count 
on the internal data to be unknown
 Key: ARROW-2244
 URL: https://issues.apache.org/jira/browse/ARROW-2244
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101





[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table

2018-03-01 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382819#comment-16382819
 ] 

Phillip Cloud commented on ARROW-1940:
--

Taking a look at this now.

> [Python] Extra metadata gets added after multiple conversions between 
> pd.DataFrame and pa.Table
> ---
>
> Key: ARROW-1940
> URL: https://issues.apache.org/jira/browse/ARROW-1940
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Dima Ryazanov
>Assignee: Phillip Cloud
>Priority: Minor
> Fix For: 0.9.0
>
> Attachments: fail.py
>
>
> We have a unit test that verifies that loading a dataframe from a .parq file 
> and saving it back with no changes produces the same result as the original 
> file. It started failing with pyarrow 0.8.0.
> After digging into it, I discovered that after the first conversion from 
> pd.DataFrame to pa.Table, the table contains the following metadata (among 
> other things):
> {code}
> "column_indexes": [{"metadata": null, "field_name": null, "name": null, 
> "numpy_type": "object", "pandas_type": "bytes"}]
> {code}
> However, after converting it to pd.DataFrame and back into a pa.Table for the 
> second time, the metadata gets an encoding field:
> {code}
> "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, 
> "name": null, "numpy_type": "object", "pandas_type": "unicode"}]
> {code}
> See the attached file for a test case.
> So specifically, it appears that dataframe->table->dataframe->table 
> conversion produces a different result from just dataframe->table - which I 
> think is unexpected.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382814#comment-16382814
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369760475
 
 
   @pitrou, do you have an idea how to verify that clcache.exe was really used 
during compilation? I've tried with it and without, but I can't find any 
difference in output/produced results.




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Updated] (ARROW-2243) [C++] Enable IPO/LTO

2018-03-01 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-2243:
-
Fix Version/s: (was: 0.9.0)

> [C++] Enable IPO/LTO
> 
>
> Key: ARROW-2243
> URL: https://issues.apache.org/jira/browse/ARROW-2243
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Minor
>
> We should enable interprocedural/link-time optimization. CMake >= 3.9.4 
> supports a generic way of doing this.





[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382793#comment-16382793
 ] 

Philipp Moritz commented on ARROW-2237:
---

Was this on Travis or on your local machine?

> [Python] Huge tables test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> &type, buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}





[jira] [Resolved] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2177.
-
Resolution: Fixed

Resolved as part of ARROW-2145 
https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278

> [C++] Remove support for specifying negative scale values in DecimalType
> 
>
> Key: ARROW-2177
> URL: https://issues.apache.org/jira/browse/ARROW-2177
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>
> Allowing both negative and positive scale makes it ambiguous what the scale 
> of a number should be when it is written in exponential notation, e.g., 
> {{0.01E3}}. Should that have a precision of 4 and a scale of 2, since it's 
> specified with 2 digits to the right of the decimal point and it evaluates to 
> 10? Or a precision of 1 and a scale of -1?
> Currently it's the latter, but I think it should be the former.
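Python's decimal module keeps the literal's exponent, which makes the ambiguity easy to demonstrate (the helper name below is made up for illustration):

```python
from decimal import Decimal


def precision_scale(text):
    # Significant digits and scale (negated exponent) of a decimal literal
    t = Decimal(text).as_tuple()
    return len(t.digits), -t.exponent


# "0.01E3" normalizes to 1E+1: one significant digit with exponent +1,
# i.e. precision 1 and scale -1 -- the "latter" interpretation above
assert precision_scale("0.01E3") == (1, -1)
# whereas a plain "0.01" has precision 1 and scale 2
assert precision_scale("0.01") == (1, 2)
```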





[jira] [Resolved] (ARROW-2160) [C++/Python] Fix decimal precision inference

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2160.
-
Resolution: Fixed

Resolved as part of ARROW-2145 
https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278

> [C++/Python] Fix decimal precision inference
> 
>
> Key: ARROW-2160
> URL: https://issues.apache.org/jira/browse/ARROW-2160
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code}
> import pyarrow as pa
> import pandas as pd
> import decimal
> df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]})
> pa.Table.from_pandas(df)
> {code}
> raises:
> {code}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into 
> precision inferred from first array element: 1
> {code}
> Looks like Arrow is inferring the highest precision for a given column based 
> on the first cell and expecting the rest to fit in. I understand this is by 
> design, but from the point of view of pandas-arrow compatibility this is quite 
> painful, as pandas is more flexible (as demonstrated).
> What this means is that user trying to pass pandas {{DataFrame}} with 
> {{Decimal}} column(s) to arrow {{Table}} would always have to first:
> # Find the highest precision used in (each of) that column(s)
> # Adjust the first cell of (each of) that column(s) so that it explicitly 
> uses the highest precision of that column(s)
> # Only then pass such {{DataFrame}} to {{Table.from_pandas()}}
> So given this unavoidable procedure (and assuming arrow needs to be strict 
> about the highest precision for a column) - shouldn't some similar logic be 
> part of the {{Table.from_pandas()}} directly to make this transparent?
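The column-wide scan the reporter asks for could look like this sketch (pure Python over decimal values; the function name is hypothetical, not a pyarrow API):

```python
from decimal import Decimal


def column_precision_scale(values):
    """Widest (precision, scale) that fits every Decimal in a column."""
    max_int_digits = 0  # digits left of the decimal point
    max_scale = 0       # digits right of the decimal point
    for v in values:
        t = v.as_tuple()
        max_scale = max(max_scale, max(-t.exponent, 0))
        max_int_digits = max(max_int_digits, max(len(t.digits) + t.exponent, 0))
    return max_int_digits + max_scale, max_scale


# The failing example from this report: (precision 2, scale 2) covers both
assert column_precision_scale([Decimal("0.1"), Decimal("0.01")]) == (2, 2)
```

Running such a scan inside {{Table.from_pandas()}} would pick a precision that fits the whole column rather than only its first cell.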





[jira] [Resolved] (ARROW-2157) [Python] Decimal arrays cannot be constructed from Python lists

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2157.
-
Resolution: Fixed

Resolved as part of ARROW-2145 
https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278

> [Python] Decimal arrays cannot be constructed from Python lists
> ---
>
> Key: ARROW-2157
> URL: https://issues.apache.org/jira/browse/ARROW-2157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
> Fix For: 0.9.0
>
>
> {code}
> In [14]: pa.array([Decimal('1')])
> ---
> ArrowInvalid  Traceback (most recent call last)
> <ipython-input-14-...> in <module>()
> ----> 1 pa.array([Decimal('1')])
> array.pxi in pyarrow.lib.array()
> array.pxi in pyarrow.lib._sequence_to_array()
> error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Error inferring Arrow data type for collection of Python 
> objects. Got Python object of type Decimal but can only handle these types: 
> bool, float, integer, date, datetime, bytes, unicode
> {code}





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382770#comment-16382770
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

wesm closed pull request #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/.travis.yml b/.travis.yml
index a4c74657e..b1241e793 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -174,7 +174,7 @@ matrix:
 - $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh
 script:
 - $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh
-  # [OS X] C++ & glib w/ XCode 8.3 & autotools
+  # [OS X] C++ & glib w/ XCode 8.3 & autotools & homebrew
   - compiler: clang
 osx_image: xcode8.3
 os: osx
@@ -185,7 +185,8 @@ matrix:
 - BUILD_SYSTEM=autotools
 before_script:
 - if [ $ARROW_CI_C_GLIB_AFFECTED != "1" ]; then exit; fi
-- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library
+- $TRAVIS_BUILD_DIR/ci/travis_install_osx.sh
+- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library 
--homebrew
 - $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh
 script:
 - $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh
diff --git a/c_glib/Brewfile b/c_glib/Brewfile
index 9fe5c3b61..955072e1e 100644
--- a/c_glib/Brewfile
+++ b/c_glib/Brewfile
@@ -16,7 +16,7 @@
 # under the License.
 
 brew "autoconf-archive"
-brew "boost"
+brew "boost", args: ["1.65.0"]
 brew "ccache"
 brew "cmake"
 brew "git"
diff --git a/ci/travis_before_script_c_glib.sh 
b/ci/travis_before_script_c_glib.sh
index 27d1e86fd..033fbd7c6 100755
--- a/ci/travis_before_script_c_glib.sh
+++ b/ci/travis_before_script_c_glib.sh
@@ -21,9 +21,7 @@ set -ex
 
 source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh
 
-if [ $TRAVIS_OS_NAME = "osx" ]; then
-  brew update && brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
-else  # Linux
+if [ $TRAVIS_OS_NAME = "linux" ]; then
   sudo apt-get install -y -q gtk-doc-tools autoconf-archive 
libgirepository1.0-dev
 fi
 
diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh
index 17b5deb36..b9afbee78 100755
--- a/ci/travis_before_script_cpp.sh
+++ b/ci/travis_before_script_cpp.sh
@@ -22,10 +22,22 @@ set -ex
 
 source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh
 
-if [ "$1" == "--only-library" ]; then
-  only_library_mode=yes
-else
-  only_library_mode=no
+only_library_mode=no
+using_homebrew=no
+
+while true; do
+case "$1" in
+   --only-library)
+   only_library_mode=yes
+   shift ;;
+   --homebrew)
+   using_homebrew=yes
+   shift ;;
+   *) break ;;
+esac
+done
+
+if [ "$only_library_mode" == "no" ]; then
   source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh
 fi
 
@@ -78,6 +90,10 @@ if [ $TRAVIS_OS_NAME == "linux" ]; then
   -DBUILD_WARNING_LEVEL=$ARROW_BUILD_WARNING_LEVEL \
   $ARROW_CPP_DIR
 else
+if [ "$using_homebrew" = "yes" ]; then
+   # build against homebrew's boost if we're using it
+   export BOOST_ROOT=/usr/local/opt/boost
+fi
 cmake $CMAKE_COMMON_FLAGS \
   $CMAKE_OSX_FLAGS \
   -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
diff --git a/ci/travis_build_parquet_cpp.sh b/ci/travis_build_parquet_cpp.sh
index 7d2e3ab73..f64a85d62 100755
--- a/ci/travis_build_parquet_cpp.sh
+++ b/ci/travis_build_parquet_cpp.sh
@@ -38,7 +38,7 @@ cmake \
 -GNinja \
 -DCMAKE_BUILD_TYPE=debug \
 -DCMAKE_INSTALL_PREFIX=$ARROW_PYTHON_PARQUET_HOME \
--DPARQUET_BOOST_USE_SHARED=off \
+-DPARQUET_BOOST_USE_SHARED=on \
 -DPARQUET_BUILD_BENCHMARKS=off \
 -DPARQUET_BUILD_EXECUTABLES=off \
 -DPARQUET_BUILD_TESTS=off \
diff --git a/ci/travis_install_linux.sh b/ci/travis_install_linux.sh
index acee9ebcb..74fde2774 100755
--- a/ci/travis_install_linux.sh
+++ b/ci/travis_install_linux.sh
@@ -19,7 +19,7 @@
 
 sudo apt-get install -y -q \
 gdb ccache libboost-dev libboost-filesystem-dev \
-libboost-system-dev libjemalloc-dev
+libboost-system-dev libboost-regex-dev libjemalloc-dev
 
 if [ "$ARROW_TRAVIS_VALGRIND" == "1" ]; then
 sudo apt-get install -y -q valgrind
diff --git a/ci/travis_install_osx.sh b/ci/travis_install_osx.sh
new file mode 100755
index 0..b03a5f16a
--- /dev/null
+++ b/ci/travis_install_osx.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to

[jira] [Resolved] (ARROW-2153) [C++/Python] Decimal conversion not working for exponential notation

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2153.
-
Resolution: Fixed

Resolved as part of ARROW-2145 
https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278

> [C++/Python] Decimal conversion not working for exponential notation
> 
>
> Key: ARROW-2153
> URL: https://issues.apache.org/jira/browse/ARROW-2153
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('2E+1')]}))
> {code}
>  
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 350, in dataframe_to_arrays
> convert_types)]
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 349, in <listcomp>
> for c, t in zip(columns_to_convert,
>   File 
> "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
> line 345, in convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
> pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
> 'E' instead.
> {code}
> In manual cases we can clearly write {{decimal.Decimal('20')}} instead of 
> {{decimal.Decimal('2E+1')}}, but during arithmetic operations inside an 
> application the exponential notation can be produced outside our control (it 
> is actually the _normalized_ form of the decimal number). Moreover, for some 
> values the exponential notation is the only form expressing the 
> significance, so it should be accepted.
> The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
> using the following transformation, but that's only possible when the 
> significance information doesn't need to be kept:
> {code:java}
> def remove_exponent(d):
> return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
> {code}
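To make the trade-off described above concrete, here is the documented transformation in action (stdlib only): integral values lose the exponent notation entirely, while non-integral values are normalized, which is exactly where trailing-zero significance is dropped.

```python
from decimal import Decimal

def remove_exponent(d):
    # Transformation suggested by the Python decimal documentation
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()

print(remove_exponent(Decimal('2E+1')))   # 20   (exponent form removed)
print(remove_exponent(Decimal('1.100')))  # 1.1  (significance of '1.100' lost)
```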





[jira] [Resolved] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2145.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1651
[https://github.com/apache/arrow/pull/1651]

> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> <listcomp>
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
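One stdlib-only way around this class of failure is to map every non-finite Decimal (NaN, sNaN, +/-Infinity) to None before conversion, so the converter only ever sees finite decimals or nulls (`sanitize_decimals` is a hypothetical helper, not part of pyarrow):

```python
from decimal import Decimal

def sanitize_decimals(values):
    """Replace special Decimal values (NaN, sNaN, +/-Infinity) with None,
    leaving finite decimals untouched."""
    return [None if not v.is_finite() else v for v in values]

print(sanitize_decimals([Decimal('1.1'), Decimal('NaN'), Decimal('Infinity')]))
# [Decimal('1.1'), None, None]
```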





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382763#comment-16382763
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

wesm commented on issue #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#issuecomment-369752375
 
 
   Sweet, here is the Appveyor build: 
https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a 
quick look through and then merge


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> <listcomp>
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382750#comment-16382750
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171711960
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, 
uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+bool use_pandas_null_sentinels,
+std::shared_ptr* out_null_bitmap_,
+int64_t* out_null_count) {
+NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+*out_null_bitmap_ = converter.null_bitmap_;
+*out_null_count = converter.null_count_;
+return Status::OK();
+  }
+
+  template 
+  Status Visit(PyArrayObject* arr) {
+typedef internal::npy_traits traits;
+
+const bool null_sentinels_possible =
+// Always treat Numpy's NaT as null
+TYPE == NPY_DATETIME ||
 
 Review comment:
   AFAIU there's no way to interpret `NaT` other than as `NULL` (unless some 
standard defines it differently than "missing"). NaN, on the other hand, is 
part of the IEEE floating-point specification (as I'm sure you know) and has 
a different meaning than null.
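The distinction drawn above can be demonstrated with plain floats: an IEEE-754 NaN is a legitimate value with defined arithmetic and comparison semantics, not a "missing" marker (stdlib-only sketch):

```python
import math

nan = float('nan')

# NaN is a real IEEE-754 value: it propagates through arithmetic...
assert math.isnan(nan + 1.0)
# ...and compares unequal even to itself, by specification:
assert nan != nan
# A null/missing marker, by contrast, carries no numeric semantics at all,
# which is why NaT can only reasonably be read as "missing".
```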




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, the NaN is 
> reinterpreted as an integer when converted to a table. This seems like a bug, 
> since a known limitation in pandas (the inability to have null-valued integer 
> data) is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  
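The bogus value in the output above is recognizable: it is INT64_MIN, the "integer indefinite" that x86 produces when truncating NaN to a 64-bit integer, i.e. the float-to-int cast went through unchecked (stdlib-only check of that identity):

```python
# The sentinel that leaked through the conversion is exactly INT64_MIN:
assert -9223372036854775808 == -(2 ** 63)

# Python itself refuses the cast that the unchecked C conversion performs:
try:
    int(float('nan'))
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```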





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382735#comment-16382735
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171710346
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
 result = table.to_pandas()
 tm.assert_frame_equal(result, ex_frame)
 
+def test_float_nulls_to_ints(self):
+# ARROW-2135
+df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+table = pa.Table.from_pandas(df, schema=schema)
+assert table[0].to_pylist() == [1, 2, None]
+tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   That's fine. Was just wondering.




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, the NaN is 
> reinterpreted as an integer when converted to a table. This seems like a bug, 
> since a known limitation in pandas (the inability to have null-valued integer 
> data) is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382734#comment-16382734
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171710263
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -501,6 +501,14 @@ def test_float_nulls(self):
 result = table.to_pandas()
 tm.assert_frame_equal(result, ex_frame)
 
+def test_float_nulls_to_ints(self):
+# ARROW-2135
+df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]})
+schema = pa.schema([pa.field("a", pa.int16(), nullable=True)])
+table = pa.Table.from_pandas(df, schema=schema)
+assert table[0].to_pylist() == [1, 2, None]
+tm.assert_frame_equal(df, table.to_pandas())
 
 Review comment:
   It looks like it's a hard cast:
   
   ```
   In [7]: pa.array([1, 2, 3.190, np.nan], type=pa.int64())
   Out[7]:
   <pyarrow.lib.Int64Array object at 0x...>
   [
 1,
 2,
 3,
 NA
   ]
   ```




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, the NaN is 
> reinterpreted as an integer when converted to a table. This seems like a bug, 
> since a known limitation in pandas (the inability to have null-valued integer 
> data) is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  





[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data

2018-03-01 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382733#comment-16382733
 ] 

Alex Hagerman commented on ARROW-2242:
--

I think these may be related? https://github.com/apache/arrow/issues/1677

> [Python] ParquetFile.read does not accommodate large binary data 
> -
>
> Key: ARROW-2242
> URL: https://issues.apache.org/jira/browse/ARROW-2242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Priority: Major
> Fix For: 0.9.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}
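The numbers in the error message line up with 32-bit offsets: a single BinaryArray chunk caps out just under 2 GiB, and the two 1 GiB row-group writes above exceed that by design (arithmetic check; the 2147483646 cap is taken from the error message above):

```python
GIB = 2 ** 30

# Two row-group writes of 2**30 one-byte strings each:
total_bytes = 2 * GIB
# The cap quoted in the error message (just under 2**31):
cap = 2147483646  # == 2**31 - 2

assert total_bytes == 2147483648
assert total_bytes > cap  # hence the failure when read back as one array
```

This is why reading the row groups individually and concatenating the resulting tables succeeds: each chunk stays under the cap.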





[jira] [Created] (ARROW-2243) [C++] Enable IPO/LTO

2018-03-01 Thread Phillip Cloud (JIRA)
Phillip Cloud created ARROW-2243:


 Summary: [C++] Enable IPO/LTO
 Key: ARROW-2243
 URL: https://issues.apache.org/jira/browse/ARROW-2243
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.8.0
Reporter: Phillip Cloud
Assignee: Phillip Cloud
 Fix For: 0.9.0


We should enable interprocedural/link-time optimization. CMake >= 3.9.4 
supports a generic way of doing this.





[jira] [Commented] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception

2018-03-01 Thread Phillip Cloud (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382617#comment-16382617
 ] 

Phillip Cloud commented on ARROW-2240:
--

PR coming shortly.

> [Python] Array initialization with leading numpy nan fails with exception
> -
>
> Key: ARROW-2240
> URL: https://issues.apache.org/jira/browse/ARROW-2240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
>
>  
> Arrow initialization fails for string arrays with leading numpy NAN
> {code:java}
> import pyarrow as pa
> import numpy as np
> pa.array([np.nan, 'str'])
> # Py3: ArrowException: Unknown error: must be real number, not str
> # Py2: ArrowException: Unknown error: a float is required{code}
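Until inference handles a leading NaN, a stdlib-only workaround is to map float NaNs to None before building the array (`nan_to_none` is a hypothetical helper; a plain `None` is understood as a null by Arrow's builders):

```python
import math

def nan_to_none(seq):
    """Replace float NaN sentinels with None, which Arrow accepts as a
    null in an otherwise string-typed column."""
    return [None if isinstance(x, float) and math.isnan(x) else x for x in seq]

print(nan_to_none([float('nan'), 'str']))  # [None, 'str']
```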





[jira] [Comment Edited] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data

2018-03-01 Thread Chris Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382568#comment-16382568
 ] 

Chris Ellison edited comment on ARROW-2242 at 3/1/18 8:10 PM:
--

Related ticket is not code-related, but workflow-related in terms of 
reading/writing binary data


was (Author: leftscreencorner):
Not code-related, but workflow related in terms of reading/writing binary data.

> [Python] ParquetFile.read does not accommodate large binary data 
> -
>
> Key: ARROW-2242
> URL: https://issues.apache.org/jira/browse/ARROW-2242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Priority: Major
> Fix For: 0.9.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}





[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data

2018-03-01 Thread Chris Ellison (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382568#comment-16382568
 ] 

Chris Ellison commented on ARROW-2242:
--

Not code-related, but workflow related in terms of reading/writing binary data.

> [Python] ParquetFile.read does not accommodate large binary data 
> -
>
> Key: ARROW-2242
> URL: https://issues.apache.org/jira/browse/ARROW-2242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Priority: Major
> Fix For: 0.9.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}





[jira] [Updated] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data

2018-03-01 Thread Chris Ellison (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Ellison updated ARROW-2242:
-
Description: 
When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
due to it not creating chunked arrays. Reading each row group individually and 
then concatenating the tables works, however.

 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
t = pa.Table.from_arrays([x], ['x'])
writer = pq.ParquetWriter(demo, t.schema)
for i in range(2):
writer.write_table(t)
writer.close()

pf = pq.ParquetFile(demo)

# pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
contain more than 2147483646 bytes, have 2147483647
t2 = pf.read()

# Works, but note, there are 32 row groups, not 2 as suggested by:
# 
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
t3 = pa.concat_tables(tables)

scenario()
{code}

  was:
When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
due to it not creating chunked arrays. Reading each row group individually and 
then concatenating the tables works, however.

 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
t = pa.Table.from_arrays([x], ['x'])
writer = pq.ParquetWriter(demo, t.schema)
for i in range(2):
writer.write_table(t)
writer.close()

pf = pq.ParquetFile(demo)

# pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
contain more than 2147483646 bytes, have 2147483647
t2 = pf.read()

# Works, but note, there are 32 row groups, not 2 as suggested by:
# 
https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

#tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
#t3 = pa.concat_tables(tables)

scenario()
{code}


> [Python] ParquetFile.read does not accommodate large binary data 
> -
>
> Key: ARROW-2242
> URL: https://issues.apache.org/jira/browse/ARROW-2242
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Priority: Major
> Fix For: 0.9.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}





[jira] [Created] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data

2018-03-01 Thread Chris Ellison (JIRA)
Chris Ellison created ARROW-2242:


 Summary: [Python] ParquetFile.read does not accommodate large 
binary data 
 Key: ARROW-2242
 URL: https://issues.apache.org/jira/browse/ARROW-2242
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Chris Ellison
 Fix For: 0.9.0


When reading a Parquet file containing more than 2 GiB of binary data, we get
an ArrowIOError because the reader does not create chunked arrays. Reading each
row group individually and then concatenating the tables works, however.

 
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


x = pa.array(list('1' * 2**30))

demo = 'demo.parquet'


def scenario():
    t = pa.Table.from_arrays([x], ['x'])
    writer = pq.ParquetWriter(demo, t.schema)
    for i in range(2):
        writer.write_table(t)
    writer.close()

    pf = pq.ParquetFile(demo)

    # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot
    # contain more than 2147483646 bytes, have 2147483647
    t2 = pf.read()

    # Works, but note, there are 32 row groups, not 2 as suggested by:
    # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing

    # tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
    # t3 = pa.concat_tables(tables)


scenario()
{code}





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382538#comment-16382538
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

cpcloud commented on issue #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#issuecomment-369709126
 
 
   @wesm @pitrou this is passing on travis: 
https://travis-ci.org/cpcloud/arrow/builds/347872453


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws the following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
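Until Arrow handles these specials, one possible stdlib-only workaround (a sketch under the assumption that nulls are acceptable, not part of the patch under review) is to map non-finite decimals to None before handing the column to Arrow:

```python
import decimal

values = [decimal.Decimal('1.1'), decimal.Decimal('NaN'),
          decimal.Decimal('Infinity')]

# Arrow decimals have no NaN/Infinity encoding, so replace non-finite
# values with None; pa.array()/Table.from_pandas then treat them as nulls.
cleaned = [v if v.is_finite() else None for v in values]
print(cleaned)  # [Decimal('1.1'), None, None]
```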





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382509#comment-16382509
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171667937
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   Looks like it's pretty straightforward to go back and forth over that 
boundary 
https://github.com/pybind/pybind11/blob/master/docs/advanced/pycpp/object.rst#casting-back-and-forth




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382490#comment-16382490
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699817
 
 
   I will try on my end as well




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382484#comment-16382484
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699582
 
 
   > Also, probably, usage of RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK should 
solve issue with selected compiler overwrite.
   
   Last I tried it seemed it didn't work. I might give it a try again...




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382481#comment-16382481
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699161
 
 
   @pitrou it seems that we already try to use `ccache` 
[there](https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L68) if 
it's present. I'm wondering whether it would make more sense to refactor the 
referenced lines and optionally use `clcache` for MSVC?
   Also, using RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK should probably solve 
the issue of the selected compiler being overwritten.
   It also seems that, starting with CMake 3.4.0, the 
[CXX_COMPILER_LAUNCHER](https://cmake.org/cmake/help/v3.4/prop_tgt/LANG_COMPILER_LAUNCHER.html#prop_tgt:%3CLANG%3E_COMPILER_LAUNCHER)
 property is available, but we currently require a minimum CMake version of 3.2.




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382451#comment-16382451
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

wesm commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171657341
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   e.g. 
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/pyarrow_cython_example.pyx




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382449#comment-16382449
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

wesm commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171657138
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   Currently, pyarrow has a _public_ Cython and C++ API. If pybind does not 
support creating a public C/C++ API that third-party libraries can use to 
expose extension types to non-Python code, it is a non-starter.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382437#comment-16382437
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369688088
 
 
   @pitrou I will check, thanks




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag

2018-03-01 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382436#comment-16382436
 ] 

Uwe L. Korn commented on ARROW-2241:


Ah, got it!

> [Python] Simple script for running all current ASV benchmarks at a commit or 
> tag
> 
>
> Key: ARROW-2241
> URL: https://issues.apache.org/jira/browse/ARROW-2241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> The objective of this is to be able to get a graph for performance at each 
> release tag for the currently-defined benchmarks (including benchmarks that 
> did not exist in older tags)





[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag

2018-03-01 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382431#comment-16382431
 ] 

Wes McKinney commented on ARROW-2241:
-

{{asv run}} does not build the C++ dependencies

> [Python] Simple script for running all current ASV benchmarks at a commit or 
> tag
> 
>
> Key: ARROW-2241
> URL: https://issues.apache.org/jira/browse/ARROW-2241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> The objective of this is to be able to get a graph for performance at each 
> release tag for the currently-defined benchmarks (including benchmarks that 
> did not exist in older tags)





[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382417#comment-16382417
 ] 

ASF GitHub Bot commented on ARROW-2236:
---

TheNeuralBit commented on a change in pull request #1683: ARROW-2236: [JS] Add 
more complete set of predicates
URL: https://github.com/apache/arrow/pull/1683#discussion_r171646498
 
 

 ##
 File path: js/test/unit/vector-tests.ts
 ##
 @@ -18,7 +18,7 @@
 import { TextEncoder } from 'text-encoding-utf-8';
 import Arrow from '../Arrow';
 import { type, TypedArray, TypedArrayConstructor, Vector } from 
'../../src/Arrow';
-import { packBools } from '../../src/util/bit'
 
 Review comment:
   Yeah good call. My syntax checker, 
[tsuquyomi](https://github.com/Quramy/tsuquyomi), complains about the `const { 
type, Vector } = Arrow;` approach so I shied away from it, but the tests run 
just fine.




> [JS] Add more complete set of predicates
> 
>
> Key: ARROW-2236
> URL: https://issues.apache.org/jira/browse/ARROW-2236
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
>
> Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and ||
> We should also support !=, <, > at the very least





[jira] [Commented] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382404#comment-16382404
 ] 

Antoine Pitrou commented on ARROW-488:
--

Is this the same as ARROW-2135, or am I missing something here?

> [Python] Implement conversion between integer coded as floating points with 
> NaN to an Arrow integer type
> 
>
> Key: ARROW-488
> URL: https://issues.apache.org/jira/browse/ARROW-488
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: Analytics
> Fix For: 0.10.0
>
>
> For example: if pandas has casted integer data to float, this would enable 
> the integer data to be recovered (so long as the values fall in the ~2^53 
> floating point range for exact integer representation)
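A minimal NumPy-only sketch of that recovery (variable names are hypothetical, and NaN positions are assumed to become Arrow nulls):

```python
import numpy as np

# Integer data that pandas upcast to float64 because of missing values.
coded = np.array([1.0, 2.0, np.nan, float(2**53)])

mask = np.isnan(coded)                       # positions that become nulls
recovered = np.where(mask, 0, coded).astype('int64')

# The round-trip is exact as long as |value| <= 2**53, the largest
# integer range that float64 represents exactly.
assert all(float(r) == c for r, c in zip(recovered[~mask], coded[~mask]))
```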





[jira] [Commented] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382366#comment-16382366
 ] 

Antoine Pitrou commented on ARROW-1894:
---

A memoryview has metadata associated to it (data type, shape, strides...). 
Should it be considered a Tensor instead?
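The metadata in question is visible from the stdlib alone (a sketch, not pyarrow code):

```python
# A memoryview carries format, shape, and strides -- tensor-like metadata
# that a flat Buffer would discard.
buf = bytearray(48)
view = memoryview(buf).cast('d', (2, 3))  # reinterpret as a 2x3 float64 grid

print(view.format, view.shape, view.strides)  # d (2, 3) (24, 8)
```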

> [Python] Treat CPython memoryview or buffer objects equivalently to 
> pyarrow.Buffer in pyarrow.serialize
> ---
>
> Key: ARROW-1894
> URL: https://issues.apache.org/jira/browse/ARROW-1894
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> These should be treated as Buffer-like on serialize. We should consider how 
> to "box" the buffers as the appropriate kind of object (Buffer, memoryview, 
> etc.) when being deserialized





[jira] [Commented] (ARROW-2081) Hdfs client isn't fork-safe

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382361#comment-16382361
 ] 

Antoine Pitrou commented on ARROW-2081:
---

For the record, if you want decent multiprocessing performance together with 
fork safety, I would suggest using the "forkserver" method, not "spawn".

(Note the C libhdfs3 library isn't fork-safe, so no need to try it out IMHO :-))
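A minimal, HDFS-free sketch of the suggested "forkserver" start method (the worker function here is hypothetical, standing in for the `ls` call in the report):

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == '__main__':
    # "forkserver" forks workers from a clean helper process rather than
    # the main process, so state such as libhdfs sockets and JVM threads
    # is never inherited -- the hazard that makes plain "fork" hang here.
    ctx = mp.get_context('forkserver')
    with ctx.Pool(2) as pool:
        print(pool.map(square, [2, 3]))  # [4, 9]
```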

> Hdfs client isn't fork-safe
> ---
>
> Key: ARROW-2081
> URL: https://issues.apache.org/jira/browse/ARROW-2081
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Jim Crist
>Priority: Major
>
> Given the following script:
>  
> {code:java}
> import multiprocessing as mp
> import pyarrow as pa
> 
> def ls(h):
>     print("calling ls")
>     return h.ls("/tmp")
> 
> if __name__ == '__main__':
>     h = pa.hdfs.connect()
> 
>     print("Using 'spawn'")
>     pool = mp.get_context('spawn').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'spawn' succeeded\n")
> 
>     print("Using 'fork'")
>     pool = mp.get_context('fork').Pool(2)
>     results = pool.map(ls, [h, h])
>     sol = h.ls("/tmp")
>     for r in results:
>         assert r == sol
>     print("'fork' succeeded")
> {code}
>  
> Results in the following output:
>  
> {code:java}
> $ python test.py
> Using 'spawn'
> calling ls
> calling ls
> 'spawn' succeeded
> 
> Using 'fork'
> {code}
>  
> The process then hangs, and I have to `kill -9` the forked worker processes.
>  
> I'm unable to get the libhdfs3 driver to work, so I'm unsure if this is a 
> problem with libhdfs or just arrow's use of it (a quick google search didn't 
> turn up anything useful).





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382338#comment-16382338
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

pitrou commented on a change in pull request #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#discussion_r171628747
 
 

 ##
 File path: ci/travis_install_osx.sh
 ##
 @@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
 
 Review comment:
   Not really, though given the filename it might be better to avoid further 
mistakes :-)




> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws the following exception:
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382337#comment-16382337
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171628604
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   I'll open a JIRA if there isn't already one, and start a mailing list 
discussion. GitHub is getting a bit chatty.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382333#comment-16382333
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171628259
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   Thus completely hiding the fact that there's a `shared_ptr` in play from 
Python users.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382335#comment-16382335
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171628429
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   To be clear, I'm advocating for the replacement of Cython with pybind11.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382329#comment-16382329
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171628034
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
 self.type = pyarrow_wrap_data_type(self.tp.type())
 
 def __repr__(self):
+if self.tp is NULL:
 
 Review comment:
   > If the constructor took the C++ shared_ptr as argument and checked its 
validity, you wouldn't need to sprinkle checks in the other methods/properties.
   
   With pybind the situation is even better, because it would allow us to have 
constructors for numpy arrays and python lists with the same API e.g., 
`pa.Tensor([1])`/`pa.Tensor(np.array([1]))` without having to deal with 
initialization by hand at all.
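The validate-once idea under discussion can be sketched in plain Python (a hypothetical class, not the pyarrow implementation):

```python
class Tensor:
    def __init__(self, backing):
        # Validate the wrapped handle once at construction, so __repr__,
        # properties, etc. need no per-method NULL checks.
        if backing is None:
            raise ValueError("Tensor requires an initialized backing object; "
                             "use a factory such as Tensor.from_data()")
        self._backing = backing

    def __repr__(self):
        # Safe: __init__ guarantees _backing is set.
        return 'Tensor(%r)' % (self._backing,)
```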




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382323#comment-16382323
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369660991
 
 
   I think that's because `CMAKE_CXX_COMPILER` forcefully overrides the 
compiler command. When using the Visual Studio generators, you traditionally 
don't need to run `vcvarsall.bat` (presumably because cmake would hardcode the 
full compiler path), but then `clcache` fails to find the compiler.
   
   So it's possible that calling `vcvarsall.bat` is all that's needed here. But 
that would also change the workflow people may be accustomed to.




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382322#comment-16382322
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369660991
 
 
   I think that's because `CMAKE_CXX_COMPILER` forcefully overrides the 
compiler command. When using the Visual Studio generators, you traditionally 
don't need to run `vcvarsall.bat` (presumably because cmake would hardcode the 
full compiler path), but then `clcache` fails to find the compiler.
   
   So it's possible that calling `vcvarsall.bat` is all that's needed here. But 
that would also change the workflow people may be accustomed to.




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag

2018-03-01 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382321#comment-16382321
 ] 

Uwe L. Korn commented on ARROW-2241:


Isn't this what {{asv run}} is for?

> [Python] Simple script for running all current ASV benchmarks at a commit or 
> tag
> 
>
> Key: ARROW-2241
> URL: https://issues.apache.org/jira/browse/ARROW-2241
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.9.0
>
>
> The objective is to be able to graph performance at each release tag for the 
> currently-defined benchmarks (including benchmarks that did not exist in 
> older tags).





[jira] [Created] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag

2018-03-01 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2241:
---

 Summary: [Python] Simple script for running all current ASV 
benchmarks at a commit or tag
 Key: ARROW-2241
 URL: https://issues.apache.org/jira/browse/ARROW-2241
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


The objective is to be able to graph performance at each release tag for the 
currently-defined benchmarks (including benchmarks that did not exist in older 
tags).
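A minimal sketch of such a script, assuming the standard `asv run` CLI and illustrative tag names (the `asv_commands` helper is hypothetical): it builds one `asv run <tag>^!` invocation per release tag, so every tag is measured with the currently-defined benchmarks. The commands are only constructed here; in practice they would be passed to `subprocess.run`.

```python
def asv_commands(tags):
    # "<tag>^!" is git revision syntax for exactly that commit,
    # which `asv run` accepts as its range argument.
    return [["asv", "run", "{}^!".format(tag)] for tag in tags]


# Illustrative tags; a real script would read them from `git tag`.
for cmd in asv_commands(["apache-arrow-0.7.0", "apache-arrow-0.8.0"]):
    print(" ".join(cmd))
```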





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382303#comment-16382303
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

pitrou commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171623439
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
   Besides, aren't the hand-written checks mandated by the current constructor 
signature and the fact that you have to go through a classmethod to create a 
proper instance of each Cython wrapper class? If the constructor took the C++ 
`shared_ptr` as argument and checked its validity, you wouldn't need to 
sprinkle checks in the other methods/properties.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382295#comment-16382295
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

pitrou commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171621956
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
   I have experience with both Cython and the Python C API; Cython is a much 
more reasonable choice to me. The effort spent on comparable features is easily 
2x or 3x larger when writing C code against the CPython API (and the 
opportunity for bugs is also much higher, given you have to deal with 
refcounting and GC details by hand). Furthermore, Cython makes it easy to use 
high-level Python features that are a major pain to emulate in plain C.
   
   Just my 2 cents :-)




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382286#comment-16382286
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171620833
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
   I would think there's much less boilerplate to write a pure C API + pybind 
than to write a pure C API + C extensions. pybind [supports 
numpy](http://pybind11.readthedocs.io/en/stable/advanced/pycpp/numpy.html#) as 
well, hiding a lot of the complexity of the C APIs behind the guarantees 
provided by C++ RAII, objects, and templates.
   
   The pure C API would look the same regardless; it's really just a question 
of whether we want to take advantage of the convenience of pybind, or hand roll 
extensions where we would have to deal with reference counting and numpy's C 
API.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Updated] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2237:

Fix Version/s: 0.9.0

> [Python] Huge tables test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.9.0
>
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> &type, buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}





[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382265#comment-16382265
 ] 

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object 
nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-369649945
 
 
   Rebasing this again




> [Python] Option for integer object nulls
> 
>
> Key: ARROW-2205
> URL: https://issues.apache.org/jira/browse/ARROW-2205
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Albert Shieh
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I have a use case where the loss of precision in casting integers to floats 
> matters, and pandas supports storing integers with nulls without loss of 
> precision in object columns. However, a roundtrip through arrow will cast the 
> object columns to float columns, even though the object columns are stored in 
> arrow as integers with nulls.
> This is a minimal example demonstrating the behavior of a roundtrip:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"a": np.array([None, 1], dtype=object)})
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {code}
>   a
> 0  None
> 1     1
>  a
> 0  NaN
> 1  1.0
> {code}
> This seems to be the desired behavior, given test_int_object_nulls in 
> test_convert_pandas.
> I think it would be useful to add an option in the to_pandas methods to allow 
> integers with nulls to be returned as object columns. The option can default 
> to false in order to preserve the current behavior.
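A pure-Python sketch of the proposed option's semantics (the `to_pandas_column` helper is hypothetical, not pyarrow's API): with the option enabled, integers with nulls come back as exact Python ints in an object column; with it disabled (the current behavior), they are cast to floats, which loses precision for integers beyond 2**53.

```python
def to_pandas_column(values, integer_object_nulls=False):
    # Hypothetical sketch of the proposed to_pandas option.
    if integer_object_nulls:
        # Object column: nulls stay None, ints stay exact.
        return [None if v is None else v for v in values]
    # Current behavior: cast to float, nulls become NaN.
    return [float("nan") if v is None else float(v) for v in values]


big = 2**60 + 1
print(to_pandas_column([None, big], integer_object_nulls=True))  # [None, 1152921504606846977]
print(to_pandas_column([None, big])[1] == big)  # False: the float cast lost precision
```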





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382257#comment-16382257
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

wesm commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369648904
 
 
   @MaxRis can take a look. How does the error you linked to arise? 




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382252#comment-16382252
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

wesm commented on issue #1682: ARROW-2232: [Python] pyarrow.Tensor constructor 
segfaults
URL: https://github.com/apache/arrow/pull/1682#issuecomment-369647759
 
 
   Test suite is failing for some reason




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382249#comment-16382249
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

wesm commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171613766
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
   Is pybind really an option? It seems more likely we would migrate bindings 
to plain C extensions so that we can develop a more mature public C API for 
pyarrow




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Created] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception

2018-03-01 Thread Florian Jetter (JIRA)
Florian Jetter created ARROW-2240:
-

 Summary: [Python] Array initialization with leading numpy nan 
fails with exception
 Key: ARROW-2240
 URL: https://issues.apache.org/jira/browse/ARROW-2240
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Florian Jetter


 

Array initialization fails for string arrays with a leading numpy NaN:
{code:java}
import pyarrow as pa
import numpy as np

pa.array([np.nan, 'str'])
# Py3: ArrowException: Unknown error: must be real number, not str
# Py2: ArrowException: Unknown error: a float is required{code}
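Until this is fixed, one workaround is to normalize NaN markers to `None` before calling `pa.array`, so type inference sees an unambiguous null instead of a float. A minimal sketch (the `nulls_to_none` helper is hypothetical; `pa.array` itself is not invoked here):

```python
import math


def nulls_to_none(values):
    # Replace float NaNs with None so a mixed string/null list
    # carries an unambiguous null marker for type inference.
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]


print(nulls_to_none([float("nan"), "str"]))  # [None, 'str']
```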





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382205#comment-16382205
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-369636633
 
 
   AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.157




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's functionality to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  
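The garbage value above is recognizably `INT64_MIN`: reinterpreting or coercing a NaN into a 64-bit integer commonly yields `-2**63` on x86. A small sketch (the `checked_int64_cast` helper is hypothetical) shows the sentinel and the kind of guard that would raise instead of silently producing it:

```python
import math

INT64_MIN = -2**63  # == -9223372036854775808, the bogus value above


def checked_int64_cast(values):
    # Hypothetical guard: refuse NaN instead of emitting a sentinel.
    out = []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            raise ValueError("cannot cast NaN to int64; keep it as a null")
        out.append(int(v))
    return out


print(INT64_MIN)  # -9223372036854775808
print(checked_int64_cast([1.0, 2.0]))  # [1, 2]
```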





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382185#comment-16382185
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171598294
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
   This is really a stopgap until we can replace our Cython API with pybind11. 
Cython's inability to deal with `shared_ptr` is a huge burden right now. We have 
all these handwritten checks to make sure that an object is valid, which would 
be completely unnecessary if we moved to pybind.
   
   In any event, I'll add these checks here so we can get this merged.




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Assigned] (ARROW-2239) [C++] Update build docs for Windows

2018-03-01 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-2239:
-

Assignee: Antoine Pitrou

> [C++] Update build docs for Windows
> ---
>
> Key: ARROW-2239
> URL: https://issues.apache.org/jira/browse/ARROW-2239
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 0.9.0
>
>
> We should update the C++ build docs for Windows to recommend use of Ninja and 
> clcache for faster builds.





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382179#comment-16382179
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

cpcloud commented on a change in pull request #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#discussion_r171597154
 
 

 ##
 File path: ci/travis_install_osx.sh
 ##
 @@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
 
 Review comment:
   @pitrou This is already conditioned on in `.travis.yml` just before this 
script is called. Is it really necessary to condition on it again?




> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
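The special values that trip up the conversion are exactly the ones the stdlib `decimal` module can detect up front, e.g. via `Decimal.is_nan()` and `Decimal.is_infinite()`. A small sketch of such a pre-check (the `classify_decimal` helper is hypothetical):

```python
from decimal import Decimal


def classify_decimal(d):
    # NaN and infinities have no digit representation to serialize,
    # so a conversion path must special-case them.
    if d.is_nan():
        return "nan"
    if d.is_infinite():
        return "inf"
    return "finite"


print([classify_decimal(Decimal(s)) for s in ("1.1", "NaN", "Infinity")])
# ['finite', 'nan', 'inf']
```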





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382155#comment-16382155
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

cpcloud commented on a change in pull request #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#discussion_r171594338
 
 

 ##
 File path: ci/travis_install_osx.sh
 ##
 @@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
 
 Review comment:
   Yes




> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> 
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2239) [C++] Update build docs for Windows

2018-03-01 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2239:
-

 Summary: [C++] Update build docs for Windows
 Key: ARROW-2239
 URL: https://issues.apache.org/jira/browse/ARROW-2239
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, Documentation
Reporter: Antoine Pitrou
 Fix For: 0.9.0


We should update the C++ build docs for Windows to recommend use of Ninja and 
clcache for faster builds.





[jira] [Commented] (ARROW-2020) [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382116#comment-16382116
 ] 

Antoine Pitrou commented on ARROW-2020:
---

Ok. Here the changeset does "fix" the crash somehow, but it still produces 
bogus results.

This issue might be related to ARROW-2026, in that when you pass 
{{coerce_timestamps}}, {{write_table}} seems to save the timestamps as int64 
rather than int96.

> [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit 
> timestamps
> --
>
> Key: ARROW-2020
> URL: https://issues.apache.org/jira/browse/ARROW-2020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: timestamps
> Fix For: 0.9.0
>
> Attachments: crash-report.txt
>
>
> If you try to write a PyArrow table containing nanosecond-resolution 
> timestamps to Parquet using `coerce_timestamps` and 
> `use_deprecated_int96_timestamps=True`, the Arrow library will segfault.
> The crash doesn't happen if you don't coerce the timestamp resolution or if 
> you don't use 96-bit timestamps.
>  
>  
> *To Reproduce:*
>  
> {code:python}
>  
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('ns')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> coerce_timestamps='us',  # 'ms' works too
> use_deprecated_int96_timestamps=True){code}
>  
> See attached file for the crash report.
>  
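For background on what `use_deprecated_int96_timestamps` changes: Parquet's legacy INT96 timestamps store a Julian day number plus nanoseconds within the day, while INT64 timestamps are a single epoch offset in the chosen unit. A sketch of that split, assuming the standard Julian day number for the Unix epoch (the helper name is ours, not a pyarrow API):

```python
JULIAN_UNIX_EPOCH = 2_440_588      # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 10**9

def to_int96_parts(epoch_ns):
    # Split an epoch offset in nanoseconds into the (julian_day, nanos_of_day)
    # pair that INT96 timestamps encode.
    days, nanos = divmod(epoch_ns, NANOS_PER_DAY)
    return JULIAN_UNIX_EPOCH + days, nanos
```

Coercing to 'us' or 'ms' while also requesting INT96 mixes these two representations, which is the code path that crashes here.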





[jira] [Closed] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns

2018-03-01 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn closed ARROW-2194.
--
Resolution: Not A Problem

> [Python] Pandas columns metadata incorrect for empty string columns
> ---
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381897#comment-16381897
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369571901
 
 
   Also I'm not sure whether we have a Windows developer on board; I'm merely 
launching a VM from time to time but otherwise work on Ubuntu :-)




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381895#comment-16381895
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369571559
 
 
   The failure at 
https://ci.appveyor.com/project/pitrou/arrow/build/1.0.155/job/q31movster4v84d9 
shows this can lead to inconsistencies or errors: cmake first tries to detect 
the compiler from user-supplied information (generator, environment variables), 
then the clcache setting overrides that detection.
   
   Either we add logic to try to avoid such errors, or we simply let people 
override CC/CXX if they want to use clcache (status quo).
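The detection step itself is simple. A rough Python analogue of what such a build script might do (illustrative only: the real change implements this in cmake, and the executable names here are assumptions):

```python
import shutil

def pick_compiler_launcher():
    # Prefer clcache when it is on PATH; otherwise fall back to plain cl.
    # The conflict described above arises when this choice disagrees with
    # a compiler the user already selected via generator or CC/CXX.
    clcache = shutil.which('clcache')
    if clcache is not None:
        return clcache
    return shutil.which('cl')
```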




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381881#comment-16381881
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in 
cmake configuration
URL: https://github.com/apache/arrow/pull/1684#issuecomment-369569373
 
 
   Example AppVeyor build at 
https://ci.appveyor.com/project/pitrou/arrow/build/1.0.155




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Updated] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2238:
--
Labels: pull-request-available  (was: )

> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381880#comment-16381880
 ] 

ASF GitHub Bot commented on ARROW-2238:
---

pitrou opened a new pull request #1684: ARROW-2238: [C++] Detect and use 
clcache in cmake configuration
URL: https://github.com/apache/arrow/pull/1684
 
 
   




> [C++] Detect clcache in cmake configuration
> ---
>
> Key: ARROW-2238
> URL: https://issues.apache.org/jira/browse/ARROW-2238
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> By default Windows builds should use clcache if installed.





[jira] [Created] (ARROW-2238) [C++] Detect clcache in cmake configuration

2018-03-01 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2238:
-

 Summary: [C++] Detect clcache in cmake configuration
 Key: ARROW-2238
 URL: https://issues.apache.org/jira/browse/ARROW-2238
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


By default Windows builds should use clcache if installed.





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381816#comment-16381816
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when 
casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-369552237
 
 
   I addressed some review comments now.




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  
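A note on the bogus value in the output: -9223372036854775808 is INT64_MIN, which is what casting a double NaN to int64 yields on x86 (the "integer indefinite" result). A stdlib-only check:

```python
import struct

INT64_MIN = -2**63
assert -9223372036854775808 == INT64_MIN

# A NaN is an ordinary 8-byte double at the bit level; its exponent field is
# all ones, which is what makes the C cast to int64 invalid in the first place.
bits, = struct.unpack('<Q', struct.pack('<d', float('nan')))
assert (bits >> 52) & 0x7FF == 0x7FF
```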





[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381796#comment-16381796
 ] 

ASF GitHub Bot commented on ARROW-2232:
---

pitrou commented on a change in pull request #1682: ARROW-2232: [Python] 
pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171513562
 
 

 ##
 File path: python/pyarrow/array.pxi
 ##
 @@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())
 
     def __repr__(self):
+        if self.tp is NULL:
 
 Review comment:
  Having `__repr__` raise isn't really nice, because it breaks debugging. It 
would be better to return a placeholder representation instead. Also you 
probably want to protect other methods, and raise there if the object isn't 
initialized.
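The suggested pattern, sketched in plain Python (the class, attribute names, and placeholder string here are illustrative, not pyarrow's actual Cython code):

```python
class Tensor:
    def __init__(self, handle=None):
        # `handle` stands in for the wrapped C++ pointer, which is NULL when
        # the object was constructed without going through a factory function.
        self._handle = handle

    def __repr__(self):
        # Return a placeholder instead of raising, so debuggers and REPLs that
        # call repr() on half-initialized objects keep working.
        if self._handle is None:
            return '<Tensor (uninitialized)>'
        return '<Tensor {!r}>'.format(self._handle)

    def shape(self):
        # Non-repr methods may legitimately raise on uninitialized objects.
        if self._handle is None:
            raise ValueError('Tensor is not initialized')
        return ()
```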




> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Updated] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults

2018-03-01 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2232:
--
Labels: pull-request-available  (was: )

> [Python] pyarrow.Tensor constructor segfaults
> -
>
> Key: ARROW-2232
> URL: https://issues.apache.org/jira/browse/ARROW-2232
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the 
> interpreter.





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381773#comment-16381773
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171509916
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) {
   return null_count;
 }
 
+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+                        bool use_pandas_null_sentinels,
+                        std::shared_ptr<Buffer>* out_null_bitmap_,
+                        int64_t* out_null_count) {
+    NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+    RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+    *out_null_bitmap_ = converter.null_bitmap_;
+    *out_null_count = converter.null_count_;
+    return Status::OK();
+  }
+
+  template <int TYPE>
+  Status Visit(PyArrayObject* arr) {
+    typedef internal::npy_traits<TYPE> traits;
+
+    const bool null_sentinels_possible =
+        // Always treat Numpy's NaT as null
+        TYPE == NPY_DATETIME ||
 
 Review comment:
   By the way, I don't know what that is, but this is required to have the 
tests pass. Why do we always treat NaT as null but not floating-point NaN? 
@wesm 
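For reference on the NaT half of the question: NumPy encodes NaT as the INT64_MIN sentinel inside datetime64/timedelta64 data (NPY_DATETIME_NAT in the C API), so a converter can detect it with a plain integer compare, whereas floats reserve no integer payload for NaN. A sketch of that check:

```python
NPY_NAT = -2**63  # NumPy's NaT sentinel (NPY_DATETIME_NAT in the C API)

def is_nat(payload):
    # A datetime64 value is NaT exactly when its int64 payload equals the
    # sentinel; this is why NaT can always be treated as null cheaply.
    return payload == NPY_NAT
```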




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  





[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381767#comment-16381767
 ] 

Antoine Pitrou commented on ARROW-2237:
---

[~pcmoritz]

> [Python] Huge tables test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> &type, buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}





[jira] [Created] (ARROW-2237) [Python] Huge tables test failure

2018-03-01 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2237:
-

 Summary: [Python] Huge tables test failure
 Key: ARROW-2237
 URL: https://issues.apache.org/jira/browse/ARROW-2237
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Antoine Pitrou


This is a new failure here (Ubuntu 16.04, x86-64):
{code}
_ test_use_huge_pages _
Traceback (most recent call last):
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, in 
test_use_huge_pages
create_object(plasma_client, 1)
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
create_object
seal=seal)
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
create_object_with_id
memory_buffer = client.create(object_id, data_size, metadata)
  File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
/home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
&type, buffer)
Encountered unexpected EOF
 Captured stderr call -
Allowing the Plasma store to use up to 0.1GB of memory.
Starting object store with directory /mnt/hugepages and huge page support 
enabled
create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
{code}
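One way a test could skip cleanly when huge pages are not configured is to check the standard Linux counters in /proc/meminfo first. A parsing sketch (the helper is hypothetical; a real test would feed it the file's contents and skip when it returns False):

```python
def huge_pages_available(meminfo_text):
    # Parse the HugePages_Free counter from /proc/meminfo content.
    for line in meminfo_text.splitlines():
        if line.startswith('HugePages_Free:'):
            return int(line.split()[1]) > 0
    return False

# In a real test: huge_pages_available(open('/proc/meminfo').read())
```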





[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381745#comment-16381745
 ] 

ASF GitHub Bot commented on ARROW-2145:
---

pitrou commented on a change in pull request #1651: 
ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal 
conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#discussion_r171504185
 
 

 ##
 File path: ci/travis_install_osx.sh
 ##
 @@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
 
 Review comment:
   Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?




> [Python] Decimal conversion not working for NaN values
> --
>
> Key: ARROW-2145
> URL: https://issues.apache.org/jira/browse/ARROW-2145
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Antony Mayi
>Assignee: Phillip Cloud
>Priority: Major
>  Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
> decimal.Decimal('NaN')]}))
> {code}
> throws following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in 
> dataframe_to_arrays
> convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in 
> <listcomp>
> for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in 
> convert_column
> return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status 
> (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type 
> str)
> {code}
> Same problem with other special decimal values like {{infinity}}.





[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas

2018-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381742#comment-16381742
 ] 

ASF GitHub Bot commented on ARROW-2135:
---

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix 
NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171503669
 
 

 ##
 File path: cpp/src/arrow/python/type_traits.h
 ##
 @@ -127,8 +134,14 @@ template <>
 struct npy_traits<NPY_OBJECT> {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }
 
 Review comment:
   Nice catch :-) I'm not sure how to test it. Defining `isnull` is necessary 
for compiling, but that path isn't taken at runtime as object arrays are 
handled separately.




> [Python] NaN values silently casted to int64 when passing explicit schema for 
> conversion in Table.from_pandas
> -
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Matthew Gilbert
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the 
> NaN is improperly cast. Since pandas casts these to floats, when converted to 
> a table the NaN is interpreted as an integer. This seems like a bug, since a 
> known limitation in pandas (the inability to have null-valued integer data) 
> is taking precedence over Arrow's ability to store these as an IntArray 
> with nulls.
>  
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a":[1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> 
> chunk 0: 
> [
>   1,
>   2,
>   -9223372036854775808
> ]{code}
>  





[jira] [Commented] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns

2018-03-01 Thread Florian Jetter (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381739#comment-16381739
 ] 

Florian Jetter commented on ARROW-2194:
---

I haven't checked master, but on `0.8.0` all other column types write their 
pandas type explicitly even though the DataFrame is empty. I have no objections 
as long as this behavior is consistent across dtypes.

> [Python] Pandas columns metadata incorrect for empty string columns
> ---
>
> Key: ARROW-2194
> URL: https://issues.apache.org/jira/browse/ARROW-2194
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Florian Jetter
>Priority: Minor
> Fix For: 0.9.0
>
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas 
> DataFrame is unexpectedly {{float64}}
>  
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': 
> np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
> u'metadata': None,
> u'name': u'bytes',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'unicode',
> u'metadata': None,
> u'name': u'unicode',
> u'numpy_type': u'object',
> u'pandas_type': u'float64'},
> {u'field_name': u'__index_level_0__',
> u'metadata': None,
> u'name': None,
> u'numpy_type': u'int64',
> u'pandas_type': u'int64'}]{code}
>  
> Tested on Debian 8 with python2.7 and python 3.6.4


