[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383311#comment-16383311 ] Antoine Pitrou commented on ARROW-2237: --- {{/mnt/hugepages}} exists by default here. Though there's something weird: it's {{/dev/hugepages}} that's mounted if I understand correctly: {code:bash} $ mount | \grep hugepages hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) {code} > [Python] Huge tables test failure > - > > Key: ARROW-2237 > URL: https://issues.apache.org/jira/browse/ARROW-2237 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.9.0 > > > This is a new failure here (Ubuntu 16.04, x86-64): > {code} > _ test_use_huge_pages > _ > Traceback (most recent call last): > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, > in test_use_huge_pages > create_object(plasma_client, 1) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in > create_object > seal=seal) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in > create_object_with_id > memory_buffer = client.create(object_id, data_size, metadata) > File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create > File "error.pxi", line 79, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 > code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer) > /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, > &type, buffer) > Encountered unexpected EOF > Captured stderr call > - > Allowing the Plasma store to use up to 0.1GB of memory. > Starting object store with directory /mnt/hugepages and huge page support > enabled > create_buffer failed to open file /mnt/hugepages/plasmapSNc0X > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
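The mount listing above can also be checked programmatically. A minimal sketch (not part of the test suite) that parses `mount`-style output to find hugetlbfs mount points, assuming the output format shown in the comment:

```python
def hugetlbfs_mountpoints(mount_output):
    """Return mount points of hugetlbfs filesystems from `mount` output."""
    points = []
    for line in mount_output.splitlines():
        # `mount` lines look like: "<src> on <target> type <fstype> (<opts>)"
        parts = line.split()
        if len(parts) >= 5 and parts[4] == "hugetlbfs":
            points.append(parts[2])
    return points

sample = "hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)"
print(hugetlbfs_mountpoints(sample))  # ['/dev/hugepages']
```

A test-skip helper could call this and skip when `/mnt/hugepages` is not among the returned mount points.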
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383296#comment-16383296 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369847180 @MaxRis running `clcache -s` gives you aggregate statistics for the cache, so you can see (by the number of hits and misses) if clcache was used at all. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1391) [Python] Benchmarks for python serialization
[ https://issues.apache.org/jira/browse/ARROW-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383033#comment-16383033 ] Alex Hagerman commented on ARROW-1391: -- I see recent commits in the repo for the benchmarks. Is this still needed? If so, is there any guidance on where the nightly benchmarks might be located, or how to look into this? > [Python] Benchmarks for python serialization > > > Key: ARROW-1391 > URL: https://issues.apache.org/jira/browse/ARROW-1391 > Project: Apache Arrow > Issue Type: Wish >Reporter: Philipp Moritz >Priority: Minor > > It would be great to have a suite of relevant benchmarks for the Python > serialization code in ARROW-759. These could be used to guide profiling and > performance improvements. > Relevant use cases include: > - dictionaries of large numpy arrays that are used to represent weights of a > neural network > - long lists of primitive types like ints, floats or strings > - lists of user defined python objects -- This message was sent by Atlassian JIRA (v7.6.3#76005)
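The payload classes listed in the issue (long lists of ints, floats, strings, etc.) can be timed with a small harness. A sketch using only stdlib `timeit` and `pickle` as a stand-in serializer; pyarrow's serialization would be swapped in, and the `bench` helper name is illustrative:

```python
import pickle
import timeit

def bench(name, payload, repeat=3, number=10):
    """Time a serialize/deserialize round trip; return best seconds per run."""
    def round_trip():
        pickle.loads(pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL))
    best = min(timeit.repeat(round_trip, repeat=repeat, number=number)) / number
    print("%-20s %.6f s" % (name, best))
    return best

# Payloads mirroring the use cases in the issue description
payloads = {
    "long int list": list(range(100000)),
    "long float list": [float(i) for i in range(100000)],
    "long str list": [str(i) for i in range(100000)],
}
for name, payload in payloads.items():
    bench(name, payload)
```

Dictionaries of large numpy arrays and user-defined objects would slot into `payloads` the same way.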
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382974#comment-16382974 ] Philipp Moritz commented on ARROW-2237: --- After creating /mnt/hugepages with
{code:bash}
sudo mkdir -p /mnt/hugepages
sudo mount -t hugetlbfs -o uid=`id -u` -o gid=`id -g` none /mnt/hugepages
sudo bash -c "echo `id -g` > /proc/sys/vm/hugetlb_shm_group"
sudo bash -c "echo 2 > /proc/sys/vm/nr_hugepages"
{code}
I can't reproduce the test failure on Ubuntu.
[jira] [Commented] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type
[ https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382883#comment-16382883 ] Wes McKinney commented on ARROW-488: As currently scoped, yes. This functionality is not available in {{arrow::compute::Cast}} though, so perhaps we can repurpose this JIRA to add this functionality, which may be a bit more complicated (since {{Cast}} is not yet able to deal with any null sentinels at all) > [Python] Implement conversion between integer coded as floating points with > NaN to an Arrow integer type > > > Key: ARROW-488 > URL: https://issues.apache.org/jira/browse/ARROW-488 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 0.10.0 > > > For example: if pandas has casted integer data to float, this would enable > the integer data to be recovered (so long as the values fall in the ~2^53 > floating point range for exact integer representation) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
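The recovery described in the issue can be sketched in pure Python. This is a toy model of the intended cast semantics, not the {{arrow::compute::Cast}} API: NaN becomes a null, and every other value must be an exact integer within the ~2^53 range where doubles represent integers exactly (function name is illustrative):

```python
import math

def floats_to_nullable_ints(values):
    """Convert floats to ints, mapping NaN to None; reject inexact values."""
    out = []
    for v in values:
        if math.isnan(v):
            out.append(None)  # NaN is the null sentinel
        elif v != int(v) or abs(v) > 2**53:
            raise ValueError("not exactly representable as an integer: %r" % v)
        else:
            out.append(int(v))
    return out

print(floats_to_nullable_ints([1.0, float("nan"), 3.0]))  # [1, None, 3]
```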
[jira] [Assigned] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown
[ https://issues.apache.org/jira/browse/ARROW-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2244: --- Assignee: Wes McKinney > [C++] Slicing NullArray should not cause the null count on the internal data > to be unknown > -- > > Key: ARROW-2244 > URL: https://issues.apache.org/jira/browse/ARROW-2244 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382875#comment-16382875 ] Philipp Moritz commented on ARROW-2237: --- Which commands did you use to create /mnt/hugepages? (The test is skipped if it doesn't exist.) I can try to reproduce this on a fresh image, but steps to reproduce on, say, an Ubuntu image would be appreciated!
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382876#comment-16382876 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on issue #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#issuecomment-369769117 This PR is ready to go, modulo any more review comments. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382874#comment-16382874 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171728016 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: For sure. Didn't mean to derail this conversation.
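The guard in the diff above protects {{__repr__}} when the wrapped C++ pointer was never initialized. A plain-Python sketch of the same pattern (a hypothetical class, not pyarrow's actual code), where `None` plays the role of a `NULL` pointer:

```python
class Wrapper:
    """Toy wrapper whose internal handle may be unset (mimics a NULL pointer)."""

    def __init__(self, handle=None):
        self._handle = handle

    def __repr__(self):
        # Corresponds to the `if self.tp is NULL:` check in the Cython diff
        if self._handle is None:
            return "<Wrapper (uninitialized)>"
        return "<Wrapper handle=%r>" % (self._handle,)

print(repr(Wrapper()))    # <Wrapper (uninitialized)>
print(repr(Wrapper(42)))  # <Wrapper handle=42>
```

Without the guard, dereferencing the unset handle is exactly what crashes the interpreter in the reported segfault.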
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382868#comment-16382868 ] Robert Nishihara commented on ARROW-2237: - Interesting, does {{/mnt/hugepages}} exist locally? If not, the test should be skipped. If yes, then maybe there is some permission error or something.
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382867#comment-16382867 ] ASF GitHub Bot commented on ARROW-2232: --- wesm commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171727391 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Let's take up some prototyping in a separate patch or repo to understand what a pybind11-based C++ API for pyarrow would look like or how it would work. This is already being used in turbodbc (which uses pybind11 for its bindings -- see https://github.com/blue-yonder/turbodbc/blob/master/cpp/turbodbc_arrow/Library/src/arrow_result_set.cpp#L252)
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382860#comment-16382860 ] Wes McKinney commented on ARROW-2237: - This looks like a local failure.
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382853#comment-16382853 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171726050 ## File path: python/pyarrow/tests/test_convert_pandas.py ## @@ -1371,6 +1371,69 @@ def test_structarray(self): series = pd.Series(arr.to_pandas()) tm.assert_series_equal(series, expected) +def test_from_numpy(self): +dt = np.dtype([('x', np.int32), + (('y_title', 'y'), np.bool_)]) +ty = pa.struct([pa.field('x', pa.int32()), +pa.field('y', pa.bool_())]) + +data = np.array([], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [] + +data = np.array([(42, True), (43, False)], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [{'x': 42, 'y': True}, + {'x': 43, 'y': False}] + +# With mask +arr = pa.array(data, mask=np.bool_([False, True]), type=ty) +assert arr.to_pylist() == [{'x': 42, 'y': True}, None] + +# Trivial struct type +dt = np.dtype([]) +ty = pa.struct([]) + +data = np.array([], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [] + +data = np.array([(), ()], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [{}, {}] + +def test_from_numpy_nested(self): +dt = np.dtype([('x', np.dtype([('xx', np.int8), + ('yy', np.bool_)])), + ('y', np.int16)]) +ty = pa.struct([pa.field('x', pa.struct([pa.field('xx', pa.int8()), + pa.field('yy', pa.bool_())])), +pa.field('y', pa.int16())]) + +data = np.array([], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [] + +data = np.array([((1, True), 2), ((3, False), 4)], dtype=dt) +arr = pa.array(data, type=ty) +assert arr.to_pylist() == [{'x': {'xx': 1, 'yy': True}, 'y': 2}, + {'x': {'xx': 3, 'yy': False}, 'y': 4}] + +def test_from_numpy_bad_input(self): +ty 
= pa.struct([pa.field('x', pa.int32()), +pa.field('y', pa.bool_())]) +dt = np.dtype([('x', np.int32), + ('z', np.bool_)]) + +data = np.array([], dtype=dt) +with pytest.raises(TypeError, + match="Missing field 'y'"): +pa.array(data, type=ty) +data = np.int32([]) +with pytest.raises(TypeError, + match="Expected struct array"): +pa.array(data, type=ty) Review comment: Per above, it may be worth writing a "large memory" test with the `large_memory` pytest mark (which we can run locally, but not in Travis CI) where we have a field that overflows the 2G in a BinaryArray so we can test the rechunking / splitting of the null bitmap. I guess you'll have to pass a mask to get some nulls to make sure the logic is correct This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Conversion from Numpy struct array unimplemented > - > > Key: ARROW-2142 > URL: https://issues.apache.org/jira/browse/ARROW-2142 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > {code:python} > >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)])) > >>> arr > array([(1.5,)], dtype=[('x', ' >>> arr[0] > (1.5,) > >>> arr['x'] > array([1.5], dtype=float32) > >>> arr['x'][0] > 1.5 > >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) > Traceback (most recent call last): > File "", line 1, in > pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 85, in pyarrow.lib.check_status > ArrowNotImplementedError: > /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: > converter.
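The conversion semantics exercised by the tests above can be modelled without numpy or pyarrow. A sketch of what {{to_pylist()}} should produce for struct data, given rows as tuples, field names, and an optional mask (function and parameter names are illustrative):

```python
def struct_rows_to_pylist(rows, fields, mask=None):
    """Turn tuples of field values into dicts; masked entries become None."""
    out = []
    for i, row in enumerate(rows):
        if mask is not None and mask[i]:
            out.append(None)  # a masked-out entry maps to a null
        else:
            out.append(dict(zip(fields, row)))
    return out

rows = [(42, True), (43, False)]
print(struct_rows_to_pylist(rows, ["x", "y"]))
# [{'x': 42, 'y': True}, {'x': 43, 'y': False}]
print(struct_rows_to_pylist(rows, ["x", "y"], mask=[False, True]))
# [{'x': 42, 'y': True}, None]
```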
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382851#comment-16382851 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171721758 ## File path: cpp/src/arrow/array.cc ## @@ -772,6 +773,105 @@ std::shared_ptr MakeArray(const std::shared_ptr& data) { return out; } +// -- +// Misc APIs + +namespace internal { + +std::vector RechunkArraysConsistently( +const std::vector& groups) { + if (groups.size() <= 1) { +return groups; + } + // Adjacent slices defining the desired rechunking + std::vector> slices; + // Total number of elements common to all array groups + int64_t total_length = -1; + + { +// Compute a vector of slices such that each array spans +// one or more *entire* slices only +// e.g. if group #1 has bounds {0, 2, 4, 5, 10} +// and group #2 has bounds {0, 5, 7, 10} +// then the computed slices are +// {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)} +std::set bounds; +for (auto& group : groups) { + int64_t cur = 0; + bounds.insert(cur); + for (auto& array : group) { +cur += array->length(); +bounds.insert(cur); + } + if (total_length == -1) { +total_length = cur; + } else { +// XXX Should we return an error code instead? +DCHECK_EQ(total_length, cur) +<< "Array groups should have the same number of elements"; Review comment: Since this API is internal, it's not necessary. Reaching this code path would indicate an internal programming error by the Arrow developer. Should this code path ever be exposed in some way to user input, then returning an error code would make more sense This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382849#comment-16382849 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171722310 ## File path: cpp/src/arrow/array.cc ## @@ -772,6 +773,105 @@ std::shared_ptr MakeArray(const std::shared_ptr& data) { return out; } +// -- +// Misc APIs + +namespace internal { + +std::vector RechunkArraysConsistently( +const std::vector& groups) { + if (groups.size() <= 1) { +return groups; + } + // Adjacent slices defining the desired rechunking + std::vector> slices; + // Total number of elements common to all array groups + int64_t total_length = -1; + + { +// Compute a vector of slices such that each array spans +// one or more *entire* slices only +// e.g. if group #1 has bounds {0, 2, 4, 5, 10} +// and group #2 has bounds {0, 5, 7, 10} +// then the computed slices are +// {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)} +std::set bounds; +for (auto& group : groups) { + int64_t cur = 0; + bounds.insert(cur); + for (auto& array : group) { +cur += array->length(); +bounds.insert(cur); Review comment: The complexity of this code roughly O(ncolumns * log(num chunks)). The algorithm in `TableBatchReader::ReadNext` is linear-time -- where it's more complex than what's below may be a matter of opinion This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
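The bounds/slices computation described in the `RechunkArraysConsistently` code comment can be sketched standalone: given each group's chunk lengths, take the union of all cumulative bounds and slice between consecutive bounds (an illustrative helper, not the C++ internal API):

```python
import itertools

def consistent_slices(groups):
    """groups: list of lists of chunk lengths.

    Return (start, stop) slices such that every group's chunk boundaries
    fall on slice boundaries, so each chunk spans whole slices only.
    """
    bounds = set()
    for lengths in groups:
        # Cumulative chunk boundaries for this group, starting at 0
        bounds.update(itertools.accumulate(lengths, initial=0))
    b = sorted(bounds)
    return list(zip(b, b[1:]))

# Example from the code comment: bounds {0,2,4,5,10} and {0,5,7,10}
print(consistent_slices([[2, 2, 1, 5], [5, 2, 3]]))
# [(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)]
```

Every array in every group can then be re-sliced along these boundaries, which is what makes the resulting chunking consistent across groups.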
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382856#comment-16382856 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171725444 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) { return PushArray(result->data()); } +Status NumPyConverter::Visit(const StructType& type) { + std::vector sub_converters; + std::vector sub_arrays; + + { +PyAcquireGIL gil_lock; + +// Create converters for each struct type field +if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) { + return Status::TypeError("Expected struct array"); +} + +for (auto field : type.children()) { + PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str()); + if (tup == NULL) { +std::stringstream ss; +ss << "Missing field '" << field->name() << "' in struct array"; +return Status::TypeError(ss.str()); + } + PyArray_Descr* sub_dtype = + reinterpret_cast(PyTuple_GET_ITEM(tup, 0)); + DCHECK(PyArray_DescrCheck(sub_dtype)); + int offset = static_cast(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1))); + RETURN_IF_PYERROR(); + Py_INCREF(sub_dtype); /* PyArray_GetField() steals ref */ + PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset); + RETURN_IF_PYERROR(); + sub_arrays.emplace_back(sub_array); + sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */, field->type(), + use_pandas_null_sentinels_); +} + } + + std::vector groups; + + // Compute null bitmap and store it as a Null Array to include it + // in the rechunking below + { +int64_t null_count = 0; +if (mask_ != nullptr) { + RETURN_NOT_OK(InitNullBitmap()); + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); +} +auto null_data = ArrayData::Make(std::make_shared(), 
length_, + {null_bitmap_}, null_count, 0); +DCHECK_EQ(null_data->buffers.size(), 1); +groups.push_back({std::make_shared(null_data)}); + } + + // Convert child data + for (auto& converter : sub_converters) { +RETURN_NOT_OK(converter.Convert()); +groups.push_back(converter.result()); + } + // Ensure the different array groups are chunked consistently + groups = ::arrow::internal::RechunkArraysConsistently(groups); + + // Make struct array chunks by combining groups + size_t ngroups = groups.size(); + size_t chunk, nchunks = groups[0].size(); + for (chunk = 0; chunk < nchunks; chunk++) { +// Create struct array chunk and populate it +// First group has the null bitmaps as Null Arrays +auto null_data = groups[0][chunk]->data(); +DCHECK_EQ(null_data->type->id(), Type::NA); +DCHECK_EQ(null_data->buffers.size(), 1); + +auto arr_data = ArrayData::Make(type_, length_, null_data->null_count, 0); Review comment: Interacting with `data()->null_count` post-slicing can be hazardous, since it can be set to -1 as part of the slice operation. I just opened a bug https://issues.apache.org/jira/browse/ARROW-2244. I think you also need to preserve the `offset` from each `null_data` because it may be sliced. The ways in which this would fail from these bugs right now are pretty esoteric, but it will eventually happen -- I'm not sure off hand what's the best way to write unit tests for this. let me know if this is unclear as I can explain in more detail This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Conversion from Numpy struct array unimplemented > - > > Key: ARROW-2142 > URL: https://issues.apache.org/jira/browse/ARROW-2142 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > {code:python} > >>> arr = np.array([(1.5,)], dtype=np.dtype([('x', np.float32)])) > >>> arr > array([(1.5,)], dtype=[('x', '<f4')]) > >>> arr[0] > (1.5,) > >>> arr['x'] > array([1.5], dtype=float32) > >>> arr['x'][0] > 1.5 > >>> pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > pa.array(arr, type=pa.struct([pa.field('x', pa.float32())])) > File "array.pxi", line 177, in pyarrow.lib.array > File "error.pxi", line 77, in pyarrow.lib.check_status > File "error.pxi", line 85, in pyarrow.lib.check_status > ArrowNotImplementedError: > /home/antoine/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:1585 code: > converter.Convert() > NumPyConverter doesn't implement > conversion. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382848#comment-16382848 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171721221 ## File path: cpp/src/arrow/array.cc ## @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) { return out; } +// -- +// Misc APIs + +namespace internal { + +std::vector<ArrayVector> RechunkArraysConsistently( +const std::vector<ArrayVector>& groups) { + if (groups.size() <= 1) { +return groups; + } + // Adjacent slices defining the desired rechunking + std::vector<std::pair<int64_t, int64_t>> slices; + // Total number of elements common to all array groups + int64_t total_length = -1; + + { +// Compute a vector of slices such that each array spans +// one or more *entire* slices only +// e.g. if group #1 has bounds {0, 2, 4, 5, 10} +// and group #2 has bounds {0, 5, 7, 10} +// then the computed slices are +// {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)} +std::set<int64_t> bounds; +for (auto& group : groups) { Review comment: `const auto&` would be a bit more idiomatic This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
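The boundary-union idea in `RechunkArraysConsistently` can be sketched in a few lines of plain Python. This is an illustration only, not Arrow code; `consistent_slices` is a hypothetical name.

```python
def consistent_slices(*bound_sets):
    # Merge the chunk boundaries of several array groups into adjacent
    # (start, stop) slices; every original chunk then spans whole slices only.
    bounds = sorted(set().union(*bound_sets))
    return list(zip(bounds, bounds[1:]))

# The example from the review context: group #1 has bounds {0, 2, 4, 5, 10}
# and group #2 has bounds {0, 5, 7, 10}.
print(consistent_slices({0, 2, 4, 5, 10}, {0, 5, 7, 10}))
# [(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)]
```

Each group can then be rebuilt by zero-copy slicing its chunks along these shared bounds, which is what the C++ loop after the `bounds` computation does.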
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382852#comment-16382852 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171723407 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) { return PushArray(result->data()); } +Status NumPyConverter::Visit(const StructType& type) { + std::vector<NumPyConverter> sub_converters; + std::vector<OwnedRefNoGIL> sub_arrays; + + { +PyAcquireGIL gil_lock; + +// Create converters for each struct type field +if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) { + return Status::TypeError("Expected struct array"); +} + +for (auto field : type.children()) { + PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str()); Review comment: Does this function presume UTF-8 for the 2nd argument when the key is unicode? The C API docs don't say: https://docs.python.org/3/c-api/dict.html#c.PyDict_GetItemString This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382854#comment-16382854 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171724263 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) { return PushArray(result->data()); } +Status NumPyConverter::Visit(const StructType& type) { + std::vector<NumPyConverter> sub_converters; + std::vector<OwnedRefNoGIL> sub_arrays; + + { +PyAcquireGIL gil_lock; + +// Create converters for each struct type field +if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) { + return Status::TypeError("Expected struct array"); +} + +for (auto field : type.children()) { + PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str()); + if (tup == NULL) { +std::stringstream ss; +ss << "Missing field '" << field->name() << "' in struct array"; +return Status::TypeError(ss.str()); + } + PyArray_Descr* sub_dtype = + reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0)); + DCHECK(PyArray_DescrCheck(sub_dtype)); + int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1))); + RETURN_IF_PYERROR(); + Py_INCREF(sub_dtype); /* PyArray_GetField() steals ref */ + PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset); + RETURN_IF_PYERROR(); + sub_arrays.emplace_back(sub_array); + sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */, field->type(), + use_pandas_null_sentinels_); +} + } + + std::vector<ArrayVector> groups; + + // Compute null bitmap and store it as a Null Array to include it + // in the rechunking below + { +int64_t null_count = 0; +if (mask_ != nullptr) { + RETURN_NOT_OK(InitNullBitmap()); + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); +} +auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_, + {null_bitmap_}, null_count, 0); +DCHECK_EQ(null_data->buffers.size(), 1); +groups.push_back({std::make_shared<NullArray>(null_data)}); + } + + // Convert child data + for (auto& converter : sub_converters) { +RETURN_NOT_OK(converter.Convert()); +groups.push_back(converter.result()); + } + // Ensure the different array groups are chunked consistently + groups = ::arrow::internal::RechunkArraysConsistently(groups); + + // Make struct array chunks by combining groups + size_t ngroups = groups.size(); + size_t chunk, nchunks = groups[0].size(); + for (chunk = 0; chunk < nchunks; chunk++) { Review comment: Maybe declare `size_t chunk` here and remove it from the previous line, for readability. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
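The combining step under review, where struct chunk i is built from chunk i of every group and group 0 carries the validity data, can be sketched with plain lists standing in for Arrow arrays. `combine_chunks` is an illustrative name, not part of Arrow.

```python
def combine_chunks(groups):
    # groups[0] holds validity chunks; the remaining groups hold one child field each.
    null_group, *child_groups = groups
    nchunks = len(null_group)
    # RechunkArraysConsistently guarantees identical chunk layouts across groups.
    assert all(len(g) == nchunks for g in child_groups)
    return [{"validity": null_group[i], "children": [g[i] for g in child_groups]}
            for i in range(nchunks)]

chunks = combine_chunks([
    [[True, True], [False]],   # validity chunks
    [[1, 2], [3]],             # child field chunks, e.g. 'x'
    [["a", "b"], ["c"]],       # child field chunks, e.g. 'y'
])
print(len(chunks))  # 2
```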
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382855#comment-16382855 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171724042 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -1590,6 +1592,85 @@ Status NumPyConverter::Visit(const StringType& type) { return PushArray(result->data()); } +Status NumPyConverter::Visit(const StructType& type) { + std::vector<NumPyConverter> sub_converters; + std::vector<OwnedRefNoGIL> sub_arrays; + + { +PyAcquireGIL gil_lock; + +// Create converters for each struct type field +if (dtype_->fields == NULL || !PyDict_Check(dtype_->fields)) { + return Status::TypeError("Expected struct array"); +} + +for (auto field : type.children()) { + PyObject* tup = PyDict_GetItemString(dtype_->fields, field->name().c_str()); + if (tup == NULL) { +std::stringstream ss; +ss << "Missing field '" << field->name() << "' in struct array"; +return Status::TypeError(ss.str()); + } + PyArray_Descr* sub_dtype = reinterpret_cast<PyArray_Descr*>(PyTuple_GET_ITEM(tup, 0)); + DCHECK(PyArray_DescrCheck(sub_dtype)); + int offset = static_cast<int>(PyLong_AsLong(PyTuple_GET_ITEM(tup, 1))); + RETURN_IF_PYERROR(); + Py_INCREF(sub_dtype); /* PyArray_GetField() steals ref */ + PyObject* sub_array = PyArray_GetField(arr_, sub_dtype, offset); + RETURN_IF_PYERROR(); + sub_arrays.emplace_back(sub_array); + sub_converters.emplace_back(pool_, sub_array, nullptr /* mask */, + field->type(), use_pandas_null_sentinels_); +} + } + + std::vector<ArrayVector> groups; + + // Compute null bitmap and store it as a Null Array to include it + // in the rechunking below + { +int64_t null_count = 0; +if (mask_ != nullptr) { + RETURN_NOT_OK(InitNullBitmap()); + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); +} +auto null_data = ArrayData::Make(std::make_shared<NullType>(), length_, + {null_bitmap_}, null_count, 0); Review comment: You could use a boolean array (which is bit-packed) to make it less hacky. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
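The bit-packed layout behind this suggestion is easy to show in Python. This sketch packs a validity mask Arrow-style, least-significant bit first; `pack_validity_bitmap` is an illustrative name, not an Arrow API.

```python
def pack_validity_bitmap(mask):
    # Pack a list of booleans (True = valid) into LSB-first bitmap bytes,
    # the same bit order Arrow uses for validity bitmaps.
    out = bytearray((len(mask) + 7) // 8)
    for i, valid in enumerate(mask):
        if valid:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

bitmap = pack_validity_bitmap([True, False, True, True, False, True, True, True])
print(bitmap)  # b'\xed', i.e. 0b11101101
```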
[jira] [Commented] (ARROW-2142) [Python] Conversion from Numpy struct array unimplemented
[ https://issues.apache.org/jira/browse/ARROW-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382850#comment-16382850 ] ASF GitHub Bot commented on ARROW-2142: --- wesm commented on a change in pull request #1635: ARROW-2142: [Python] Allow conversion from Numpy struct array URL: https://github.com/apache/arrow/pull/1635#discussion_r171722648 ## File path: cpp/src/arrow/array.cc ## @@ -772,6 +773,105 @@ std::shared_ptr<Array> MakeArray(const std::shared_ptr<ArrayData>& data) { return out; } +// -- +// Misc APIs + +namespace internal { + +std::vector<ArrayVector> RechunkArraysConsistently( +const std::vector<ArrayVector>& groups) { + if (groups.size() <= 1) { +return groups; + } + // Adjacent slices defining the desired rechunking + std::vector<std::pair<int64_t, int64_t>> slices; + // Total number of elements common to all array groups + int64_t total_length = -1; + + { +// Compute a vector of slices such that each array spans +// one or more *entire* slices only +// e.g. if group #1 has bounds {0, 2, 4, 5, 10} +// and group #2 has bounds {0, 5, 7, 10} +// then the computed slices are +// {(0, 2), (2, 4), (4, 5), (5, 7), (7, 10)} +std::set<int64_t> bounds; +for (auto& group : groups) { + int64_t cur = 0; + bounds.insert(cur); + for (auto& array : group) { +cur += array->length(); +bounds.insert(cur); + } + if (total_length == -1) { +total_length = cur; + } else { +// XXX Should we return an error code instead? +DCHECK_EQ(total_length, cur) +<< "Array groups should have the same number of elements"; + } +} +if (total_length == 0) { + return groups; +} +auto it = bounds.cbegin(); +auto end = bounds.cend(); +int64_t start = *it; +while (++it != end) { + int64_t stop = *it; + DCHECK_GE(stop, start); + slices.emplace_back(start, stop); + start = stop; +} +DCHECK_EQ(slices.front().first, 0); +DCHECK_EQ(slices.back().second, total_length); + } + + // Rechunk each array group along the computed slices + std::vector<ArrayVector> rechunked_groups; + for (auto& group : groups) { +ArrayVector rechunked; +int64_t cur = 0; +auto slices_it = slices.cbegin(); +auto slices_end = slices.cend(); + +for (auto& array : group) { + int64_t array_start = cur, array_stop = cur + array->length(); Review comment: It's better for readability to put each assignment on its own line This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown
[ https://issues.apache.org/jira/browse/ARROW-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2244: Issue Type: Bug (was: Improvement) > [C++] Slicing NullArray should not cause the null count on the internal data > to be unknown > -- > > Key: ARROW-2244 > URL: https://issues.apache.org/jira/browse/ARROW-2244 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2244) [C++] Slicing NullArray should not cause the null count on the internal data to be unknown
Wes McKinney created ARROW-2244: --- Summary: [C++] Slicing NullArray should not cause the null count on the internal data to be unknown Key: ARROW-2244 URL: https://issues.apache.org/jira/browse/ARROW-2244 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.9.0 see https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.cc#L101 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
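The hazard ARROW-2244 describes can be modelled in a few lines. This is a toy model of `ArrayData`'s cached null count, not Arrow code: slicing resets the cache to the -1 "unknown" sentinel, so readers must use the recomputing accessor rather than the raw field.

```python
UNKNOWN_NULL_COUNT = -1

class FakeArrayData:
    """Toy stand-in for Arrow's ArrayData; names are illustrative."""

    def __init__(self, validity, null_count=None):
        self.validity = validity  # list of bools, True = valid
        self._null_count = (sum(not v for v in validity)
                            if null_count is None else null_count)

    def slice(self, offset, length):
        # Mirrors the Arrow behaviour: a slice cannot cheaply know its
        # null count up front, so the cache is set to the sentinel.
        return FakeArrayData(self.validity[offset:offset + length],
                             null_count=UNKNOWN_NULL_COUNT)

    def null_count(self):
        # Lazily recompute when the cached value is the -1 sentinel.
        if self._null_count == UNKNOWN_NULL_COUNT:
            self._null_count = sum(not v for v in self.validity)
        return self._null_count

data = FakeArrayData([True, False, True, False])
sliced = data.slice(1, 3)
print(sliced._null_count)   # -1: the raw field is a trap post-slice
print(sliced.null_count())  # 2: the accessor recomputes
```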
[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382819#comment-16382819 ] Phillip Cloud commented on ARROW-1940: -- Taking a look at this now. > [Python] Extra metadata gets added after multiple conversions between > pd.DataFrame and pa.Table > --- > > Key: ARROW-1940 > URL: https://issues.apache.org/jira/browse/ARROW-1940 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Dima Ryazanov >Assignee: Phillip Cloud >Priority: Minor > Fix For: 0.9.0 > > Attachments: fail.py > > > We have a unit test that verifies that loading a dataframe from a .parq file > and saving it back with no changes produces the same result as the original > file. It started failing with pyarrow 0.8.0. > After digging into it, I discovered that after the first conversion from > pd.DataFrame to pa.Table, the table contains the following metadata (among > other things): > {code} > "column_indexes": [{"metadata": null, "field_name": null, "name": null, > "numpy_type": "object", "pandas_type": "bytes"}] > {code} > However, after converting it to pd.DataFrame and back into a pa.Table for the > second time, the metadata gets an encoding field: > {code} > "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, > "name": null, "numpy_type": "object", "pandas_type": "unicode"}] > {code} > See the attached file for a test case. > So specifically, it appears that dataframe->table->dataframe->table > conversion produces a different result from just dataframe->table - which I > think is unexpected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382814#comment-16382814 ] ASF GitHub Bot commented on ARROW-2238: --- MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369760475 @pitrou, do you have an idea how to verify that clcache.exe was really used during compilation? I've tried with it and without, but I can't find any difference in output/produced results. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2243) [C++] Enable IPO/LTO
[ https://issues.apache.org/jira/browse/ARROW-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Phillip Cloud updated ARROW-2243: - Fix Version/s: (was: 0.9.0) > [C++] Enable IPO/LTO > > > Key: ARROW-2243 > URL: https://issues.apache.org/jira/browse/ARROW-2243 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Minor > > We should enable interprocedural/link-time optimization. CMake >= 3.9.4 > supports a generic way of doing this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382793#comment-16382793 ] Philipp Moritz commented on ARROW-2237: --- Was this on Travis or on your local machine? > [Python] Huge tables test failure > - > > Key: ARROW-2237 > URL: https://issues.apache.org/jira/browse/ARROW-2237 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.9.0 > > > This is a new failure here (Ubuntu 16.04, x86-64): > {code} > _ test_use_huge_pages > _ > Traceback (most recent call last): > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, > in test_use_huge_pages > create_object(plasma_client, 1) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in > create_object > seal=seal) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in > create_object_with_id > memory_buffer = client.create(object_id, data_size, metadata) > File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create > File "error.pxi", line 79, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 > code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer) > /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, > &type, buffer) > Encountered unexpected EOF > Captured stderr call > - > Allowing the Plasma store to use up to 0.1GB of memory. > Starting object store with directory /mnt/hugepages and huge page support > enabled > create_buffer failed to open file /mnt/hugepages/plasmapSNc0X > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2177) [C++] Remove support for specifying negative scale values in DecimalType
[ https://issues.apache.org/jira/browse/ARROW-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2177. - Resolution: Fixed Resolved as part of ARROW-2145 https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278 > [C++] Remove support for specifying negative scale values in DecimalType > > > Key: ARROW-2177 > URL: https://issues.apache.org/jira/browse/ARROW-2177 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > > Allowing both negative and positive scale makes it ambiguous what the scale > of a number should be when using exponential notation, e.g., {{0.01E3}}. > Should that have a precision of 4 and a scale of 2, since it's specified with 2 > digits to the right of the decimal point and evaluates to 10? Or a precision of > 1 and a scale of -1? > Currently it's the latter, but I think it should be the former. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
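The {{0.01E3}} ambiguity is visible with the stdlib decimal module, which normalizes the value to a single significant digit with a positive exponent; under the "scale = -exponent" convention that gives precision 1 and scale -1.

```python
from decimal import Decimal

d = Decimal('0.01E3')
sign, digits, exponent = d.as_tuple()  # (0, (1,), 1)
precision = len(digits)  # 1
scale = -exponent        # -1
print(d, precision, scale)  # 1E+1 1 -1
```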
[jira] [Resolved] (ARROW-2160) [C++/Python] Fix decimal precision inference
[ https://issues.apache.org/jira/browse/ARROW-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2160. - Resolution: Fixed Resolved as part of ARROW-2145 https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278 > [C++/Python] Fix decimal precision inference > > > Key: ARROW-2160 > URL: https://issues.apache.org/jira/browse/ARROW-2160 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {code} > import pyarrow as pa > import pandas as pd > import decimal > df = pd.DataFrame({'a': [decimal.Decimal('0.1'), decimal.Decimal('0.01')]}) > pa.Table.from_pandas(df) > {code} > raises: > {code} > pyarrow.lib.ArrowInvalid: Decimal type with precision 2 does not fit into > precision inferred from first array element: 1 > {code} > It looks like Arrow is inferring the highest precision for a given column based on the > first cell and expecting the rest to fit in. I understand this is by design, but > from the point of view of pandas-arrow compatibility it is quite painful, as > pandas is more flexible (as demonstrated). > What this means is that a user trying to pass a pandas {{DataFrame}} with > {{Decimal}} column(s) to an arrow {{Table}} would always have to first: > # Find the highest precision used in (each of) that column(s) > # Adjust the first cell of (each of) that column(s) so that it explicitly > uses the highest precision of that column(s) > # Only then pass such a {{DataFrame}} to {{Table.from_pandas()}} > So given this unavoidable procedure (and assuming Arrow needs to be strict > about the highest precision for a column) - shouldn't some similar logic be > part of {{Table.from_pandas()}} directly to make this transparent? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
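The column-wide scan the reporter asks for can be sketched with the stdlib decimal module; the inference rule below is an approximation for illustration, not Arrow's exact logic.

```python
from decimal import Decimal

def infer_precision_scale(values):
    # Take the maximum (precision, scale) over the whole column instead of
    # trusting the first element.
    precision = scale = 0
    for v in values:
        _, digits, exponent = v.as_tuple()
        s = max(-exponent, 0)    # digits to the right of the decimal point
        p = max(len(digits), s)  # total significant digits needed
        precision, scale = max(precision, p), max(scale, s)
    return precision, scale

# First element alone suggests precision 1; the column needs precision 2,
# matching the mismatch in the error message above.
print(infer_precision_scale([Decimal('0.1'), Decimal('0.01')]))  # (2, 2)
```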
[jira] [Resolved] (ARROW-2157) [Python] Decimal arrays cannot be constructed from Python lists
[ https://issues.apache.org/jira/browse/ARROW-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2157. - Resolution: Fixed Resolved as part of ARROW-2145 https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278 > [Python] Decimal arrays cannot be constructed from Python lists > --- > > Key: ARROW-2157 > URL: https://issues.apache.org/jira/browse/ARROW-2157 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Fix For: 0.9.0 > > > {code} > In [14]: pa.array([Decimal('1')]) > --- > ArrowInvalid Traceback (most recent call last) > in () > > 1 pa.array([Decimal('1')]) > array.pxi in pyarrow.lib.array() > array.pxi in pyarrow.lib._sequence_to_array() > error.pxi in pyarrow.lib.check_status() > ArrowInvalid: Error inferring Arrow data type for collection of Python > objects. Got Python object of type Decimal but can only handle these types: > bool, float, integer, date, datetime, bytes, unicode > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382770#comment-16382770 ] ASF GitHub Bot commented on ARROW-2145: --- wesm closed pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/.travis.yml b/.travis.yml index a4c74657e..b1241e793 100644 --- a/.travis.yml +++ b/.travis.yml @@ -174,7 +174,7 @@ matrix: - $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh script: - $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh - # [OS X] C++ & glib w/ XCode 8.3 & autotools + # [OS X] C++ & glib w/ XCode 8.3 & autotools & homebrew - compiler: clang osx_image: xcode8.3 os: osx @@ -185,7 +185,8 @@ matrix: - BUILD_SYSTEM=autotools before_script: - if [ $ARROW_CI_C_GLIB_AFFECTED != "1" ]; then exit; fi -- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library +- $TRAVIS_BUILD_DIR/ci/travis_install_osx.sh +- $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh --only-library --homebrew - $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh script: - $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh diff --git a/c_glib/Brewfile b/c_glib/Brewfile index 9fe5c3b61..955072e1e 100644 --- a/c_glib/Brewfile +++ b/c_glib/Brewfile @@ -16,7 +16,7 @@ # under the License. 
 brew "autoconf-archive"
-brew "boost"
+brew "boost", args: ["1.65.0"]
 brew "ccache"
 brew "cmake"
 brew "git"
diff --git a/ci/travis_before_script_c_glib.sh b/ci/travis_before_script_c_glib.sh
index 27d1e86fd..033fbd7c6 100755
--- a/ci/travis_before_script_c_glib.sh
+++ b/ci/travis_before_script_c_glib.sh
@@ -21,9 +21,7 @@ set -ex
 source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh
-if [ $TRAVIS_OS_NAME = "osx" ]; then
-  brew update && brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile
-else # Linux
+if [ $TRAVIS_OS_NAME = "linux" ]; then
   sudo apt-get install -y -q gtk-doc-tools autoconf-archive libgirepository1.0-dev
 fi
diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh
index 17b5deb36..b9afbee78 100755
--- a/ci/travis_before_script_cpp.sh
+++ b/ci/travis_before_script_cpp.sh
@@ -22,10 +22,22 @@ set -ex
 source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh
-if [ "$1" == "--only-library" ]; then
-  only_library_mode=yes
-else
-  only_library_mode=no
+only_library_mode=no
+using_homebrew=no
+
+while true; do
+  case "$1" in
+    --only-library)
+      only_library_mode=yes
+      shift ;;
+    --homebrew)
+      using_homebrew=yes
+      shift ;;
+    *) break ;;
+  esac
+done
+
+if [ "$only_library_mode" == "no" ]; then
   source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh
 fi
@@ -78,6 +90,10 @@ if [ $TRAVIS_OS_NAME == "linux" ]; then
       -DBUILD_WARNING_LEVEL=$ARROW_BUILD_WARNING_LEVEL \
       $ARROW_CPP_DIR
 else
+  if [ "$using_homebrew" = "yes" ]; then
+    # build against homebrew's boost if we're using it
+    export BOOST_ROOT=/usr/local/opt/boost
+  fi
   cmake $CMAKE_COMMON_FLAGS \
       $CMAKE_OSX_FLAGS \
       -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
diff --git a/ci/travis_build_parquet_cpp.sh b/ci/travis_build_parquet_cpp.sh
index 7d2e3ab73..f64a85d62 100755
--- a/ci/travis_build_parquet_cpp.sh
+++ b/ci/travis_build_parquet_cpp.sh
@@ -38,7 +38,7 @@ cmake \
     -GNinja \
     -DCMAKE_BUILD_TYPE=debug \
     -DCMAKE_INSTALL_PREFIX=$ARROW_PYTHON_PARQUET_HOME \
-    -DPARQUET_BOOST_USE_SHARED=off \
+    -DPARQUET_BOOST_USE_SHARED=on \
     -DPARQUET_BUILD_BENCHMARKS=off \
     -DPARQUET_BUILD_EXECUTABLES=off \
     -DPARQUET_BUILD_TESTS=off \
diff --git a/ci/travis_install_linux.sh b/ci/travis_install_linux.sh
index acee9ebcb..74fde2774 100755
--- a/ci/travis_install_linux.sh
+++ b/ci/travis_install_linux.sh
@@ -19,7 +19,7 @@
 sudo apt-get install -y -q \
     gdb ccache libboost-dev libboost-filesystem-dev \
-    libboost-system-dev libjemalloc-dev
+    libboost-system-dev libboost-regex-dev libjemalloc-dev
 if [ "$ARROW_TRAVIS_VALGRIND" == "1" ]; then
     sudo apt-get install -y -q valgrind
diff --git a/ci/travis_install_osx.sh b/ci/travis_install_osx.sh
new file mode 100755
index 0..b03a5f16a
--- /dev/null
+++ b/ci/travis_install_osx.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to
[jira] [Resolved] (ARROW-2153) [C++/Python] Decimal conversion not working for exponential notation
[ https://issues.apache.org/jira/browse/ARROW-2153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2153. - Resolution: Fixed Resolved as part of ARROW-2145 https://github.com/apache/arrow/commit/bfac60dd73bffa5f7bcefc890486268036182278 > [C++/Python] Decimal conversion not working for exponential notation > > > Key: ARROW-2153 > URL: https://issues.apache.org/jira/browse/ARROW-2153 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:java} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('2E+1')]})) > {code} > > {code:java} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File > "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 350, in dataframe_to_arrays > convert_types)] > File > "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 349, in > for c, t in zip(columns_to_convert, > File > "/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", > line 345, in convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270) > pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found > 'E' instead. 
> {code} > In manual cases clearly we can write {{decimal.Decimal('20')}} instead of > {{decimal.Decimal('2E+1')}} but during arithmetical operations inside an > application the exponential notation can be produced out of control (it is > actually the _normalized_ form of the decimal number) plus for some values > the exponential notation is the only form expressing the significance so this > should be accepted. > The [documentation|https://docs.python.org/3/library/decimal.html] suggests > using following transformation but that's only possible when the significance > information doesn't need to be kept: > {code:java} > def remove_exponent(d): > return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
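The {{remove_exponent}} transformation quoted from the Python documentation is pure standard-library code, so it can be exercised on the exact values from the report without pyarrow at all:

```python
from decimal import Decimal

def remove_exponent(d):
    """Rewrite a Decimal in exponential notation as a plain number.

    As the decimal docs note, the exponent is dropped only when the
    value is integral, i.e. when no significance information is lost.
    """
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()

# '2E+1' is the normalized form of 20; pyarrow 0.8 rejected the 'E'.
print(remove_exponent(Decimal('2E+1')))  # 20
print(remove_exponent(Decimal('1.1')))   # 1.1
```

As the reporter points out, this is only usable when the significance carried by the exponent does not need to be preserved.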
[jira] [Resolved] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2145. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1651 [https://github.com/apache/arrow/pull/1651] > [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
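Until the fix is available, one stdlib-only workaround is to map non-finite decimals to explicit nulls before handing the column to Arrow, since Arrow's decimal type has no NaN or infinity representation. {{mask_special_decimals}} below is a hypothetical helper, not pyarrow API:

```python
from decimal import Decimal

def mask_special_decimals(values):
    # Hypothetical pre-conversion step: replace NaN and +/-Infinity
    # (anything non-finite) with None so Arrow stores them as nulls.
    return [None if not v.is_finite() else v for v in values]

masked = mask_special_decimals(
    [Decimal('1.1'), Decimal('NaN'), Decimal('Infinity')])
print(masked)  # [Decimal('1.1'), None, None]
```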
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382763#comment-16382763 ] ASF GitHub Bot commented on ARROW-2145: --- wesm commented on issue #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#issuecomment-369752375 Sweet, here is the Appveyor build: https://ci.appveyor.com/project/cpcloud/arrow/build/1.0.587. Going to take a quick look through and then merge This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382750#comment-16382750 ] ASF GitHub Bot commented on ARROW-2135: --- cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array URL: https://github.com/apache/arrow/pull/1681#discussion_r171711960 ## File path: cpp/src/arrow/python/numpy_to_arrow.cc ## @@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) { return null_count; } +class NumPyNullsConverter { + public: + /// Convert the given array's null values to a null bitmap. + /// The null bitmap is only allocated if null values are ever possible. + static Status Convert(MemoryPool* pool, PyArrayObject* arr, +bool use_pandas_null_sentinels, +std::shared_ptr* out_null_bitmap_, +int64_t* out_null_count) { +NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels); +RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter)); +*out_null_bitmap_ = converter.null_bitmap_; +*out_null_count = converter.null_count_; +return Status::OK(); + } + + template + Status Visit(PyArrayObject* arr) { +typedef internal::npy_traits traits; + +const bool null_sentinels_possible = +// Always treat Numpy's NaT as null +TYPE == NPY_DATETIME || Review comment: AFAIU There's no other way to interpret `NaT` other than `NULL` (unless there's a standard that defines it in a different way than "missing"). nan is part of the IEEE floating point specification (as I'm sure you know) and it has a different meaning than null. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] NaN values silently casted to int64 when passing explicit schema for > conversion in Table.from_pandas > - > > Key: ARROW-2135 > URL: https://issues.apache.org/jira/browse/ARROW-2135 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Matthew Gilbert >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value the > NaN is improperly cast. Since pandas casts these to floats, when converted to > a table the NaN is interpreted as an integer. This seems like a bug since a > known limitation in pandas (the inability to have null valued integers data) > is taking precedence over arrow's functionality to store these as an IntArray > with nulls. > > {code} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({"a":[1, 2, pd.np.NaN]}) > schema = pa.schema([pa.field("a", pa.int64(), nullable=True)]) > table = pa.Table.from_pandas(df, schema=schema) > table[0] > > chunk 0: > [ > 1, > 2, > -9223372036854775808 > ]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382735#comment-16382735 ] ASF GitHub Bot commented on ARROW-2135: --- cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array URL: https://github.com/apache/arrow/pull/1681#discussion_r171710346 ## File path: python/pyarrow/tests/test_convert_pandas.py ## @@ -501,6 +501,14 @@ def test_float_nulls(self): result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) +def test_float_nulls_to_ints(self): +# ARROW-2135 +df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]}) +schema = pa.schema([pa.field("a", pa.int16(), nullable=True)]) +table = pa.Table.from_pandas(df, schema=schema) +assert table[0].to_pylist() == [1, 2, None] +tm.assert_frame_equal(df, table.to_pandas()) Review comment: That's fine. Was just wondering. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] NaN values silently casted to int64 when passing explicit schema for > conversion in Table.from_pandas > - > > Key: ARROW-2135 > URL: https://issues.apache.org/jira/browse/ARROW-2135 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Matthew Gilbert >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value the > NaN is improperly cast. Since pandas casts these to floats, when converted to > a table the NaN is interpreted as an integer. This seems like a bug since a > known limitation in pandas (the inability to have null valued integers data) > is taking precedence over arrow's functionality to store these as an IntArray > with nulls. 
> > {code} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({"a":[1, 2, pd.np.NaN]}) > schema = pa.schema([pa.field("a", pa.int64(), nullable=True)]) > table = pa.Table.from_pandas(df, schema=schema) > table[0] > > chunk 0: > [ > 1, > 2, > -9223372036854775808 > ]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382734#comment-16382734 ] ASF GitHub Bot commented on ARROW-2135: --- cpcloud commented on a change in pull request #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array URL: https://github.com/apache/arrow/pull/1681#discussion_r171710263 ## File path: python/pyarrow/tests/test_convert_pandas.py ## @@ -501,6 +501,14 @@ def test_float_nulls(self): result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) +def test_float_nulls_to_ints(self): +# ARROW-2135 +df = pd.DataFrame({"a": [1.0, 2.0, pd.np.NaN]}) +schema = pa.schema([pa.field("a", pa.int16(), nullable=True)]) +table = pa.Table.from_pandas(df, schema=schema) +assert table[0].to_pylist() == [1, 2, None] +tm.assert_frame_equal(df, table.to_pandas()) Review comment: It looks like it's a hard cast: ``` In [7]: pa.array([1, 2, 3.190, np.nan], type=pa.int64()) Out[6]: [ 1, 2, 3, NA ] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] NaN values silently casted to int64 when passing explicit schema for > conversion in Table.from_pandas > - > > Key: ARROW-2135 > URL: https://issues.apache.org/jira/browse/ARROW-2135 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Matthew Gilbert >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value the > NaN is improperly cast. Since pandas casts these to floats, when converted to > a table the NaN is interpreted as an integer. 
This seems like a bug since a > known limitation in pandas (the inability to have null valued integers data) > is taking precedence over arrow's functionality to store these as an IntArray > with nulls. > > {code} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({"a":[1, 2, pd.np.NaN]}) > schema = pa.schema([pa.field("a", pa.int64(), nullable=True)]) > table = pa.Table.from_pandas(df, schema=schema) > table[0] > > chunk 0: > [ > 1, > 2, > -9223372036854775808 > ]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
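The bogus value in the example, -9223372036854775808, is exactly -2^63, the int64 minimum that falls out of hard-casting NaN to an integer. The null semantics the fix aims for can be sketched in plain Python ({{floats_to_nullable_ints}} is illustrative, not pyarrow API):

```python
import math

INT64_MIN = -2**63  # the sentinel seen in the report

def floats_to_nullable_ints(values):
    # Intended semantics: NaN becomes a null (None) instead of being
    # hard-cast to an arbitrary integer bit pattern.
    return [None if math.isnan(v) else int(v) for v in values]

print(INT64_MIN)                                        # -9223372036854775808
print(floats_to_nullable_ints([1.0, 2.0, float('nan')]))  # [1, 2, None]
```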
[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382733#comment-16382733 ] Alex Hagerman commented on ARROW-2242: -- I think these may be related? https://github.com/apache/arrow/issues/1677 > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2243) [C++] Enable IPO/LTO
Phillip Cloud created ARROW-2243: Summary: [C++] Enable IPO/LTO Key: ARROW-2243 URL: https://issues.apache.org/jira/browse/ARROW-2243 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.8.0 Reporter: Phillip Cloud Assignee: Phillip Cloud Fix For: 0.9.0 We should enable interprocedural/link-time optimization. CMake >= 3.9.4 supports a generic way of doing this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
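The generic mechanism in CMake >= 3.9 is the {{CheckIPOSupported}} module together with the {{CMAKE_INTERPROCEDURAL_OPTIMIZATION}} property; a minimal sketch of that mechanism, not the actual Arrow build change:

```cmake
cmake_minimum_required(VERSION 3.9)
project(ipo_demo CXX)

include(CheckIPOSupported)
check_ipo_supported(RESULT ipo_supported OUTPUT ipo_error)
if(ipo_supported)
  # Applies -flto (or the toolchain's equivalent) to all targets.
  set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
else()
  message(STATUS "IPO/LTO not supported: ${ipo_error}")
endif()
```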
[jira] [Commented] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception
[ https://issues.apache.org/jira/browse/ARROW-2240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382617#comment-16382617 ] Phillip Cloud commented on ARROW-2240: -- PR coming shortly. > [Python] Array initialization with leading numpy nan fails with exception > - > > Key: ARROW-2240 > URL: https://issues.apache.org/jira/browse/ARROW-2240 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Florian Jetter >Priority: Minor > > > Arrow initialization fails for string arrays with leading numpy NAN > {code:java} > import pyarrow as pa > import numpy as np > pa.array([np.nan, 'str']) > # Py3: ArrowException: Unknown error: must be real number, not str > # Py2: ArrowException: Unknown error: a float is required{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
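A stdlib-only workaround until the fix lands is to turn the leading NaN into an explicit null, so type inference sees the string element rather than a float. {{nan_to_none}} is a hypothetical helper, not pyarrow API:

```python
import math

def nan_to_none(seq):
    # Replace float NaN with None; other elements pass through
    # unchanged, so the string can drive type inference.
    return [None if isinstance(x, float) and math.isnan(x) else x
            for x in seq]

print(nan_to_none([float('nan'), 'str']))  # [None, 'str']
```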
[jira] [Comment Edited] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382568#comment-16382568 ] Chris Ellison edited comment on ARROW-2242 at 3/1/18 8:10 PM: -- Related ticket is not code-related, but workflow-related in terms of reading/writing binary data was (Author: leftscreencorner): Not code-related, but workflow related in terms of reading/writing binary data. > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382568#comment-16382568 ] Chris Ellison commented on ARROW-2242: -- Not code-related, but workflow related in terms of reading/writing binary data. > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
[ https://issues.apache.org/jira/browse/ARROW-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Ellison updated ARROW-2242: - Description: When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError due to it not creating chunked arrays. Reading each row group individually and then concatenating the tables works, however. {code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq x = pa.array(list('1' * 2**30)) demo = 'demo.parquet' def scenario(): t = pa.Table.from_arrays([x], ['x']) writer = pq.ParquetWriter(demo, t.schema) for i in range(2): writer.write_table(t) writer.close() pf = pq.ParquetFile(demo) # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647 t2 = pf.read() # Works, but note, there are 32 row groups, not 2 as suggested by: # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] t3 = pa.concat_tables(tables) scenario() {code} was: When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError due to it not creating chunked arrays. Reading each row group individually and then concatenating the tables works, however. 
{code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq x = pa.array(list('1' * 2**30)) demo = 'demo.parquet' def scenario(): t = pa.Table.from_arrays([x], ['x']) writer = pq.ParquetWriter(demo, t.schema) for i in range(2): writer.write_table(t) writer.close() pf = pq.ParquetFile(demo) # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647 t2 = pf.read() # Works, but note, there are 32 row groups, not 2 as suggested by: # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing #tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] #t3 = pa.concat_tables(tables) scenario() {code} > [Python] ParquetFile.read does not accommodate large binary data > - > > Key: ARROW-2242 > URL: https://issues.apache.org/jira/browse/ARROW-2242 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Chris Ellison >Priority: Major > Fix For: 0.9.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. 
> > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2242) [Python] ParquetFile.read does not accommodate large binary data
Chris Ellison created ARROW-2242: Summary: [Python] ParquetFile.read does not accommodate large binary data Key: ARROW-2242 URL: https://issues.apache.org/jira/browse/ARROW-2242 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Chris Ellison Fix For: 0.9.0 When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError due to it not creating chunked arrays. Reading each row group individually and then concatenating the tables works, however. {code:java} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq x = pa.array(list('1' * 2**30)) demo = 'demo.parquet' def scenario(): t = pa.Table.from_arrays([x], ['x']) writer = pq.ParquetWriter(demo, t.schema) for i in range(2): writer.write_table(t) writer.close() pf = pq.ParquetFile(demo) # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot contain more than 2147483646 bytes, have 2147483647 t2 = pf.read() # Works, but note, there are 32 row groups, not 2 as suggested by: # https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing #tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] #t3 = pa.concat_tables(tables) scenario() {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
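The 2147483646-byte cap in the error message is 2^31 - 2, consistent with BinaryArray storing 32-bit offsets; a back-of-the-envelope sketch (plain Python, helper name illustrative) of how many chunks the reader would need:

```python
# A single BinaryArray is capped at 2**31 - 2 data bytes -- the
# 2147483646 in the error message above.
BINARY_CHUNK_LIMIT = 2**31 - 2

def min_chunks(total_bytes, limit=BINARY_CHUNK_LIMIT):
    # Ceiling division: the fewest chunks that hold total_bytes of
    # binary data without overflowing any one array's offsets.
    return -(-total_bytes // limit)

print(BINARY_CHUNK_LIMIT)        # 2147483646
print(min_chunks(2 * 2**30))     # 2 GiB of data needs 2 chunks
```

The reproduction writes two 1 GiB row groups, i.e. 2147483648 bytes in total, which exceeds the single-array limit by two bytes, so the read has to produce a chunked array.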
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382538#comment-16382538 ] ASF GitHub Bot commented on ARROW-2145: --- cpcloud commented on issue #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#issuecomment-369709126 @wesm @pitrou this is passing on travis: https://travis-ci.org/cpcloud/arrow/builds/347872453 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382509#comment-16382509 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171667937 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Looks like it's pretty straightforward to go back and forth over that boundary https://github.com/pybind/pybind11/blob/master/docs/advanced/pycpp/object.rst#casting-back-and-forth This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382490#comment-16382490 ] ASF GitHub Bot commented on ARROW-2238: --- MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699817 I will try on my end as well > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed.
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382484#comment-16382484 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699582 > Also, probably, usage of RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK should solve the issue with the selected compiler being overwritten. Last I tried, it didn't seem to work. I might give it a try again...
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382481#comment-16382481 ] ASF GitHub Bot commented on ARROW-2238: --- MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369699161 @pitrou it seems that we already try to use `ccache` [there](https://github.com/apache/arrow/blob/master/cpp/CMakeLists.txt#L68) if it's present. I'm wondering whether it would make more sense to refactor the referenced lines and optionally use `clcache` for MSVC? Also, probably, usage of RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK should solve the issue with the selected compiler being overwritten. And it seems that starting from CMake 3.4.0 the [CXX_COMPILER_LAUNCHER](https://cmake.org/cmake/help/v3.4/prop_tgt/LANG_COMPILER_LAUNCHER.html#prop_tgt:%3CLANG%3E_COMPILER_LAUNCHER) target property is available, but we currently require a minimum CMake version of 3.2
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382451#comment-16382451 ] ASF GitHub Bot commented on ARROW-2232: --- wesm commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171657341 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: e.g. https://github.com/apache/arrow/blob/master/python/pyarrow/tests/pyarrow_cython_example.pyx
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382449#comment-16382449 ] ASF GitHub Bot commented on ARROW-2232: --- wesm commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171657138 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Currently, pyarrow has a _public_ Cython and C++ API. If pybind does not support creating a public C/C++ API for third-party libraries to expose their extension types to non-Python code, it is a non-starter
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382437#comment-16382437 ] ASF GitHub Bot commented on ARROW-2238: --- MaxRis commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369688088 @pitrou I will check, thanks
[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag
[ https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382436#comment-16382436 ] Uwe L. Korn commented on ARROW-2241: Ah, got it! > [Python] Simple script for running all current ASV benchmarks at a commit or > tag > > > Key: ARROW-2241 > URL: https://issues.apache.org/jira/browse/ARROW-2241 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > The objective of this is to be able to get a graph for performance at each > release tag for the currently-defined benchmarks (including benchmarks that > did not exist in older tags)
[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag
[ https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382431#comment-16382431 ] Wes McKinney commented on ARROW-2241: - {{asv run}} does not build the C++ dependencies
[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382417#comment-16382417 ] ASF GitHub Bot commented on ARROW-2236: --- TheNeuralBit commented on a change in pull request #1683: ARROW-2236: [JS] Add more complete set of predicates URL: https://github.com/apache/arrow/pull/1683#discussion_r171646498 ## File path: js/test/unit/vector-tests.ts ## @@ -18,7 +18,7 @@ import { TextEncoder } from 'text-encoding-utf-8'; import Arrow from '../Arrow'; import { type, TypedArray, TypedArrayConstructor, Vector } from '../../src/Arrow'; -import { packBools } from '../../src/util/bit' Review comment: Yeah, good call. My syntax checker, [tsuquyomi](https://github.com/Quramy/tsuquyomi), complains about the `const { type, Vector } = Arrow;` approach, so I shied away from it, but the tests run just fine. > [JS] Add more complete set of predicates > > > Key: ARROW-2236 > URL: https://issues.apache.org/jira/browse/ARROW-2236 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > > Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and || > We should also support !=, <, > at the very least
[jira] [Commented] (ARROW-488) [Python] Implement conversion between integer coded as floating points with NaN to an Arrow integer type
[ https://issues.apache.org/jira/browse/ARROW-488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382404#comment-16382404 ] Antoine Pitrou commented on ARROW-488: -- Is this the same as ARROW-2135, or am I missing something here? > [Python] Implement conversion between integer coded as floating points with > NaN to an Arrow integer type > > > Key: ARROW-488 > URL: https://issues.apache.org/jira/browse/ARROW-488 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: Analytics > Fix For: 0.10.0 > > > For example: if pandas has cast integer data to float, this would enable > the integer data to be recovered (so long as the values fall in the ~2^53 > floating point range for exact integer representation)
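The requested conversion can be sketched in plain Python (a hypothetical helper, not pyarrow's implementation): floats that came from a null-promoted integer column are exact up to ±2**53, so each non-NaN value can be mapped back to an integer and NaN becomes a null.

```python
import math

def floats_to_nullable_ints(values):
    # Hypothetical helper: recover integers from a float column that
    # pandas null-promoted, mapping NaN to None (an Arrow null).
    out = []
    for v in values:
        if v is None or math.isnan(v):
            out.append(None)
        elif abs(v) <= 2.0 ** 53 and v == int(v):
            out.append(int(v))
        else:
            raise ValueError(f"{v!r} is not an exactly representable integer")
    return out

# 3.0 and 4.0 stand in for an int column pandas cast to float64.
print(floats_to_nullable_ints([3.0, float("nan"), 4.0]))  # [3, None, 4]
```

A real implementation would do this at the C++ level during conversion, but the bounds check is the same idea.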
[jira] [Commented] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize
[ https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382366 ] Antoine Pitrou commented on ARROW-1894: --- A memoryview has metadata associated with it (data type, shape, strides...). Should it be considered a Tensor instead? > [Python] Treat CPython memoryview or buffer objects equivalently to > pyarrow.Buffer in pyarrow.serialize > --- > > Key: ARROW-1894 > URL: https://issues.apache.org/jira/browse/ARROW-1894 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > These should be treated as Buffer-like on serialize. We should consider how > to "box" the buffers as the appropriate kind of object (Buffer, memoryview, > etc.) when being deserialized
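The point about memoryview metadata can be seen with nothing but the stdlib: a memoryview exposes an element format, shape, and dimensionality — tensor-like metadata rather than a flat run of bytes, which is what motivates the Tensor-vs-Buffer question.

```python
import struct

# Six 4-byte ints viewed as a 2x3 matrix (assumes a 4-byte native int,
# which holds on all mainstream platforms).
data = bytearray(struct.pack("=6i", 1, 2, 3, 4, 5, 6))
mv = memoryview(data).cast("i", shape=[2, 3])

print(mv.format)    # 'i' (C int) -- the element data type
print(mv.shape)     # (2, 3)
print(mv.ndim)      # 2
print(mv.tolist())  # [[1, 2, 3], [4, 5, 6]]
```

A 1-D memoryview of bytes maps naturally to a Buffer; once `shape`/`strides` describe more than one dimension, the Tensor boxing arguably fits better.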
[jira] [Commented] (ARROW-2081) Hdfs client isn't fork-safe
[ https://issues.apache.org/jira/browse/ARROW-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382361#comment-16382361 ] Antoine Pitrou commented on ARROW-2081: --- For the record, if you want decent multiprocessing performance together with fork safety, I would suggest using the "forkserver" method, not "spawn". (Note the C libhdfs3 library isn't fork-safe, so no need to try it out IMHO :-)) > Hdfs client isn't fork-safe > --- > > Key: ARROW-2081 > URL: https://issues.apache.org/jira/browse/ARROW-2081 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Jim Crist >Priority: Major > > Given the following script: > > {code:java} > import multiprocessing as mp > import pyarrow as pa > def ls(h): > print("calling ls") > return h.ls("/tmp") > if __name__ == '__main__': > h = pa.hdfs.connect() > print("Using 'spawn'") > pool = mp.get_context('spawn').Pool(2) > results = pool.map(ls, [h, h]) > sol = h.ls("/tmp") > for r in results: > assert r == sol > print("'spawn' succeeded\n") > print("Using 'fork'") > pool = mp.get_context('fork').Pool(2) > results = pool.map(ls, [h, h]) > sol = h.ls("/tmp") > for r in results: > assert r == sol > print("'fork' succeeded") > {code} > > Results in the following output: > > {code:java} > $ python test.py > Using 'spawn' > calling ls > calling ls > 'spawn' succeeded > Using 'fork{code} > > The process then hangs, and I have to `kill -9` the forked worker processes. > > I'm unable to get the libhdfs3 driver to work, so I'm unsure if this is a > problem with libhdfs or just arrow's use of it (a quick google search didn't > turn up anything useful). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
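The "forkserver" suggestion above, adapted from the script in the report as a minimal sketch (the HDFS `ls` call is replaced by a trivial function here; forkserver is Unix-only):

```python
import multiprocessing as mp

def work(x):
    return x * x

if __name__ == "__main__":
    # 'forkserver' starts a clean helper process up front and forks
    # workers from *it*, so threads and locks created later in the
    # parent (e.g. by a JNI-backed hdfs client) are never inherited,
    # while avoiding most of the per-task startup cost of 'spawn'.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(2) as pool:
        print(pool.map(work, [1, 2, 3]))  # [1, 4, 9]
```

Note that, as in the original script, the connection handle itself still cannot be shared; each worker would open its own connection.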
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382338#comment-16382338 ] ASF GitHub Bot commented on ARROW-2145: --- pitrou commented on a change in pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#discussion_r171628747 ## File path: ci/travis_install_osx.sh ## @@ -0,0 +1,21 @@ +#!/usr/bin/env bash + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +brew update +brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile Review comment: Not really, though given the filename it might be better to avoid further mistakes :-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
> [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}.
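Until a fix lands, one workaround is to map non-finite decimals to None before handing the column to Arrow, so they become nulls instead of tripping the converter. A sketch (the helper name is invented; this is not part of pyarrow):

```python
import decimal

def sanitize_decimals(values):
    # Hypothetical pre-processing step: map the special decimal values
    # that trip the converter (NaN, sNaN, +/-Infinity) to None so that
    # Arrow stores them as nulls.
    out = []
    for v in values:
        if isinstance(v, decimal.Decimal) and not v.is_finite():
            out.append(None)
        else:
            out.append(v)
    return out

vals = [decimal.Decimal("1.1"), decimal.Decimal("NaN"), decimal.Decimal("Infinity")]
print(sanitize_decimals(vals))  # [Decimal('1.1'), None, None]
```

Whether NaN should round-trip as a null or be rejected outright is a separate design question for the actual fix.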
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382337#comment-16382337 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171628604 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: I'll open a JIRA if there isn't already one, and start a mailing list discussion. GitHub is getting a bit chatty.
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382333#comment-16382333 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171628259 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Thus completely hiding the fact that there's a `shared_ptr` in play from Python users.
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382335#comment-16382335 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171628429 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: To be clear, I'm advocating for the replacement of Cython with pybind11.
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382329#comment-16382329 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171628034 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: > If the constructor took the C++ shared_ptr as argument and checked its validity, you wouldn't need to sprinkle checks in the other methods/properties. With pybind the situation is even better, because it would allow us to have constructors for numpy arrays and python lists with the same API e.g., `pa.Tensor([1])`/`pa.Tensor(np.array([1]))` without having to deal with initialization by hand at all.
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382323#comment-16382323 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369660991 I think that's because `CMAKE_CXX_COMPILER` forcefully overrides the compiler command. When using the Visual Studio generators, you traditionally don't need to run `vcvarsall.bat` (presumably because cmake would hardcode the full compiler path), but then `clcache` fails to find the compiler. So it's possible that calling `vcvarsall.bat` is all that's needed here. But that would also change the workflow people may be accustomed to.
[jira] [Commented] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag
[ https://issues.apache.org/jira/browse/ARROW-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382321#comment-16382321 ] Uwe L. Korn commented on ARROW-2241: Isn't this what {{asv run}} is for?
[jira] [Created] (ARROW-2241) [Python] Simple script for running all current ASV benchmarks at a commit or tag
Wes McKinney created ARROW-2241: --- Summary: [Python] Simple script for running all current ASV benchmarks at a commit or tag Key: ARROW-2241 URL: https://issues.apache.org/jira/browse/ARROW-2241 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.9.0 The objective of this is to be able to get a graph for performance at each release tag for the currently-defined benchmarks (including benchmarks that did not exist in older tags)
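The script this issue asks for could look something like the following sketch. The build-script name is invented, and the `asv run <ref>^!` range syntax (a git range selecting a single commit) is an assumption about how one would benchmark exactly one ref; the key point is that each tag must be rebuilt first, since `asv run` does not build the C++ dependencies.

```python
import subprocess

TAGS = ["apache-arrow-0.8.0", "apache-arrow-0.9.0"]  # example tags

def plan_for_ref(ref):
    # Hypothetical command plan: rebuild the C++/Python tree at `ref`,
    # then benchmark just that commit with the *current* benchmark
    # definitions (which may include benchmarks newer than the tag).
    return [
        ["git", "checkout", ref],
        ["./build.sh"],              # assumed project build script
        ["asv", "run", f"{ref}^!"],  # assumed: git range for one commit
    ]

def run_all(refs, dry_run=True):
    plans = {ref: plan_for_ref(ref) for ref in refs}
    if not dry_run:
        for cmds in plans.values():
            for cmd in cmds:
                subprocess.check_call(cmd)
    return plans

print(run_all(TAGS)["apache-arrow-0.8.0"][0])  # ['git', 'checkout', 'apache-arrow-0.8.0']
```

`dry_run=True` keeps the sketch side-effect free; a real script would also need to handle checking the benchmark directory out from a fixed ref so the suite stays constant across tags.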
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382303#comment-16382303 ] ASF GitHub Bot commented on ARROW-2232: --- pitrou commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171623439 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Besides, aren't the hand-written checks mandated by the current constructor signature and the fact that you have to go through a classmethod to create a proper instance of each Cython wrapper class? If the constructor took the C++ `shared_ptr` as argument and checked its validity, you wouldn't need to sprinkle checks in the other methods/properties.
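The pattern pitrou describes — validate the wrapped pointer once at construction so every method can assume it is valid — can be shown in a plain-Python sketch (the real code is Cython holding a C++ `shared_ptr`; `None` stands in for NULL, and the class name is invented):

```python
class TensorWrapper:
    def __init__(self, handle):
        # Validate once; after this, no method needs a NULL check.
        if handle is None:
            raise ValueError(
                "TensorWrapper needs a valid handle; "
                "use a factory classmethod, not the bare constructor")
        self._handle = handle

    def __repr__(self):
        # Safe without a guard: __init__ guaranteed self._handle.
        return f"TensorWrapper({self._handle!r})"

print(TensorWrapper(42))  # repr works on a validated instance
try:
    TensorWrapper(None)   # the segfault becomes a clean Python error
except ValueError as exc:
    print("rejected:", exc)
```

This centralizes the check in one place instead of sprinkling `if self.tp is NULL` through `__repr__` and every property.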
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382295#comment-16382295 ] ASF GitHub Bot commented on ARROW-2232: --- pitrou commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171621956 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: I have experience with both Cython and the Python C API, and Cython is a much more reasonable choice to me. The effort spent on comparable features is easily 2x or 3x larger when writing C code against the CPython API (and the opportunity for bugs is also much higher, given you have to deal with refcounting and GC details by hand). Furthermore, Cython makes it easy to use high-level Python features that are a major pain to emulate in plain C. Just my 2 cents :-)
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382286#comment-16382286 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171620833 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: I would think there's much less boilerplate to write a pure C API + pybind, than to write a pure C API + C extensions. pybind [supports numpy](http://pybind11.readthedocs.io/en/stable/advanced/pycpp/numpy.html#) as well, hiding a lot of the complexity of the C APIs behind the guarantees provided by C++ RAII, objects, and templates. The pure C API would look the same regardless, it's really just a question of whether we want to take advantage of the convenience of pybind, or hand roll extensions where we would have to deal with reference counting and numpy's C API. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2237) [Python] Huge tables test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2237: Fix Version/s: 0.9.0 > [Python] Huge tables test failure > - > > Key: ARROW-2237 > URL: https://issues.apache.org/jira/browse/ARROW-2237 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.9.0 > > > This is a new failure here (Ubuntu 16.04, x86-64): > {code} > _ test_use_huge_pages > _ > Traceback (most recent call last): > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, > in test_use_huge_pages > create_object(plasma_client, 1) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in > create_object > seal=seal) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in > create_object_with_id > memory_buffer = client.create(object_id, data_size, metadata) > File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create > File "error.pxi", line 79, in pyarrow.lib.check_status > pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 > code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer) > /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, > &type, buffer) > Encountered unexpected EOF > Captured stderr call > - > Allowing the Plasma store to use up to 0.1GB of memory. > Starting object store with directory /mnt/hugepages and huge page support > enabled > create_buffer failed to open file /mnt/hugepages/plasmapSNc0X > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
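The traceback above shows the Plasma test failing because it hard-codes {{/mnt/hugepages}} while (per the comment thread) the actual hugetlbfs mount on that machine is {{/dev/hugepages}}. A sketch of discovering the mount point from {{/proc/mounts}} instead of hard-coding it — Linux-only, and the helper name is hypothetical:

```python
def find_hugetlbfs_mount(mounts_path="/proc/mounts"):
    """Return the first hugetlbfs mount point listed in /proc/mounts,
    or None if hugetlbfs is not mounted (Linux only)."""
    try:
        with open(mounts_path) as f:
            for line in f:
                fields = line.split()
                # /proc/mounts format: device mountpoint fstype options ...
                if len(fields) >= 3 and fields[2] == "hugetlbfs":
                    return fields[1]
    except OSError:
        pass
    return None
```

A test that probed the mount point this way (and skipped when it returns None) would avoid both the hard-coded path and the unexpected-EOF failure when the store cannot create its backing file.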
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382265#comment-16382265 ] ASF GitHub Bot commented on ARROW-2205: --- wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls URL: https://github.com/apache/arrow/pull/1650#issuecomment-369649945 Rebasing this again This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Option for integer object nulls > > > Key: ARROW-2205 > URL: https://issues.apache.org/jira/browse/ARROW-2205 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Albert Shieh >Assignee: Albert Shieh >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I have a use case where the loss of precision in casting integers to floats > matters, and pandas supports storing integers with nulls without loss of > precision in object columns. However, a roundtrip through arrow will cast the > object columns to float columns, even though the object columns are stored in > arrow as integers with nulls. > This is a minimal example demonstrating the behavior of a roundtrip: > {code} > import numpy as np > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({"a": np.array([None, 1], dtype=object)}) > df_pa = pa.Table.from_pandas(df).to_pandas() > print(df) > print(df_pa) > {code} > The output is: > {code} > a > 0 None > 1 1 > a > 0 NaN > 1 1.0 > {code} > This seems to be the desired behavior, given test_int_object_nulls in > test_convert_pandas. > I think it would be useful to add an option in the to_pandas methods to allow > integers with nulls to be returned as object columns. The option can default > to false in order to preserve the current behavior. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382257#comment-16382257 ] ASF GitHub Bot commented on ARROW-2238: --- wesm commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369648904 @MaxRis can take a look. How does the error you linked to arise? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382252#comment-16382252 ] ASF GitHub Bot commented on ARROW-2232: --- wesm commented on issue #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#issuecomment-369647759 Test suite is failing for some reason This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382249#comment-16382249 ] ASF GitHub Bot commented on ARROW-2232: --- wesm commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171613766 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: Is pybind really an option? It seems more likely we would migrate bindings to plain C extensions so that we can develop a more mature public C API for pyarrow This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2240) [Python] Array initialization with leading numpy nan fails with exception
Florian Jetter created ARROW-2240: - Summary: [Python] Array initialization with leading numpy nan fails with exception Key: ARROW-2240 URL: https://issues.apache.org/jira/browse/ARROW-2240 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.8.0 Reporter: Florian Jetter Arrow initialization fails for string arrays with leading numpy NAN {code:java} import pyarrow as pa import numpy as np pa.array([np.nan, 'str']) # Py3: ArrowException: Unknown error: must be real number, not str # Py2: ArrowException: Unknown error: a float is required{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
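The failure above arises because the leading {{np.nan}} is a float sitting in what is otherwise a string list, so conversion chokes on the mixed types. One pre-processing approach — a pure-Python sketch with a hypothetical helper name, not a pyarrow API — is to normalize float NaN entries to an explicit null marker before conversion:

```python
import math


def nan_to_none(values):
    """Replace float NaN entries with None so that mixed lists like
    [nan, 'str'] carry an unambiguous null marker instead of a float."""
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in values]
```

After this normalization the list contains only strings and None, which removes the type ambiguity the error message complains about.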
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382205#comment-16382205 ] ASF GitHub Bot commented on ARROW-2135: --- pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array URL: https://github.com/apache/arrow/pull/1681#issuecomment-369636633 AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.157 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] NaN values silently casted to int64 when passing explicit schema for > conversion in Table.from_pandas > - > > Key: ARROW-2135 > URL: https://issues.apache.org/jira/browse/ARROW-2135 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Matthew Gilbert >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value the > NaN is improperly cast. Since pandas casts these to floats, when converted to > a table the NaN is interpreted as an integer. This seems like a bug since a > known limitation in pandas (the inability to have null valued integers data) > is taking precedence over arrow's functionality to store these as an IntArray > with nulls. > > {code} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({"a":[1, 2, pd.np.NaN]}) > schema = pa.schema([pa.field("a", pa.int64(), nullable=True)]) > table = pa.Table.from_pandas(df, schema=schema) > table[0] > > chunk 0: > [ > 1, > 2, > -9223372036854775808 > ]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
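The bogus value in the report above, -9223372036854775808, is exactly INT64_MIN (-2^63) — the typical result of an unchecked float-to-int64 cast applied to NaN in C/C++. A sketch of the checked conversion the fix aims for, raising instead of emitting a sentinel (illustrative pure Python, not the actual patch):

```python
import math

# -9223372036854775808, the sentinel value seen in the bug report
INT64_MIN = -(2 ** 63)


def checked_float_to_int64(values):
    """Convert floats to ints, raising on NaN instead of silently
    producing garbage the way an unchecked C cast can."""
    out = []
    for v in values:
        if math.isnan(v):
            raise ValueError("cannot cast NaN to int64")
        out.append(int(v))
    return out
```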
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
[ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382185#comment-16382185 ] ASF GitHub Bot commented on ARROW-2232: --- cpcloud commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults URL: https://github.com/apache/arrow/pull/1682#discussion_r171598294 ## File path: python/pyarrow/array.pxi ## @@ -497,10 +497,15 @@ cdef class Tensor: self.type = pyarrow_wrap_data_type(self.tp.type()) def __repr__(self): +if self.tp is NULL: Review comment: This is really a stopgap until we can replace our Cython API with pybind11. Cython's inability to deal with `shared_ptr` is a huge burden right now. We have all these handwritten checks to make sure that an object is valid, which would be completely unnecessary if we moved to pybind. In any event, I'll add these checks here so we can get this merged. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] pyarrow.Tensor constructor segfaults > - > > Key: ARROW-2232 > URL: https://issues.apache.org/jira/browse/ARROW-2232 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the > interpreter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
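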
[jira] [Assigned] (ARROW-2239) [C++] Update build docs for Windows
[ https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-2239: - Assignee: Antoine Pitrou > [C++] Update build docs for Windows > --- > > Key: ARROW-2239 > URL: https://issues.apache.org/jira/browse/ARROW-2239 > Project: Apache Arrow > Issue Type: Task > Components: C++, Documentation >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Fix For: 0.9.0 > > > We should update the C++ build docs for Windows to recommend use of Ninja and > clcache for faster builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382179#comment-16382179 ] ASF GitHub Bot commented on ARROW-2145: --- cpcloud commented on a change in pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#discussion_r171597154 ## File path: ci/travis_install_osx.sh ## @@ -0,0 +1,21 @@ +#!/usr/bin/env bash + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +brew update +brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile Review comment: @pitrou This is already conditioned on in `.travis.yml` just before this script is called. Is it really necessary to condition on it again? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
[ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382155#comment-16382155 ] ASF GitHub Bot commented on ARROW-2145: --- cpcloud commented on a change in pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values URL: https://github.com/apache/arrow/pull/1651#discussion_r171594338 ## File path: ci/travis_install_osx.sh ## @@ -0,0 +1,21 @@ +#!/usr/bin/env bash + +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +brew update +brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile Review comment: Yes This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Decimal conversion not working for NaN values > -- > > Key: ARROW-2145 > URL: https://issues.apache.org/jira/browse/ARROW-2145 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Antony Mayi >Assignee: Phillip Cloud >Priority: Major > Labels: pull-request-available > > {code:python} > import pyarrow as pa > import pandas as pd > import decimal > pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), > decimal.Decimal('NaN')]})) > {code} > throws following exception: > {code} > Traceback (most recent call last): > File "", line 1, in > File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927) > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in > dataframe_to_arrays > convert_types)] > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in > > for c, t in zip(columns_to_convert, > File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in > convert_column > return pa.array(col, from_pandas=True, type=ty) > File "pyarrow/array.pxi", line 170, in pyarrow.lib.array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224) > File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465) > File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068) > pyarrow.lib.ArrowException: Unknown error: an integer is required (got type > str) > {code} > Same problem with other special decimal values like {{infinity}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2239) [C++] Update build docs for Windows
Antoine Pitrou created ARROW-2239: - Summary: [C++] Update build docs for Windows Key: ARROW-2239 URL: https://issues.apache.org/jira/browse/ARROW-2239 Project: Apache Arrow Issue Type: Task Components: C++, Documentation Reporter: Antoine Pitrou Fix For: 0.9.0 We should update the C++ build docs for Windows to recommend use of Ninja and clcache for faster builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2020) [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit timestamps
[ https://issues.apache.org/jira/browse/ARROW-2020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382116#comment-16382116 ] Antoine Pitrou commented on ARROW-2020: --- Ok. Here the changeset does "fix" the crash somehow, but it still produces bogus results. This issue might be related to ARROW-2026, in that when you pass {{coerce_timestamps}}, {{write_table}} seems to save the timestamps as int64 rather than int96. > [Python] Parquet segfaults if coercing ns timestamps and writing 96-bit > timestamps > -- > > Key: ARROW-2020 > URL: https://issues.apache.org/jira/browse/ARROW-2020 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: OS: Mac OS X 10.13.2 > Python: 3.6.4 > PyArrow: 0.8.0 >Reporter: Diego Argueta >Priority: Major > Labels: timestamps > Fix For: 0.9.0 > > Attachments: crash-report.txt > > > If you try to write a PyArrow table containing nanosecond-resolution > timestamps to Parquet using `coerce_timestamps` and > `use_deprecated_int96_timestamps=True`, the Arrow library will segfault. > The crash doesn't happen if you don't coerce the timestamp resolution or if > you don't use 96-bit timestamps. > > > *To Reproduce:* > > {code:java} > > import datetime > import pyarrow > from pyarrow import parquet > schema = pyarrow.schema([ > pyarrow.field('last_updated', pyarrow.timestamp('ns')), > ]) > data = [ > pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('ns')), > ] > table = pyarrow.Table.from_arrays(data, ['last_updated']) > with open('test_file.parquet', 'wb') as fdesc: > parquet.write_table(table, fdesc, > coerce_timestamps='us', # 'ms' works too > use_deprecated_int96_timestamps=True){code} > > See attached file for the crash report. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
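The comment above concerns {{coerce_timestamps}} interacting badly with int96 storage. The coercion step itself is simple truncating division from nanoseconds down to the requested unit; a sketch of that arithmetic (illustrative only — the real conversion lives in the C++ Parquet writer):

```python
def coerce_ns_timestamp(ns, unit):
    """Truncate a nanosecond timestamp to a coarser unit, the arithmetic
    behind options like coerce_timestamps='us' or coerce_timestamps='ms'."""
    divisors = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}
    # Floor division truncates sub-unit precision, which is the whole
    # point of coercing: the value must then be stored in the coarser
    # unit's encoding, not reinterpreted as int96 nanoseconds.
    return ns // divisors[unit]
```

The bug described above is consistent with the coerced int64 value being written while the file still advertises (or the reader still expects) int96 semantics.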
[jira] [Closed] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns
[ https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn closed ARROW-2194. -- Resolution: Not A Problem > [Python] Pandas columns metadata incorrect for empty string columns > --- > > Key: ARROW-2194 > URL: https://issues.apache.org/jira/browse/ARROW-2194 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Florian Jetter >Priority: Minor > Fix For: 0.9.0 > > > The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas > DataFrame is unexpectedly {{float64}} > > {code} > import numpy as np > import pandas as pd > import pyarrow as pa > import json > empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_), 'bytes': > np.array([], dtype=np.bytes_)}) > empty_table = pa.Table.from_pandas(empty_df) > json.loads(empty_table.schema.metadata[b'pandas'])['columns'] > # Same behavior for input dtype np.unicode_ > [{u'field_name': u'bytes', > u'metadata': None, > u'name': u'bytes', > u'numpy_type': u'object', > u'pandas_type': u'float64'}, > {u'field_name': u'unicode', > u'metadata': None, > u'name': u'unicode', > u'numpy_type': u'object', > u'pandas_type': u'float64'}, > {u'field_name': u'__index_level_0__', > u'metadata': None, > u'name': None, > u'numpy_type': u'int64', > u'pandas_type': u'int64'}]{code} > > Tested on Debian 8 with python2.7 and python 3.6.4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
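The {{float64}} metadata reported above follows from value-based type inference: with zero rows there are no values to inspect, so inference falls back to a default. A toy sketch of that fallback behavior — the function is hypothetical, written only to show why an empty object column can legitimately report {{float64}}:

```python
def infer_pandas_type(values):
    """Toy value-based type inference: with no values to inspect,
    fall back to 'float64', mirroring the metadata seen in the report."""
    if not values:
        return "float64"  # empty column: nothing to infer from
    if all(isinstance(v, str) for v in values):
        return "unicode"
    if all(isinstance(v, bytes) for v in values):
        return "bytes"
    return "object"
```

This is also why the issue was closed as "Not A Problem": for an empty column the inferred type is a convention, not a data corruption.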
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381897#comment-16381897 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369571901 Also I'm not sure whether we have a Windows developer on board; I'm merely launching a VM from time to time but otherwise work on Ubuntu :-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381895#comment-16381895 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369571559 The failure at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.155/job/q31movster4v84d9 shows this can lead to inconsistencies or errors: cmake first tries to detect the compiler from user-supplied information (generator, environment variables), then the clcache setting overrides that detection. Either we add logic to try and avoid such errors, or we simply let people override CC/CXX if they want to use clcache (status quo). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381881#comment-16381881 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou commented on issue #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684#issuecomment-369569373 Example AppVeyor build at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.155 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2238: -- Labels: pull-request-available (was: ) > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2238) [C++] Detect clcache in cmake configuration
[ https://issues.apache.org/jira/browse/ARROW-2238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381880#comment-16381880 ] ASF GitHub Bot commented on ARROW-2238: --- pitrou opened a new pull request #1684: ARROW-2238: [C++] Detect and use clcache in cmake configuration URL: https://github.com/apache/arrow/pull/1684 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Detect clcache in cmake configuration > --- > > Key: ARROW-2238 > URL: https://issues.apache.org/jira/browse/ARROW-2238 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2238) [C++] Detect clcache in cmake configuration
Antoine Pitrou created ARROW-2238: - Summary: [C++] Detect clcache in cmake configuration Key: ARROW-2238 URL: https://issues.apache.org/jira/browse/ARROW-2238 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Antoine Pitrou Assignee: Antoine Pitrou By default Windows builds should use clcache if installed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
    [ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381816#comment-16381816 ]

ASF GitHub Bot commented on ARROW-2135:
---------------------------------------

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-369552237

I addressed some review comments now.

> [Python] NaN values silently casted to int64 when passing explicit schema for
> conversion in Table.from_pandas
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-2135
>                 URL: https://issues.apache.org/jira/browse/ARROW-2135
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Matthew Gilbert
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the
> NaN is improperly cast. Since pandas casts these to floats, when converted to
> a table the NaN is interpreted as an integer. This seems like a bug since a
> known limitation in pandas (the inability to have null-valued integer data)
> is taking precedence over arrow's functionality to store these as an IntArray
> with nulls.
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a": [1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
>
> chunk 0:
> [
>   1,
>   2,
>   -9223372036854775808
> ]
> {code}
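For context on the bug above: the sentinel -9223372036854775808 is simply what NaN becomes when a float buffer is reinterpreted as int64 without a validity bitmap. The pure-Python sketch below (not pyarrow's actual implementation; the helper name is hypothetical) shows the behavior the issue asks for, where NaN slots are tracked in a separate validity mask instead of being silently frozen into the data:

```python
import math

INT64_MIN = -2**63  # the sentinel visible in the report: -9223372036854775808

def floats_to_nullable_ints(values):
    """Convert floats to (ints, validity); NaN slots are marked null."""
    ints, validity = [], []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            ints.append(INT64_MIN)   # placeholder value, masked by validity
            validity.append(False)
        else:
            ints.append(int(v))
            validity.append(True)
    return ints, validity

ints, validity = floats_to_nullable_ints([1.0, 2.0, float("nan")])
print(ints)      # [1, 2, -9223372036854775808]
print(validity)  # [True, True, False]
```

A consumer that respects the validity mask never sees the placeholder, which is exactly the IntArray-with-nulls behavior the reporter expected.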
[jira] [Commented] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
    [ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381796#comment-16381796 ]

ASF GitHub Bot commented on ARROW-2232:
---------------------------------------

pitrou commented on a change in pull request #1682: ARROW-2232: [Python] pyarrow.Tensor constructor segfaults
URL: https://github.com/apache/arrow/pull/1682#discussion_r171513562

## File path: python/pyarrow/array.pxi

@@ -497,10 +497,15 @@ cdef class Tensor:
         self.type = pyarrow_wrap_data_type(self.tp.type())

     def __repr__(self):
+        if self.tp is NULL:

Review comment: Having `__repr__` raise isn't really nice, because it breaks debugging. It would be better to return something like ``. Also you probably want to protect other methods, and raise there if the object isn't initialized.

> [Python] pyarrow.Tensor constructor segfaults
> ---------------------------------------------
>
>                 Key: ARROW-2232
>                 URL: https://issues.apache.org/jira/browse/ARROW-2232
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Phillip Cloud
>            Assignee: Phillip Cloud
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
> {{pa.Tensor()}}, {{pa.Tensor([])}}, and {{pa.Tensor([1.0])}} all crash the
> interpreter.
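The review suggestion above can be sketched in plain Python: `__repr__` returns a placeholder for a half-initialized wrapper (so debuggers keep working), while other methods raise explicitly. This is a hypothetical stand-in for the Cython class, not pyarrow's actual code; all names here are illustrative.

```python
class Tensor:
    def __init__(self):
        # Mimics a Cython wrapper whose underlying C++ pointer may be unset.
        self._tp = None

    def _init(self, data):
        self._tp = list(data)

    def __repr__(self):
        if self._tp is None:
            # Return a placeholder rather than raising: repr() is called by
            # debuggers and tracebacks, so it must never fail.
            return "<Tensor (uninitialized)>"
        return "<Tensor data={!r}>".format(self._tp)

    def shape(self):
        # Non-repr methods, by contrast, raise if the object is unusable.
        if self._tp is None:
            raise ValueError("Tensor not initialized")
        return (len(self._tp),)

t = Tensor()
print(repr(t))  # <Tensor (uninitialized)>
```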
[jira] [Updated] (ARROW-2232) [Python] pyarrow.Tensor constructor segfaults
    [ https://issues.apache.org/jira/browse/ARROW-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2232:
----------------------------------
    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
    [ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381773#comment-16381773 ]

ASF GitHub Bot commented on ARROW-2135:
---------------------------------------

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171509916

## File path: cpp/src/arrow/python/numpy_to_arrow.cc

@@ -113,6 +145,55 @@ inline int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) {
   return null_count;
 }

+class NumPyNullsConverter {
+ public:
+  /// Convert the given array's null values to a null bitmap.
+  /// The null bitmap is only allocated if null values are ever possible.
+  static Status Convert(MemoryPool* pool, PyArrayObject* arr,
+                        bool use_pandas_null_sentinels,
+                        std::shared_ptr<Buffer>* out_null_bitmap_,
+                        int64_t* out_null_count) {
+    NumPyNullsConverter converter(pool, arr, use_pandas_null_sentinels);
+    RETURN_NOT_OK(VisitNumpyArrayInline(arr, &converter));
+    *out_null_bitmap_ = converter.null_bitmap_;
+    *out_null_count = converter.null_count_;
+    return Status::OK();
+  }
+
+  template <int TYPE>
+  Status Visit(PyArrayObject* arr) {
+    typedef internal::npy_traits<TYPE> traits;
+
+    const bool null_sentinels_possible =
+        // Always treat Numpy's NaT as null
+        TYPE == NPY_DATETIME ||

Review comment: By the way, I don't know what that is, but this is required to have the tests pass. Why do we always treat NaT as null but not floating-point NaN? @wesm
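The asymmetry questioned above hinges on how the sentinels are detected. For floating point, the traits classes rely on the IEEE-754 property that NaN compares unequal to itself (`v != v`), which also holds in Python; treating NaN as null is opt-in (`use_pandas_null_sentinels`) because a NaN can be a legitimate data value, while NaT has no meaning other than "missing". A minimal sketch of the float check:

```python
def isnull_float(v):
    # Mirrors the C++ float traits' NaN test: only NaN is unequal to itself.
    return v != v

print(isnull_float(float("nan")))  # True
print(isnull_float(1.5))           # False
```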
[jira] [Commented] (ARROW-2237) [Python] Huge tables test failure
    [ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381767#comment-16381767 ]

Antoine Pitrou commented on ARROW-2237:
---------------------------------------

[~pcmoritz]
[jira] [Created] (ARROW-2237) [Python] Huge tables test failure
Antoine Pitrou created ARROW-2237:
----------------------------------

             Summary: [Python] Huge tables test failure
                 Key: ARROW-2237
                 URL: https://issues.apache.org/jira/browse/ARROW-2237
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Antoine Pitrou

This is a new failure here (Ubuntu 16.04, x86-64):

{code}
_______________________________ test_use_huge_pages _______________________________
Traceback (most recent call last):
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, in test_use_huge_pages
    create_object(plasma_client, 1)
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in create_object
    seal=seal)
  File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in create_object_with_id
    memory_buffer = client.create(object_id, data_size, metadata)
  File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, &buffer)
/home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, &type, buffer)
Encountered unexpected EOF

------------------------------- Captured stderr call -------------------------------
Allowing the Plasma store to use up to 0.1GB of memory.
Starting object store with directory /mnt/hugepages and huge page support enabled
create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
{code}
[jira] [Commented] (ARROW-2145) [Python] Decimal conversion not working for NaN values
    [ https://issues.apache.org/jira/browse/ARROW-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381745#comment-16381745 ]

ASF GitHub Bot commented on ARROW-2145:
---------------------------------------

pitrou commented on a change in pull request #1651: ARROW-2145/ARROW-2153/ARROW-2157/ARROW-2160/ARROW-2177: [Python] Decimal conversion not working for NaN values
URL: https://github.com/apache/arrow/pull/1651#discussion_r171504185

## File path: ci/travis_install_osx.sh

@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+brew update
+brew bundle --file=$TRAVIS_BUILD_DIR/c_glib/Brewfile

Review comment: Shouldn't that be conditioned on ARROW_CI_C_GLIB_AFFECTED?

> [Python] Decimal conversion not working for NaN values
> ------------------------------------------------------
>
>                 Key: ARROW-2145
>                 URL: https://issues.apache.org/jira/browse/ARROW-2145
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>            Reporter: Antony Mayi
>            Assignee: Phillip Cloud
>            Priority: Major
>              Labels: pull-request-available
>
> {code:python}
> import pyarrow as pa
> import pandas as pd
> import decimal
> pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'),
>                                          decimal.Decimal('NaN')]}))
> {code}
> throws the following exception:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas
>     (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 350, in dataframe_to_arrays
>     convert_types)]
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 349, in <listcomp>
>     for c, t in zip(columns_to_convert,
>   File "/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 345, in convert_column
>     return pa.array(col, from_pandas=True, type=ty)
>   File "pyarrow/array.pxi", line 170, in pyarrow.lib.array
>     (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
>   File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array
>     (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
>   File "pyarrow/error.pxi", line 98, in pyarrow.lib.check_status
>     (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:9068)
> pyarrow.lib.ArrowException: Unknown error: an integer is required (got type str)
> {code}
> Same problem with other special decimal values like {{infinity}}.
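On the ARROW-2145 failure itself: `decimal.Decimal` admits special values (NaN, Infinity) that cannot be scaled to an unscaled integer, so a decimal-to-Arrow converter has to detect them before conversion. A hedged pure-Python sketch of that pre-check (the helper name and null-handling policy are illustrative, not pyarrow's actual fix):

```python
import decimal

def decimal_to_scaled_int(value, scale):
    """Return (unscaled_int, is_valid); special values become nulls."""
    if value.is_nan() or value.is_infinite():
        return 0, False  # placeholder, marked invalid (null)
    # Shift the decimal point right by `scale` digits, e.g.
    # Decimal('1.1') with scale=1 becomes the integer 11.
    return int(value.scaleb(scale)), True

print(decimal_to_scaled_int(decimal.Decimal("1.1"), 1))  # (11, True)
print(decimal_to_scaled_int(decimal.Decimal("NaN"), 1))  # (0, False)
```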
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
    [ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381742#comment-16381742 ]

ASF GitHub Bot commented on ARROW-2135:
---------------------------------------

pitrou commented on a change in pull request #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#discussion_r171503669

## File path: cpp/src/arrow/python/type_traits.h

@@ -127,8 +134,14 @@
 template <>
 struct npy_traits<NPY_OBJECT> {
   typedef PyObject* value_type;
   static constexpr bool supports_nulls = true;
+
+  static inline bool isnull(PyObject* v) { return v != Py_None; }

Review comment: Nice catch :-) I'm not sure how to test it. Defining `isnull` is necessary for compiling, but that path isn't taken at runtime, as object arrays are handled separately.
[jira] [Commented] (ARROW-2194) [Python] Pandas columns metadata incorrect for empty string columns
    [ https://issues.apache.org/jira/browse/ARROW-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381739#comment-16381739 ]

Florian Jetter commented on ARROW-2194:
---------------------------------------

I haven't checked master, but on 0.8.0 all other column types write their pandas type explicitly even though the DataFrame is empty. I have no objections as long as this behavior is consistent across dtypes.

> [Python] Pandas columns metadata incorrect for empty string columns
> -------------------------------------------------------------------
>
>                 Key: ARROW-2194
>                 URL: https://issues.apache.org/jira/browse/ARROW-2194
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Florian Jetter
>            Priority: Minor
>             Fix For: 0.9.0
>
> The {{pandas_type}} for {{bytes}} or {{unicode}} columns of an empty pandas
> DataFrame is unexpectedly {{float64}}:
>
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import json
> empty_df = pd.DataFrame({'unicode': np.array([], dtype=np.unicode_),
>                          'bytes': np.array([], dtype=np.bytes_)})
> empty_table = pa.Table.from_pandas(empty_df)
> json.loads(empty_table.schema.metadata[b'pandas'])['columns']
> # Same behavior for input dtype np.unicode_
> [{u'field_name': u'bytes',
>   u'metadata': None,
>   u'name': u'bytes',
>   u'numpy_type': u'object',
>   u'pandas_type': u'float64'},
>  {u'field_name': u'unicode',
>   u'metadata': None,
>   u'name': u'unicode',
>   u'numpy_type': u'object',
>   u'pandas_type': u'float64'},
>  {u'field_name': u'__index_level_0__',
>   u'metadata': None,
>   u'name': None,
>   u'numpy_type': u'int64',
>   u'pandas_type': u'int64'}]
> {code}
>
> Tested on Debian 8 with python 2.7 and python 3.6.4
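A plausible mechanism for the `float64` surprise above (this is a speculative pure-Python sketch, not pyarrow's actual metadata code, and the helper name is hypothetical): the `pandas_type` for object-dtype columns is inferred from the column's values, and with zero values to inspect the inference falls through to a generic default, so empty `bytes`/`unicode` columns end up mislabelled.

```python
def infer_pandas_type(values, dtype):
    """Toy value-based inference mimicking the reported behavior."""
    if dtype == "object":
        # Inspect the actual values to decide between bytes and unicode.
        kinds = {("bytes" if isinstance(v, bytes) else "unicode") for v in values}
        if len(kinds) == 1:
            return kinds.pop()
        # Empty (or mixed) object columns fall through to a generic default,
        # which is how an empty string column can report 'float64'.
        return "float64"
    return dtype

print(infer_pandas_type([], "object"))      # 'float64'  (the reported bug)
print(infer_pandas_type([u"a"], "object"))  # 'unicode'
```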