[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2296:
---------------------------------
Description:

Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches. This is useful when we want to read into contiguous buffers, because it allows us to allocate the right sizes up front.

I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch bodies.

was:

Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches.

I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch bodies.

> Add num_rows to file footer
> ---------------------------
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Format
> Reporter: Lawrence Chan
> Priority: Minor

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2296:
---------------------------------
Component/s: C++
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2296:
---------------------------------
Component/s: Format
[jira] [Created] (ARROW-2296) Add num_rows to file footer
Lawrence Chan created ARROW-2296:
---------------------------------
Summary: Add num_rows to file footer
Key: ARROW-2296
URL: https://issues.apache.org/jira/browse/ARROW-2296
Project: Apache Arrow
Issue Type: Improvement
Reporter: Lawrence Chan

Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches.

I'd like to propose that we add `num_rows` as a field to the footer so it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch bodies.
[jira] [Updated] (ARROW-2296) Add num_rows to file footer
[ https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2296:
---------------------------------
Description:

Maybe I'm overlooking something, but I don't see anything on the API surface to get the number of rows in an Arrow file without reading all the record batches.

I'd like to propose that we add `num_rows` as a field in the file footer so it's easy to query without reading the whole file.

Meanwhile, before we get that added to the official format fbs, it would be nice to have a method that iterates over the record batch headers and sums up the lengths without reading the actual record batch bodies.
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393861#comment-16393861 ]

ASF GitHub Bot commented on ARROW-2181:
---------------------------------------
BryanCutler commented on issue #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables
URL: https://github.com/apache/arrow/pull/1733#issuecomment-371983425

Screen of doc changes:
![image](https://user-images.githubusercontent.com/4534389/37235796-9902463c-23b6-11e8-8030-27e538bf5d11.png)

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Add concat_tables to API reference, add documentation on use
> ---------------------------------------------------------------------
>
> Key: ARROW-2181
> URL: https://issues.apache.org/jira/browse/ARROW-2181
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Bryan Cutler
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> This omission of documentation was mentioned on the mailing list on February 13. The documentation should illustrate the contrast between {{Table.from_batches}} and {{concat_tables}}.
[jira] [Commented] (ARROW-2099) [Python] Support DictionaryArray::FromArrays in Python bindings
[ https://issues.apache.org/jira/browse/ARROW-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393590#comment-16393590 ]

ASF GitHub Bot commented on ARROW-2099:
---------------------------------------
wesm commented on a change in pull request #1734: ARROW-2099: [Python] Add safe option to DictionaryArray.from_arrays to do boundschecking of indices by default
URL: https://github.com/apache/arrow/pull/1734#discussion_r173567908

## File path: python/setup.py
@@ -208,8 +207,10 @@ def _run_cmake(self):
             cmake_options.append('-DCMAKE_BUILD_TYPE={0}'
                                  .format(self.build_type.lower()))
-        cmake_options.append('-DBoost_NAMESPACE={}'.format(
-            self.boost_namespace))
+
+        if self.boost_namespace is not None:
+            cmake_options.append('-DBoost_NAMESPACE={}'
+                                 .format(self.boost_namespace))

Review comment: I added this to prevent a CMake warning when Boost isn't being bundled.

> [Python] Support DictionaryArray::FromArrays in Python bindings
> ---------------------------------------------------------------
>
> Key: ARROW-2099
> URL: https://issues.apache.org/jira/browse/ARROW-2099
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> Follow up work from ARROW-1757.
[jira] [Updated] (ARROW-2099) [Python] Support DictionaryArray::FromArrays in Python bindings
[ https://issues.apache.org/jira/browse/ARROW-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2099:
----------------------------------
Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-2099) [Python] Support DictionaryArray::FromArrays in Python bindings
[ https://issues.apache.org/jira/browse/ARROW-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393588#comment-16393588 ]

ASF GitHub Bot commented on ARROW-2099:
---------------------------------------
wesm opened a new pull request #1734: ARROW-2099: [Python] Add safe option to DictionaryArray.from_arrays to do boundschecking of indices by default
URL: https://github.com/apache/arrow/pull/1734
[jira] [Updated] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2295:
---------------------------------
Description:

There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional index and whatnot of the `pandas.Series`.

> Add to_numpy functions
> ----------------------
>
> Key: ARROW-2295
> URL: https://issues.apache.org/jira/browse/ARROW-2295
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Lawrence Chan
> Priority: Minor
[jira] [Updated] (ARROW-2295) Add to_numpy functions
[ https://issues.apache.org/jira/browse/ARROW-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lawrence Chan updated ARROW-2295:
---------------------------------
Description:

There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either an ordered dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional index and whatnot of the `pandas.Series`.
[jira] [Updated] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean
[ https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1491:
--------------------------------
Fix Version/s: (was: 0.9.0)
               0.10.0

> [C++] Add casting implementations from strings to numbers or boolean
> --------------------------------------------------------------------
>
> Key: ARROW-1491
> URL: https://issues.apache.org/jira/browse/ARROW-1491
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Licht Takeuchi
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
[jira] [Created] (ARROW-2295) Add to_numpy functions
Lawrence Chan created ARROW-2295:
---------------------------------
Summary: Add to_numpy functions
Key: ARROW-2295
URL: https://issues.apache.org/jira/browse/ARROW-2295
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Reporter: Lawrence Chan

There are `to_pandas()` functions, but no `to_numpy()` functions. I'd like to propose that we include both.

Also, `pyarrow.lib.Array.to_pandas()` returns a `numpy.ndarray`, which imho is very confusing :). I think it would be more intuitive for the `to_pandas()` functions to return `pandas.Series` and `pandas.DataFrame` objects, and the `to_numpy()` functions to return `numpy.ndarray` and either a dict of `numpy.ndarray` or a structured `numpy.ndarray` depending on a flag, for example. The `to_pandas()` function is of course welcome to use the `to_numpy()` func to avoid the additional indexes and whatnot of the `pandas.Series`.
[jira] [Updated] (ARROW-2027) [C++] ipc::Message::SerializeTo does not pad the message body
[ https://issues.apache.org/jira/browse/ARROW-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2027:
--------------------------------
Fix Version/s: (was: 0.9.0)
               0.10.0

> [C++] ipc::Message::SerializeTo does not pad the message body
> -------------------------------------------------------------
>
> Key: ARROW-2027
> URL: https://issues.apache.org/jira/browse/ARROW-2027
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Wes McKinney
> Assignee: Panchen Xue
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.10.0
>
> I just want to note this here as a follow-up to ARROW-1860. I think that padding is the correct behavior, but I wasn't sure enough to make the fix there.
[jira] [Commented] (ARROW-1491) [C++] Add casting implementations from strings to numbers or boolean
[ https://issues.apache.org/jira/browse/ARROW-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393516#comment-16393516 ]

Wes McKinney commented on ARROW-1491:
-------------------------------------
Moving to 0.10.0
[jira] [Commented] (ARROW-2027) [C++] ipc::Message::SerializeTo does not pad the message body
[ https://issues.apache.org/jira/browse/ARROW-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393515#comment-16393515 ]

Wes McKinney commented on ARROW-2027:
-------------------------------------
Deferring this to 0.10.0 to prioritize getting ARROW-1860 and ARROW-1996 done.
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393510#comment-16393510 ]

ASF GitHub Bot commented on ARROW-2282:
---------------------------------------
wesm commented on a change in pull request #1720: ARROW-2282: [Python] Create StringArray from buffers
URL: https://github.com/apache/arrow/pull/1720#discussion_r173561454

## File path: python/pyarrow/tests/test_array.py
@@ -258,6 +258,26 @@ def test_union_from_sparse():
     assert result.to_pylist() == [b'a', 1, b'b', b'c', 2, 3, b'd']


+def test_string_from_buffers():
+    array = pa.array(["a", None, "b", "c"])
+
+    buffers = array.buffers()
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0], array.null_count,
+        array.offset)
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    copied = pa.StringArray.from_buffers(
+        len(array), buffers[1], buffers[2], buffers[0])
+    assert copied.to_pylist() == ["a", None, "b", "c"]
+
+    sliced = array[1:]
+    copied = pa.StringArray.from_buffers(
+        len(sliced), buffers[1], buffers[2], buffers[0], -1, sliced.offset)
+    buffers = array.buffers()
+    assert copied.to_pylist() == [None, "b", "c"]

Review comment: We need to add checks for the computed null count, and for the case where the null bitmap is not passed.

> [Python] Create StringArray from buffers
> ----------------------------------------
>
> Key: ARROW-2282
> URL: https://issues.apache.org/jira/browse/ARROW-2282
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> While we will add a more general-purpose functionality in https://issues.apache.org/jira/browse/ARROW-2281, the interface is more complicated than the constructor that explicitly states all arguments: {{StringArray(int64_t length, const std::shared_ptr& value_offsets, …}}
> Thus I will also expose this explicit constructor.
[jira] [Created] (ARROW-2294) Fix splitAndTransfer for variable width vector
Siddharth Teotia created ARROW-2294:
------------------------------------
Summary: Fix splitAndTransfer for variable width vector
Key: ARROW-2294
URL: https://issues.apache.org/jira/browse/ARROW-2294
Project: Apache Arrow
Issue Type: Bug
Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia

When we splitAndTransfer a vector, the value count to set for the target vector should be equal to the split length and not the value count of the source vector. We have seen cases in operators like FLATTEN, and under low memory conditions, where we end up allocating a lot more memory for the target vector because of using a large value in setValueCount after the split and transfer is done.
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393501#comment-16393501 ]

ASF GitHub Bot commented on ARROW-2181:
---------------------------------------
BryanCutler commented on issue #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables
URL: https://github.com/apache/arrow/pull/1733#issuecomment-371936444

Great, thanks, I'll run it.
[jira] [Commented] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393495#comment-16393495 ]

ASF GitHub Bot commented on ARROW-2262:
---------------------------------------
wesm commented on a change in pull request #1702: ARROW-2262: [Python] Support slicing on pyarrow.ChunkedArray
URL: https://github.com/apache/arrow/pull/1702#discussion_r173559111

## File path: python/pyarrow/table.pxi
@@ -77,6 +77,52 @@ cdef class ChunkedArray:
         self._check_nullptr()
         return self.chunked_array.null_count()

+    def __getitem__(self, key):
+        cdef int64_t item
+        cdef int i
+        self._check_nullptr()
+        if isinstance(key, slice):
+            return _normalize_slice(self, key)
+        elif isinstance(key, six.integer_types):
+            item = key
+            if item >= self.chunked_array.length() or item < 0:
+                return IndexError("ChunkedArray selection out of bounds")

Review comment: Agreed, perhaps let's handle this as a follow-up patch.

> [Python] Support slicing on pyarrow.ChunkedArray
> ------------------------------------------------
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
[jira] [Resolved] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-2262.
---------------------------------
Resolution: Fixed

Issue resolved by pull request 1702
[https://github.com/apache/arrow/pull/1702]
[jira] [Commented] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393497#comment-16393497 ]

ASF GitHub Bot commented on ARROW-2262:
---------------------------------------
wesm closed pull request #1702: ARROW-2262: [Python] Support slicing on pyarrow.ChunkedArray
URL: https://github.com/apache/arrow/pull/1702

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index d95f01661..776b96531 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -387,6 +387,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil:
         int num_chunks()
         shared_ptr[CArray] chunk(int i)
         shared_ptr[CDataType] type()
+        shared_ptr[CChunkedArray] Slice(int64_t offset, int64_t length) const
+        shared_ptr[CChunkedArray] Slice(int64_t offset) const

     cdef cppclass CColumn" arrow::Column":
         CColumn(const shared_ptr[CField]& field,
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index c27c0edd9..94041e465 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -77,6 +77,52 @@ cdef class ChunkedArray:
         self._check_nullptr()
         return self.chunked_array.null_count()

+    def __getitem__(self, key):
+        cdef int64_t item
+        cdef int i
+        self._check_nullptr()
+        if isinstance(key, slice):
+            return _normalize_slice(self, key)
+        elif isinstance(key, six.integer_types):
+            item = key
+            if item >= self.chunked_array.length() or item < 0:
+                return IndexError("ChunkedArray selection out of bounds")
+            for i in range(self.num_chunks):
+                if item < self.chunked_array.chunk(i).get().length():
+                    return self.chunk(i)[item]
+                else:
+                    item -= self.chunked_array.chunk(i).get().length()
+        else:
+            raise TypeError("key must either be a slice or integer")
+
+    def slice(self, offset=0, length=None):
+        """
+        Compute zero-copy slice of this ChunkedArray
+
+        Parameters
+        ----------
+        offset : int, default 0
+            Offset from start of array to slice
+        length : int, default None
+            Length of slice (default is until end of batch starting from
+            offset)
+
+        Returns
+        -------
+        sliced : ChunkedArray
+        """
+        cdef shared_ptr[CChunkedArray] result
+
+        if offset < 0:
+            raise IndexError('Offset must be non-negative')
+
+        if length is None:
+            result = self.chunked_array.Slice(offset)
+        else:
+            result = self.chunked_array.Slice(offset, length)
+
+        return pyarrow_wrap_chunked_array(result)
+
     @property
     def num_chunks(self):
         """
diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py
index e72761d32..356ecb7e0 100644
--- a/python/pyarrow/tests/test_table.py
+++ b/python/pyarrow/tests/test_table.py
@@ -24,6 +24,21 @@
 import pyarrow as pa


+def test_chunked_array_getitem():
+    data = [
+        pa.array([1, 2, 3]),
+        pa.array([4, 5, 6])
+    ]
+    data = pa.chunked_array(data)
+    assert data[1].as_py() == 2
+
+    data_slice = data[2:4]
+    assert data_slice.to_pylist() == [3, 4]
+
+    data_slice = data[4:-1]
+    assert data_slice.to_pylist() == [5]
+
+
 def test_column_basics():
     data = [
         pa.array([-10, -5, 0, 5, 10])
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393494#comment-16393494 ] ASF GitHub Bot commented on ARROW-2181: --- BryanCutler commented on a change in pull request #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables URL: https://github.com/apache/arrow/pull/1733#discussion_r173559109 ## File path: python/doc/source/data.rst ## @@ -393,6 +393,22 @@ objects to contiguous NumPy arrays for use in pandas: c.to_pandas() +Multiple tables can also be concatenated together to form a single table using +``pa.concat_tables``, if the schemas are equal: + +.. ipython:: python + + tables = [table] * 2 + table_all = pa.concat_tables(tables) + table_all.num_rows + c = table_all[0] + c.data.num_chunks + +This is similar to ``Table.from_batches``, but uses tables as input instead of +record batches. Record batches can be made into tables, but not the other way +around, so if your data is already in table form, then use +``pa.concat_tables``. Review comment: Sure, will do This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
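The behavior the new documentation describes — concatenation succeeds only when all input tables share a schema — can be illustrated with a toy pure-Python analogue (this is not pyarrow's implementation; tables are modeled as plain dicts for the sketch):

```python
def concat_tables(tables):
    """Toy analogue of pa.concat_tables: concatenate row lists,
    requiring every input table to share the same column names."""
    schemas = {tuple(t["schema"]) for t in tables}
    if len(schemas) != 1:
        raise ValueError("Schemas of all tables must be equal")
    rows = [row for t in tables for row in t["rows"]]
    return {"schema": list(schemas.pop()), "rows": rows}

table = {"schema": ["a", "b"], "rows": [(1, "x"), (2, "y")]}
table_all = concat_tables([table] * 2)
print(len(table_all["rows"]))  # → 4
```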
[jira] [Updated] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on
[ https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2193: Fix Version/s: (was: 0.9.0) 0.10.0 > [Plasma] plasma_store has runtime dependency on Boost shared libraries when > ARROW_BOOST_USE_SHARED=on > - > > Key: ARROW-2193 > URL: https://issues.apache.org/jira/browse/ARROW-2193 > Project: Apache Arrow > Issue Type: Bug > Components: Plasma (C++) >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I'm not sure why, but when I run the pyarrow test suite (for example > {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly: > {code:bash} > $ ps fuwww > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > [...] > antoine 27869 12.0 0.4 863208 68976 pts/7S13:41 0:01 > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27885 13.0 0.4 863076 68560 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27901 12.1 0.4 863076 68320 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27920 13.6 0.4 863208 68868 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > [etc.] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393467#comment-16393467 ] ASF GitHub Bot commented on ARROW-2181: --- wesm commented on issue #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables URL: https://github.com/apache/arrow/pull/1733#issuecomment-371930766 see https://github.com/apache/arrow/tree/master/python#building-the-documentation -- if you are able to run the test suite / build the project then you can run those commands This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393464#comment-16393464 ]

ASF GitHub Bot commented on ARROW-2288:
---

wesm closed pull request #1723: ARROW-2288: [Python] Fix slicing logic
URL: https://github.com/apache/arrow/pull/1723

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index e785c0ec5..cc65c0771 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -205,15 +205,25 @@ def asarray(values, type=None):
 
 
 def _normalize_slice(object arrow_obj, slice key):
-    cdef Py_ssize_t n = len(arrow_obj)
+    cdef:
+        Py_ssize_t start, stop, step
+        Py_ssize_t n = len(arrow_obj)
 
     start = key.start or 0
-    while start < 0:
+    if start < 0:
         start += n
+        if start < 0:
+            start = 0
+    elif start >= n:
+        start = n
 
     stop = key.stop if key.stop is not None else n
-    while stop < 0:
+    if stop < 0:
         stop += n
+        if stop < 0:
+            stop = 0
+    elif stop >= n:
+        stop = n
 
     step = key.step or 1
     if step != 1:
diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py
index f034d78b3..4a337ad23 100644
--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -132,17 +132,18 @@ def test_array_slice():
 
     # Test slice notation
     assert arr[2:].equals(arr.slice(2))
-
     assert arr[2:5].equals(arr.slice(2, 3))
-
     assert arr[-5:].equals(arr.slice(len(arr) - 5))
-
     with pytest.raises(IndexError):
         arr[::-1]
-
     with pytest.raises(IndexError):
         arr[::2]
 
+    n = len(arr)
+    for start in range(-n * 2, n * 2):
+        for stop in range(-n * 2, n * 2):
+            assert arr[start:stop].to_pylist() == arr.to_pylist()[start:stop]
+
 
 def test_array_factory_invalid_type():
     arr = np.array([datetime.timedelta(1), datetime.timedelta(2)])
This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] slicing logic defective > > > Key: ARROW-2288 > URL: https://issues.apache.org/jira/browse/ARROW-2288 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The slicing logic tends to go too far when normalizing large negative bounds, > which leads to results not in line with Python's slicing semantics: > {code} > >>> arr = pa.array([1,2,3,4]) > >>> arr[-99:100] > > [ > 2, > 3, > 4 > ] > >>> arr.to_pylist()[-99:100] > [1, 2, 3, 4] > >>> > >>> > >>> arr[-6:-5] > > [ > 3 > ] > >>> arr.to_pylist()[-6:-5] > [] > {code} > Also note this crash: > {code} > >>> arr[10:13] > /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= > (data.length) > Abandon (core dumped) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
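The fix replaces the old `while` loops, which wrapped large negative bounds repeatedly, with a single adjustment followed by clamping to `[0, n]`. That clamping behavior can be reproduced in pure Python and checked exhaustively against list slicing, in the spirit of the new test (the `normalize_bounds` helper is hypothetical, not a pyarrow API):

```python
def normalize_bounds(start, stop, n):
    """Normalize slice bounds the way Python does for step == 1."""
    # A negative bound counts from the end exactly once; after that
    # adjustment it is clamped to [0, n] rather than wrapped again.
    start = 0 if start is None else start
    start = max(start + n, 0) if start < 0 else min(start, n)
    stop = n if stop is None else stop
    stop = max(stop + n, 0) if stop < 0 else min(stop, n)
    return start, stop

# Exhaustive check against Python's own slicing semantics.
data = [1, 2, 3, 4]
n = len(data)
for lo in range(-2 * n, 2 * n):
    for hi in range(-2 * n, 2 * n):
        s, e = normalize_bounds(lo, hi, n)
        assert data[s:e] == data[lo:hi]

print(normalize_bounds(-99, 100, 4))  # → (0, 4)
```

With these rules, `arr[-99:100]` covers the whole array and `arr[-6:-5]` is empty, matching the examples in the issue description.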
[jira] [Resolved] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2288. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1723 [https://github.com/apache/arrow/pull/1723] > [Python] slicing logic defective > > > Key: ARROW-2288 > URL: https://issues.apache.org/jira/browse/ARROW-2288 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The slicing logic tends to go too far when normalizing large negative bounds, > which leads to results not in line with Python's slicing semantics: > {code} > >>> arr = pa.array([1,2,3,4]) > >>> arr[-99:100] > > [ > 2, > 3, > 4 > ] > >>> arr.to_pylist()[-99:100] > [1, 2, 3, 4] > >>> > >>> > >>> arr[-6:-5] > > [ > 3 > ] > >>> arr.to_pylist()[-6:-5] > [] > {code} > Also note this crash: > {code} > >>> arr[10:13] > /home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= > (data.length) > Abandon (core dumped) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393463#comment-16393463 ] ASF GitHub Bot commented on ARROW-2181: --- wesm commented on a change in pull request #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables URL: https://github.com/apache/arrow/pull/1733#discussion_r173553957 ## File path: python/doc/source/data.rst ## @@ -393,6 +393,22 @@ objects to contiguous NumPy arrays for use in pandas: c.to_pandas() +Multiple tables can also be concatenated together to form a single table using +``pa.concat_tables``, if the schemas are equal: + +.. ipython:: python + + tables = [table] * 2 + table_all = pa.concat_tables(tables) + table_all.num_rows + c = table_all[0] + c.data.num_chunks + +This is similar to ``Table.from_batches``, but uses tables as input instead of +record batches. Record batches can be made into tables, but not the other way +around, so if your data is already in table form, then use +``pa.concat_tables``. Review comment: Can you spell out `pyarrow` here and above? We might turn these into API links later This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on
[ https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393461#comment-16393461 ] ASF GitHub Bot commented on ARROW-2193: --- wesm commented on issue #1711: WIP ARROW-2193: [C++] Do not depend on Boost libraries at runtime in plasma_store URL: https://github.com/apache/arrow/pull/1711#issuecomment-371929957 @xhochy It seems that `CHECK_CXX_COMPILER_FLAG` doesn't turn up the right answer on macOS. I'm afraid we'll have to leave this PR in WIP unless someone else can figure this out for 0.9.0 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Plasma] plasma_store has runtime dependency on Boost shared libraries when > ARROW_BOOST_USE_SHARED=on > - > > Key: ARROW-2193 > URL: https://issues.apache.org/jira/browse/ARROW-2193 > Project: Apache Arrow > Issue Type: Bug > Components: Plasma (C++) >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure why, but when I run the pyarrow test suite (for example > {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly: > {code:bash} > $ ps fuwww > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > [...] 
> antoine 27869 12.0 0.4 863208 68976 pts/7S13:41 0:01 > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27885 13.0 0.4 863076 68560 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27901 12.1 0.4 863076 68320 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27920 13.6 0.4 863208 68868 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > [etc.] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2193) [Plasma] plasma_store has runtime dependency on Boost shared libraries when ARROW_BOOST_USE_SHARED=on
[ https://issues.apache.org/jira/browse/ARROW-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393456#comment-16393456 ] ASF GitHub Bot commented on ARROW-2193: --- wesm commented on issue #1711: WIP ARROW-2193: [C++] Do not depend on Boost libraries at runtime in plasma_store URL: https://github.com/apache/arrow/pull/1711#issuecomment-371928949 I am not sure why libboost_regex is still a runtime dependency, even with `--as-needed` -- Plasma doesn't appear to have any symbols with a transitive dependency on code in arrow/util/decimal.cc (where boost::regex is used) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Plasma] plasma_store has runtime dependency on Boost shared libraries when > ARROW_BOOST_USE_SHARED=on > - > > Key: ARROW-2193 > URL: https://issues.apache.org/jira/browse/ARROW-2193 > Project: Apache Arrow > Issue Type: Bug > Components: Plasma (C++) >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure why, but when I run the pyarrow test suite (for example > {{py.test pyarrow/tests/test_plasma.py}}), plasma_store forks endlessly: > {code:bash} > $ ps fuwww > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > [...] 
> antoine 27869 12.0 0.4 863208 68976 pts/7S13:41 0:01 > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27885 13.0 0.4 863076 68560 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27901 12.1 0.4 863076 68320 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > antoine 27920 13.6 0.4 863208 68868 pts/7S13:41 0:01 \_ > /home/antoine/miniconda3/envs/pyarrow/bin/python > /home/antoine/arrow/python/pyarrow/plasma_store -s /tmp/plasma_store40209423 > -m 1 > [etc.] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393449#comment-16393449 ] ASF GitHub Bot commented on ARROW-2181: --- BryanCutler commented on issue #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables URL: https://github.com/apache/arrow/pull/1733#issuecomment-371926208 cc @wesm @xhochy , I didn't get a chance to build the docs to try this out yet. Is it done with the gen_api_docs docker image, or is there another easier way to do it? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393442#comment-16393442 ] ASF GitHub Bot commented on ARROW-2181: --- BryanCutler opened a new pull request #1733: ARROW-2181: [PYTHON][DOC] Add doc on usage of concat_tables URL: https://github.com/apache/arrow/pull/1733 Adding Python API doc on usage of pa.concat_tables. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2181: -- Labels: pull-request-available (was: ) > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2291) [C++] README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393321#comment-16393321 ]

ASF GitHub Bot commented on ARROW-2291:
---

wesm closed pull request #1732: ARROW-2291: [C++] Add additional libboost-regex-dev to build instructions in README
URL: https://github.com/apache/arrow/pull/1732

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/README.md b/cpp/README.md
index daeeade72..8018efd9e 100644
--- a/cpp/README.md
+++ b/cpp/README.md
@@ -35,6 +35,7 @@ On Ubuntu/Debian you can install the requirements with:
 
 ```shell
 sudo apt-get install cmake \
      libboost-dev \
+     libboost-regex-dev \
      libboost-filesystem-dev \
      libboost-system-dev
 ```

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [C++] README missing instructions for libboost-regex-dev
>
>                 Key: ARROW-2291
>                 URL: https://issues.apache.org/jira/browse/ARROW-2291
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>         Environment: Ubuntu 16.04
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Trivial
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
> After following the instructions in the README, I could not generate a
> makefile using CMake because of a missing dependency.
> The README needs to be updated to include installing libboost-regex-dev.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-2291) [C++] README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2291: -- Labels: pull-request-available (was: ) > [C++] README missing instructions for libboost-regex-dev > > > Key: ARROW-2291 > URL: https://issues.apache.org/jira/browse/ARROW-2291 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: Ubuntu 16.04 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > > After following the instructions in the README, I could not generate a > makefile using CMake because of a missing dependency. > The README needs to be updated to include installing libboost-regex-dev. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2291) [C++] README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2291. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1732 [https://github.com/apache/arrow/pull/1732] > [C++] README missing instructions for libboost-regex-dev > > > Key: ARROW-2291 > URL: https://issues.apache.org/jira/browse/ARROW-2291 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: Ubuntu 16.04 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > > After following the instructions in the README, I could not generate a > makefile using CMake because of a missing dependency. > The README needs to be updated to include installing libboost-regex-dev. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2293) [JS] Print release vote e-mail template when making source release
Wes McKinney created ARROW-2293: --- Summary: [JS] Print release vote e-mail template when making source release Key: ARROW-2293 URL: https://issues.apache.org/jira/browse/ARROW-2293 Project: Apache Arrow Issue Type: Improvement Components: JavaScript Reporter: Wes McKinney This would help with streamlining the source release process. See https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate for an example -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393313#comment-16393313 ] ASF GitHub Bot commented on ARROW-1974: --- cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173526240 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: Right, makes sense. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. 
> {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
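The review comment above concerns the field-lookup contract in parquet-cpp's `GroupNode`: when several fields share a name, lookup by name returns the smallest matching index, or a negative value when there is no match. A small sketch of that contract (the `field_index` helper is hypothetical, not the parquet-cpp code):

```python
def field_index(field_names, name):
    """Return the smallest index of a field with this name,
    or -1 if no field matches (duplicate names are allowed)."""
    for i, field in enumerate(field_names):
        if field == name:
            return i
    return -1

# With two `__index_level_0__` columns, the first occurrence wins.
names = ["a", "__index_level_0__", "b", "__index_level_0__"]
print(field_index(names, "__index_level_0__"))  # → 1
print(field_index(names, "missing"))            # → -1
```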
[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393062#comment-16393062 ] ASF GitHub Bot commented on ARROW-2236: --- wesm commented on issue #1683: ARROW-2236: [JS] Add more complete set of predicates URL: https://github.com/apache/arrow/pull/1683#issuecomment-371856273 Nope, it should be pretty straightforward. It might be nice to have the release script generate an e-mail template like https://github.com/apache/parquet-cpp/blob/master/dev/release/release-candidate#L257 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [JS] Add more complete set of predicates > > > Key: ARROW-2236 > URL: https://issues.apache.org/jira/browse/ARROW-2236 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and || > We should also support !=, <, > at the very least -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393057#comment-16393057 ] ASF GitHub Bot commented on ARROW-2275: --- wesm commented on a change in pull request #1717: ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data() URL: https://github.com/apache/arrow/pull/1717#discussion_r173491445 ## File path: cpp/src/arrow/buffer.h ## @@ -54,7 +54,11 @@ class ARROW_EXPORT Buffer { /// /// \note The passed memory must be kept alive through some other means Buffer(const uint8_t* data, int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), +data_(data), +mutable_data_(nullptr), Review comment: `nullptr` is incompatible C++/CLI, which I believe is a way for C# code to link to C++ libraries. see https://msdn.microsoft.com/en-us/library/4ex65770.aspx This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393044#comment-16393044 ] ASF GitHub Bot commented on ARROW-2236: --- TheNeuralBit commented on issue #1683: ARROW-2236: [JS] Add more complete set of predicates URL: https://github.com/apache/arrow/pull/1683#issuecomment-371852964 Thanks Wes! Anything I can do to help out with the release process? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [JS] Add more complete set of predicates > > > Key: ARROW-2236 > URL: https://issues.apache.org/jira/browse/ARROW-2236 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and || > We should also support !=, <, > at the very least -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393035#comment-16393035 ] ASF GitHub Bot commented on ARROW-2275: --- pitrou commented on a change in pull request #1717: ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data() URL: https://github.com/apache/arrow/pull/1717#discussion_r173487404 ## File path: cpp/src/arrow/buffer.h ## @@ -54,7 +54,11 @@ class ARROW_EXPORT Buffer { /// /// \note The passed memory must be kept alive through some other means Buffer(const uint8_t* data, int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), +data_(data), +mutable_data_(nullptr), Review comment: Can you expand on the NULLPTR macro? Does it do something more than `nullptr`? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
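The patch under review initializes `mutable_data_`, and the issue also floats adding a DCHECK before `mutable_data()` returns. As a language-neutral illustration of that guard (a hedged Python analogy, not the actual Arrow C++ API), an immutable buffer can fail loudly instead of handing back a bogus writable pointer:

```python
class Buffer:
    """Toy buffer illustrating the guard discussed in ARROW-2275.

    Names are illustrative only; the real Arrow Buffer is a C++ class.
    """

    def __init__(self, data: bytes, mutable: bool = False):
        self._data = bytearray(data) if mutable else bytes(data)
        self._mutable = mutable

    def data(self):
        # Read-only access is always allowed.
        return bytes(self._data)

    def mutable_data(self):
        # Analogue of the proposed DCHECK: raise instead of returning
        # an uninitialized/bogus writable reference.
        if not self._mutable:
            raise RuntimeError("mutable_data() called on immutable buffer")
        return self._data
```

The design point is the same either way: the failure mode moves from silent memory corruption to an immediate, diagnosable error.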
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393031#comment-16393031 ] ASF GitHub Bot commented on ARROW-1974: --- pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173486366 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: Yes, it was, it just wasn't necessarily the one expected by the caller according to its semantics. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. 
> {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
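The review comment quoted above documents that name-based field lookup returns the smallest matching index when several fields share a name, which is precisely the ambiguity behind these duplicate-column segfaults. A minimal sketch of that lookup rule (illustrative only, not the parquet-cpp implementation):

```python
def field_index(field_names, name):
    """Return the smallest index of `name` in `field_names`, or a
    negative value if absent, mirroring the documented semantics of
    GroupNode field lookup when names collide."""
    for i, candidate in enumerate(field_names):
        if candidate == name:
            return i  # first (smallest-index) match wins
    return -1
```

With two `__index_level_0__` columns, such a lookup always resolves to the first one; callers that assume names are unique must handle that, or work with positional indices instead.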
[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393011#comment-16393011 ] ASF GitHub Bot commented on ARROW-2236: --- wesm commented on issue #1683: ARROW-2236: [JS] Add more complete set of predicates URL: https://github.com/apache/arrow/pull/1683#issuecomment-371846331 I'll get going on a 0.3.1 RC This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [JS] Add more complete set of predicates > > > Key: ARROW-2236 > URL: https://issues.apache.org/jira/browse/ARROW-2236 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and || > We should also support !=, <, > at the very least -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393009#comment-16393009 ] ASF GitHub Bot commented on ARROW-2236: --- wesm closed pull request #1683: ARROW-2236: [JS] Add more complete set of predicates URL: https://github.com/apache/arrow/pull/1683 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/js/src/Arrow.externs.js b/js/src/Arrow.externs.js index cf4db9134..be89be152 100644 --- a/js/src/Arrow.externs.js +++ b/js/src/Arrow.externs.js @@ -74,17 +74,24 @@ var custom = function () {}; var Value = function() {}; /** @type {?} */ -Value.prototype.gteq; +Value.prototype.ge; /** @type {?} */ -Value.prototype.lteq; +Value.prototype.le; /** @type {?} */ Value.prototype.eq; +/** @type {?} */ +Value.prototype.lt; +/** @type {?} */ +Value.prototype.gt; +/** @type {?} */ +Value.prototype.ne; var Col = function() {}; /** @type {?} */ Col.prototype.bind; var Or = function() {}; var And = function() {}; +var Not = function() {}; var GTeq = function () {}; /** @type {?} */ GTeq.prototype.and; @@ -108,6 +115,8 @@ Predicate.prototype.and; /** @type {?} */ Predicate.prototype.or; /** @type {?} */ +Predicate.prototype.not; +/** @type {?} */ Predicate.prototype.ands; var Literal = function() {}; @@ -209,6 +218,8 @@ Int128.prototype.plus /** @type {?} */ Int128.prototype.hex +var packBools = function() {}; + var Type = function() {}; /** @type {?} */ Type.NONE = function() {}; diff --git a/js/src/Arrow.ts b/js/src/Arrow.ts index 4a0a2ac6d..23e8b9983 100644 --- a/js/src/Arrow.ts +++ b/js/src/Arrow.ts @@ -18,7 +18,8 @@ import * as type_ from './type'; import * as data_ from './data'; import * as vector_ from './vector'; -import * as util_ from './util/int'; +import * 
as util_int_ from './util/int'; +import * as util_bit_ from './util/bit'; import * as visitor_ from './visitor'; import * as view_ from './vector/view'; import * as predicate_ from './predicate'; @@ -40,9 +41,10 @@ export { Table, DataFrame, NextFunc, BindFunc, CountByResult }; export { Field, Schema, RecordBatch, Vector, Type }; export namespace util { -export import Uint64 = util_.Uint64; -export import Int64 = util_.Int64; -export import Int128 = util_.Int128; +export import Uint64 = util_int_.Uint64; +export import Int64 = util_int_.Int64; +export import Int128 = util_int_.Int128; +export import packBools = util_bit_.packBools; } export namespace data { @@ -173,6 +175,7 @@ export namespace predicate { export import Or = predicate_.Or; export import Col = predicate_.Col; export import And = predicate_.And; +export import Not = predicate_.Not; export import GTeq = predicate_.GTeq; export import LTeq = predicate_.LTeq; export import Value = predicate_.Value; @@ -222,16 +225,16 @@ Table['empty'] = Table.empty; Vector['create'] = Vector.create; RecordBatch['from'] = RecordBatch.from; -util_.Uint64['add'] = util_.Uint64.add; -util_.Uint64['multiply'] = util_.Uint64.multiply; +util_int_.Uint64['add'] = util_int_.Uint64.add; +util_int_.Uint64['multiply'] = util_int_.Uint64.multiply; -util_.Int64['add'] = util_.Int64.add; -util_.Int64['multiply'] = util_.Int64.multiply; -util_.Int64['fromString'] = util_.Int64.fromString; +util_int_.Int64['add'] = util_int_.Int64.add; +util_int_.Int64['multiply'] = util_int_.Int64.multiply; +util_int_.Int64['fromString'] = util_int_.Int64.fromString; -util_.Int128['add'] = util_.Int128.add; -util_.Int128['multiply'] = util_.Int128.multiply; -util_.Int128['fromString'] = util_.Int128.fromString; +util_int_.Int128['add'] = util_int_.Int128.add; +util_int_.Int128['multiply'] = util_int_.Int128.multiply; +util_int_.Int128['fromString'] = util_int_.Int128.fromString; data_.ChunkedData['computeOffsets'] = data_.ChunkedData.computeOffsets; 
diff --git a/js/src/predicate.ts b/js/src/predicate.ts index b177b4fa7..bff393863 100644 --- a/js/src/predicate.ts +++ b/js/src/predicate.ts @@ -26,14 +26,23 @@ export abstract class Value { if (!(other instanceof Value)) { other = new Literal(other); } return new Equals(this, other); } -lteq(other: Value | T): Predicate { +le(other: Value | T): Predicate { if (!(other instanceof Value)) { other = new Literal(other); } return new LTeq(this, other); } -gteq(other: Value | T): Predicate { +ge(other: Value | T): Predicate { if (!(other instanceof Value)) { other = new Literal(other); } return new GTeq(this, other); } +lt(other: Value | T): Predicate { +return new Not(this.ge(other)); +} +gt(other: Value | T): Predi
[jira] [Resolved] (ARROW-2236) [JS] Add more complete set of predicates
[ https://issues.apache.org/jira/browse/ARROW-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2236. - Resolution: Fixed Fix Version/s: JS-0.4.0 Issue resolved by pull request 1683 [https://github.com/apache/arrow/pull/1683] > [JS] Add more complete set of predicates > > > Key: ARROW-2236 > URL: https://issues.apache.org/jira/browse/ARROW-2236 > Project: Apache Arrow > Issue Type: Task > Components: JavaScript >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > Fix For: JS-0.4.0 > > > Right now {{arrow.predicate}} only supports ==, >=, <=, &&, and || > We should also support !=, <, > at the very least -- This message was sent by Atlassian JIRA (v7.6.3#76005)
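The merged TypeScript diff derives the new predicates from the existing ones by negation (`lt(other)` returns `Not(this.ge(other))`, and similarly for `gt` and `ne`). A hedged Python sketch of the same composition, with hypothetical class names, shows why only `Not` plus the three existing comparisons are needed:

```python
class Predicate:
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, row):
        return self.fn(row)

    def not_(self):
        # Wrap this predicate in a logical negation.
        return Predicate(lambda row: not self.fn(row))


class Col:
    def __init__(self, name):
        self.name = name

    def ge(self, v):
        return Predicate(lambda row: row[self.name] >= v)

    def le(self, v):
        return Predicate(lambda row: row[self.name] <= v)

    def eq(self, v):
        return Predicate(lambda row: row[self.name] == v)

    # Derived predicates, mirroring the TypeScript diff above:
    def lt(self, v):
        return self.ge(v).not_()  # a < b   is  not (a >= b)

    def gt(self, v):
        return self.le(v).not_()  # a > b   is  not (a <= b)

    def ne(self, v):
        return self.eq(v).not_()  # a != b  is  not (a == b)
```

Reusing negation keeps the predicate surface small: each new operator is one line, and null/short-circuit behavior stays centralized in the primitives.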
[jira] [Assigned] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals
[ https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2250: --- Assignee: Mitar > plasma_store process should cleanup on INT and TERM signals > --- > > Key: ARROW-2250 > URL: https://issues.apache.org/jira/browse/ARROW-2250 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Mitar >Assignee: Mitar >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Currently, if you send an INT and TERM signal to a parent plasma store > process (Python one) it terminates it without cleaning the child process. > This makes it hard to run plasma store in non-interactive mode. Inside shell > ctrl-c kills both processes. > Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably > something nicer should be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals
[ https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2250. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1705 [https://github.com/apache/arrow/pull/1705] > plasma_store process should cleanup on INT and TERM signals > --- > > Key: ARROW-2250 > URL: https://issues.apache.org/jira/browse/ARROW-2250 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Mitar >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Currently, if you send an INT and TERM signal to a parent plasma store > process (Python one) it terminates it without cleaning the child process. > This makes it hard to run plasma store in non-interactive mode. Inside shell > ctrl-c kills both processes. > Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably > something nicer should be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2250) plasma_store process should cleanup on INT and TERM signals
[ https://issues.apache.org/jira/browse/ARROW-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393007#comment-16393007 ] ASF GitHub Bot commented on ARROW-2250: --- wesm closed pull request #1705: ARROW-2250: [Python] Do not create a subprocess for plasma but just use existing process URL: https://github.com/apache/arrow/pull/1705 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 15a37ca10..2afa6c150 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -142,11 +142,9 @@ def _plasma_store_entry_point(): """ import os import pyarrow -import subprocess import sys plasma_store_executable = os.path.join(pyarrow.__path__[0], "plasma_store") -process = subprocess.Popen([plasma_store_executable] + sys.argv[1:]) -process.wait() +os.execv(plasma_store_executable, sys.argv) # -- # Deprecations This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > plasma_store process should cleanup on INT and TERM signals > --- > > Key: ARROW-2250 > URL: https://issues.apache.org/jira/browse/ARROW-2250 > Project: Apache Arrow > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Mitar >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Currently, if you send an INT and TERM signal to a parent plasma store > process (Python one) it terminates it without cleaning the child process. > This makes it hard to run plasma store in non-interactive mode. Inside shell > ctrl-c kills both processes. 
> Moreover, INT prints out an ugly KeyboardInterrupt exception. Probably > something nicer should be done. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
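The merged patch replaces the `subprocess.Popen(...).wait()` wrapper with `os.execv`, so the Python entry point becomes the plasma_store process itself and INT/TERM are delivered to it directly, with no orphaned child to clean up. For the complementary concern in the issue title (cleaning up gracefully on INT/TERM instead of printing a traceback), a hedged sketch of installing signal handlers in Python (illustrative; not the pyarrow code):

```python
import signal

cleanup_log = []  # stands in for real cleanup work


def _cleanup(signum, frame):
    # Release resources here (shared-memory segments, sockets, ...).
    cleanup_log.append(signal.Signals(signum).name)


def install_handlers():
    # Catch both signals mentioned in the issue so a non-interactive
    # store can shut down without an ugly KeyboardInterrupt traceback.
    signal.signal(signal.SIGINT, _cleanup)
    signal.signal(signal.SIGTERM, _cleanup)
```

Installing a SIGINT handler also replaces Python's default behavior of raising `KeyboardInterrupt`, which is exactly the "something nicer" the reporter asks for.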
[jira] [Updated] (ARROW-2269) [Python] Cannot build bdist_wheel for Python
[ https://issues.apache.org/jira/browse/ARROW-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2269: Component/s: Python > [Python] Cannot build bdist_wheel for Python > > > Key: ARROW-2269 > URL: https://issues.apache.org/jira/browse/ARROW-2269 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Mitar >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I am trying current master. > I ran: > > python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet > --with-plasma --bundle-arrow-cpp bdist_wheel > > Output: > > running build_ext > creating build > creating build/temp.linux-x86_64-3.6 > -- Runnning cmake for pyarrow > cmake -DPYTHON_EXECUTABLE=.../Temp/arrow/pyarrow/bin/python > -DPYARROW_BUILD_PARQUET=on -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUNDLE_ARROW_CPP=ON > -DCMAKE_BUILD_TYPE=release .../Temp/arrow/arrow/python > -- The C compiler identification is GNU 7.2.0 > -- The CXX compiler identification is GNU 7.2.0 > -- Check for working C compiler: /usr/bin/cc > -- Check for working C compiler: /usr/bin/cc -- works > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Detecting C compile features > -- Detecting C compile features - done > -- Check for working CXX compiler: /usr/bin/c++ > -- Check for working CXX compiler: /usr/bin/c++ -- works > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Detecting CXX compile features > -- Detecting CXX compile features - done > INFOCompiler command: /usr/bin/c++ > INFOCompiler version: Using built-in specs. 
> COLLECT_GCC=/usr/bin/c++ > COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper > OFFLOAD_TARGET_NAMES=nvptx-none > OFFLOAD_TARGET_DEFAULT=1 > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 7.2.0-8ubuntu3.2' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs > --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr > --with-gcc-major-version-only --program-suffix=-7 > --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id > --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix > --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu > --enable-libstdcxx-debug --enable-libstdcxx-time=yes > --with-default-libstdcxx-abi=new --enable-gnu-unique-object > --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie > --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto > --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 > --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic > --enable-offload-targets=nvptx-none --without-cuda-driver > --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu > --target=x86_64-linux-gnu > Thread model: posix > gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2) > INFOCompiler id: GNU > Selected compiler gcc 7.2.0 > -- Performing Test CXX_SUPPORTS_SSE3 > -- Performing Test CXX_SUPPORTS_SSE3 - Success > -- Performing Test CXX_SUPPORTS_ALTIVEC > -- Performing Test CXX_SUPPORTS_ALTIVEC - Failed > Configured for RELEASE build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: RELEASE > -- Build output directory: > .../Temp/arrow/arrow/python/build/temp.linux-x86_64-3.6/release/ > -- Found PythonInterp: .../Temp/arrow/pyarrow/bin/python (found version > "3.6.3") > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking 
for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PythonLibs: > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found NumPy: version "1.14.1" > .../Temp/arrow/pyarrow/lib/python3.6/site-packages/numpy/core/include > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") > -- Checking for module 'arrow' > -- Found arrow, version 0.9.0-SNAPSHOT > -- Arrow ABI version: 0.0.0 > -- Arrow SO version: 0 > -- Found the Arrow core library: .../Temp/arrow/dist/lib/libarrow.so > -- Found the Arrow Python library: .../Temp/arrow/dist/lib/libarrow_python.so > -- Boost version: 1.63.0 > -- Found the following Boost libraries: > -- system > -- filesy
[jira] [Updated] (ARROW-2269) [Python] Cannot build bdist_wheel for Python
[ https://issues.apache.org/jira/browse/ARROW-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2269: Summary: [Python] Cannot build bdist_wheel for Python (was: Cannot build bdist_wheel for Python) > [Python] Cannot build bdist_wheel for Python > > > Key: ARROW-2269 > URL: https://issues.apache.org/jira/browse/ARROW-2269 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging >Affects Versions: 0.9.0 >Reporter: Mitar >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I am trying current master. > I ran: > > python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet > --with-plasma --bundle-arrow-cpp bdist_wheel > > Output: > > running build_ext > creating build > creating build/temp.linux-x86_64-3.6 > -- Runnning cmake for pyarrow > cmake -DPYTHON_EXECUTABLE=.../Temp/arrow/pyarrow/bin/python > -DPYARROW_BUILD_PARQUET=on -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUNDLE_ARROW_CPP=ON > -DCMAKE_BUILD_TYPE=release .../Temp/arrow/arrow/python > -- The C compiler identification is GNU 7.2.0 > -- The CXX compiler identification is GNU 7.2.0 > -- Check for working C compiler: /usr/bin/cc > -- Check for working C compiler: /usr/bin/cc -- works > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Detecting C compile features > -- Detecting C compile features - done > -- Check for working CXX compiler: /usr/bin/c++ > -- Check for working CXX compiler: /usr/bin/c++ -- works > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Detecting CXX compile features > -- Detecting CXX compile features - done > INFOCompiler command: /usr/bin/c++ > INFOCompiler version: Using built-in specs. 
> COLLECT_GCC=/usr/bin/c++ > COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper > OFFLOAD_TARGET_NAMES=nvptx-none > OFFLOAD_TARGET_DEFAULT=1 > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 7.2.0-8ubuntu3.2' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs > --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr > --with-gcc-major-version-only --program-suffix=-7 > --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id > --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix > --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu > --enable-libstdcxx-debug --enable-libstdcxx-time=yes > --with-default-libstdcxx-abi=new --enable-gnu-unique-object > --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie > --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto > --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 > --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic > --enable-offload-targets=nvptx-none --without-cuda-driver > --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu > --target=x86_64-linux-gnu > Thread model: posix > gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2) > INFOCompiler id: GNU > Selected compiler gcc 7.2.0 > -- Performing Test CXX_SUPPORTS_SSE3 > -- Performing Test CXX_SUPPORTS_SSE3 - Success > -- Performing Test CXX_SUPPORTS_ALTIVEC > -- Performing Test CXX_SUPPORTS_ALTIVEC - Failed > Configured for RELEASE build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: RELEASE > -- Build output directory: > .../Temp/arrow/arrow/python/build/temp.linux-x86_64-3.6/release/ > -- Found PythonInterp: .../Temp/arrow/pyarrow/bin/python (found version > "3.6.3") > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking 
for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PythonLibs: > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found NumPy: version "1.14.1" > .../Temp/arrow/pyarrow/lib/python3.6/site-packages/numpy/core/include > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") > -- Checking for module 'arrow' > -- Found arrow, version 0.9.0-SNAPSHOT > -- Arrow ABI version: 0.0.0 > -- Arrow SO version: 0 > -- Found the Arrow core library: .../Temp/arrow/dist/lib/libarrow.so > -- Found the Arrow Python library: .../Temp/arrow/dist/lib/libarrow_python.so > -- Boost version: 1.63.
[jira] [Commented] (ARROW-2269) Cannot build bdist_wheel for Python
[ https://issues.apache.org/jira/browse/ARROW-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393004#comment-16393004 ] ASF GitHub Bot commented on ARROW-2269: --- wesm closed pull request #1718: ARROW-2269: [Python] Make boost namespace selectable in wheels URL: https://github.com/apache/arrow/pull/1718 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index d17194628..44a3c6c91 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -76,6 +76,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(PYARROW_BUNDLE_ARROW_CPP "Bundle the Arrow C++ libraries" OFF) + option(PYARROW_BUNDLE_BOOST +"Bundle the Boost libraries when we bundle Arrow C++" +ON) set(PYARROW_CXXFLAGS "" CACHE STRING "Compiler flags to append when compiling Arrow") endif() @@ -266,7 +269,7 @@ if (PYARROW_BUNDLE_ARROW_CPP) SO_VERSION ${ARROW_SO_VERSION}) # boost - if (PYARROW_BOOST_USE_SHARED) + if (PYARROW_BOOST_USE_SHARED AND PYARROW_BUNDLE_BOOST) set(Boost_USE_STATIC_LIBS OFF) set(Boost_USE_MULTITHREADED ON) if (MSVC AND ARROW_USE_STATIC_CRT) diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index f83c75972..5df55a65c 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -40,6 +40,8 @@ export PYARROW_BUILD_TYPE='release' export PYARROW_WITH_PARQUET=1 export PYARROW_WITH_PLASMA=1 export PYARROW_BUNDLE_ARROW_CPP=1 +export PYARROW_BUNDLE_BOOST=1 +export PYARROW_BOOST_NAMESPACE=arrow_boost export PKG_CONFIG_PATH=/arrow-dist/lib64/pkgconfig export PYARROW_CMAKE_OPTIONS='-DTHRIFT_HOME=/usr -DBoost_NAMESPACE=arrow_boost -DBOOST_ROOT=/arrow_boost_dist' # Ensure the target directory exists @@ 
-66,7 +68,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do # Clear output directory rm -rf dist/ echo "=== (${PYTHON}) Building wheel ===" -PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py build_ext --inplace --with-parquet --bundle-arrow-cpp +PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py build_ext --inplace --with-parquet --bundle-arrow-cpp --bundle-boost --boost-namespace=arrow_boost PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py bdist_wheel echo "=== (${PYTHON}) Test the existence of optional modules ===" diff --git a/python/setup.py b/python/setup.py index f3521f277..6f0b0fa4d 100644 --- a/python/setup.py +++ b/python/setup.py @@ -94,11 +94,15 @@ def run(self): description = "Build the C-extensions for arrow" user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), ('build-type=', None, 'build type (debug or release)'), + ('boost-namespace=', None, + 'namespace of boost (default: boost)'), ('with-parquet', None, 'build the Parquet extension'), ('with-static-parquet', None, 'link parquet statically'), ('with-static-boost', None, 'link boost statically'), ('with-plasma', None, 'build the Plasma extension'), ('with-orc', None, 'build the ORC extension'), + ('bundle-boost', None, + 'bundle the (shared) Boost libraries'), ('bundle-arrow-cpp', None, 'bundle the Arrow C++ libraries')] + _build_ext.user_options) @@ -107,6 +111,8 @@ def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() +self.boost_namespace = os.environ.get('PYARROW_BOOST_NAMESPACE', + 'boost') self.cmake_cxxflags = os.environ.get('PYARROW_CXXFLAGS', '') @@ -128,6 +134,10 @@ def initialize_options(self): os.environ.get('PYARROW_WITH_ORC', '0')) self.bundle_arrow_cpp = strtobool( os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) +# Default is True but this only is actually bundled when +# we also 
bundle arrow-cpp. +self.bundle_boost = strtobool( +os.environ.get('PYARROW_BUNDLE_BOOST', '1')) CYTHON_MODULE_NAMES = [ 'lib', @@ -186,15 +196,20 @@ def _run_cmake(self): if self.bundle_arrow_cpp: cmake_options.append('-DPYARROW_BUNDLE_ARROW_CPP=ON') +cmake_options.append('-DPYARROW_BUNDLE_BOOST=ON') #
[jira] [Resolved] (ARROW-2269) Cannot build bdist_wheel for Python
[ https://issues.apache.org/jira/browse/ARROW-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2269. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1718 [https://github.com/apache/arrow/pull/1718] > Cannot build bdist_wheel for Python > --- > > Key: ARROW-2269 > URL: https://issues.apache.org/jira/browse/ARROW-2269 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging >Affects Versions: 0.9.0 >Reporter: Mitar >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > I am trying current master. > I ran: > > python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet > --with-plasma --bundle-arrow-cpp bdist_wheel > > Output: > > running build_ext > creating build > creating build/temp.linux-x86_64-3.6 > -- Runnning cmake for pyarrow > cmake -DPYTHON_EXECUTABLE=.../Temp/arrow/pyarrow/bin/python > -DPYARROW_BUILD_PARQUET=on -DPYARROW_BOOST_USE_SHARED=on > -DPYARROW_BUILD_PLASMA=on -DPYARROW_BUNDLE_ARROW_CPP=ON > -DCMAKE_BUILD_TYPE=release .../Temp/arrow/arrow/python > -- The C compiler identification is GNU 7.2.0 > -- The CXX compiler identification is GNU 7.2.0 > -- Check for working C compiler: /usr/bin/cc > -- Check for working C compiler: /usr/bin/cc -- works > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Detecting C compile features > -- Detecting C compile features - done > -- Check for working CXX compiler: /usr/bin/c++ > -- Check for working CXX compiler: /usr/bin/c++ -- works > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Detecting CXX compile features > -- Detecting CXX compile features - done > INFOCompiler command: /usr/bin/c++ > INFOCompiler version: Using built-in specs. 
> COLLECT_GCC=/usr/bin/c++ > COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper > OFFLOAD_TARGET_NAMES=nvptx-none > OFFLOAD_TARGET_DEFAULT=1 > Target: x86_64-linux-gnu > Configured with: ../src/configure -v --with-pkgversion='Ubuntu > 7.2.0-8ubuntu3.2' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs > --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr > --with-gcc-major-version-only --program-suffix=-7 > --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id > --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix > --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu > --enable-libstdcxx-debug --enable-libstdcxx-time=yes > --with-default-libstdcxx-abi=new --enable-gnu-unique-object > --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie > --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto > --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 > --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic > --enable-offload-targets=nvptx-none --without-cuda-driver > --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu > --target=x86_64-linux-gnu > Thread model: posix > gcc version 7.2.0 (Ubuntu 7.2.0-8ubuntu3.2) > INFOCompiler id: GNU > Selected compiler gcc 7.2.0 > -- Performing Test CXX_SUPPORTS_SSE3 > -- Performing Test CXX_SUPPORTS_SSE3 - Success > -- Performing Test CXX_SUPPORTS_ALTIVEC > -- Performing Test CXX_SUPPORTS_ALTIVEC - Failed > Configured for RELEASE build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: RELEASE > -- Build output directory: > .../Temp/arrow/arrow/python/build/temp.linux-x86_64-3.6/release/ > -- Found PythonInterp: .../Temp/arrow/pyarrow/bin/python (found version > "3.6.3") > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking 
for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PythonLibs: > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found NumPy: version "1.14.1" > .../Temp/arrow/pyarrow/lib/python3.6/site-packages/numpy/core/include > -- Searching for Python libs in > .../Temp/arrow/pyarrow/lib64;.../Temp/arrow/pyarrow/lib;/usr/lib/python3.6/config-3.6m-x86_64-linux-gnu > -- Looking for python3.6m > -- Found Python lib > /usr/lib/python3.6/config-3.6m-x86_64-linux-gnu/libpython3.6m.so > -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") > -- Checking for module 'arrow' > -- Found arrow, version 0.9.0-SNAPSHOT > -- Arrow ABI version: 0.0.0 > -- Arrow SO version: 0 > -- Found the Arrow core library: .../Temp/arrow/dist/lib/libarrow.so > -- Found the Arrow Python library: .../Temp/arrow/dist/lib/libarrow_python.so > -- Boost ve
[jira] [Commented] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392998#comment-16392998 ] ASF GitHub Bot commented on ARROW-2268: --- wesm closed pull request #1731: ARROW-2268: Drop usage of md5 checksums for source releases, verification scripts URL: https://github.com/apache/arrow/pull/1731 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh index 62478131d..fa1c3e3ca 100755 --- a/dev/release/02-source.sh +++ b/dev/release/02-source.sh @@ -97,7 +97,6 @@ ${SOURCE_DIR}/run-rat.sh ${tarball} # sign the archive gpg --armor --output ${tarball}.asc --detach-sig ${tarball} -gpg --print-md MD5 ${tarball} > ${tarball}.md5 sha1sum $tarball > ${tarball}.sha1 sha256sum $tarball > ${tarball}.sha256 sha512sum $tarball > ${tarball}.sha512 diff --git a/dev/release/js-source-release.sh b/dev/release/js-source-release.sh index bf32acd05..53b31af62 100755 --- a/dev/release/js-source-release.sh +++ b/dev/release/js-source-release.sh @@ -78,7 +78,6 @@ ${SOURCE_DIR}/run-rat.sh ${tarball} # sign the archive gpg --armor --output ${tarball}.asc --detach-sig ${tarball} -gpg --print-md MD5 ${tarball} > ${tarball}.md5 sha1sum $tarball > ${tarball}.sha1 sha256sum $tarball > ${tarball}.sha256 sha512sum $tarball > ${tarball}.sha512 diff --git a/dev/release/js-verify-release-candidate.sh b/dev/release/js-verify-release-candidate.sh index 5a37e10f7..039c94dec 100755 --- a/dev/release/js-verify-release-candidate.sh +++ b/dev/release/js-verify-release-candidate.sh @@ -54,13 +54,14 @@ fetch_archive() { local dist_name=$1 download_rc_file ${dist_name}.tar.gz download_rc_file ${dist_name}.tar.gz.asc - download_rc_file ${dist_name}.tar.gz.md5 + 
download_rc_file ${dist_name}.tar.gz.sha1 download_rc_file ${dist_name}.tar.gz.sha512 gpg --verify ${dist_name}.tar.gz.asc ${dist_name}.tar.gz - gpg --print-md MD5 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.md5 if [ "$(uname)" == "Darwin" ]; then +shasum -a 1 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha1 shasum -a 512 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha512 else +sha1sum ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha1 sha512sum ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha512 fi } diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh index f33211e26..cb9b01b37 100755 --- a/dev/release/verify-release-candidate.sh +++ b/dev/release/verify-release-candidate.sh @@ -62,13 +62,14 @@ fetch_archive() { local dist_name=$1 download_rc_file ${dist_name}.tar.gz download_rc_file ${dist_name}.tar.gz.asc - download_rc_file ${dist_name}.tar.gz.md5 + download_rc_file ${dist_name}.tar.gz.sha1 download_rc_file ${dist_name}.tar.gz.sha512 gpg --verify ${dist_name}.tar.gz.asc ${dist_name}.tar.gz - gpg --print-md MD5 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.md5 if [ "$(uname)" == "Darwin" ]; then +shasum -a 1 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha1 shasum -a 512 ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha512 else +sha1sum ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha1 sha512sum ${dist_name}.tar.gz | diff - ${dist_name}.tar.gz.sha512 fi } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392996#comment-16392996 ] ASF GitHub Bot commented on ARROW-2268: --- wesm commented on issue #1731: ARROW-2268: Drop usage of md5 checksums for source releases, verification scripts URL: https://github.com/apache/arrow/pull/1731#issuecomment-371844496 +1 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2268. - Resolution: Fixed Issue resolved by pull request 1731 [https://github.com/apache/arrow/pull/1731] > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
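The shell scripts in the PR above replace `gpg --print-md MD5` with `sha1sum`/`sha256sum`/`sha512sum`. For readers verifying a release from Python instead of the shell, a minimal sketch of the same streaming-checksum computation (the function name `checksum_file` is illustrative, not part of any Arrow tooling):

```python
import hashlib

def checksum_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Stream a file through the given hash, as sha256sum/sha512sum do,
    without loading the whole tarball into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing the returned hex digest against the published `.sha256`/`.sha512` sidecar file gives the same check the updated `verify-release-candidate.sh` performs with `diff`.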
[jira] [Commented] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392994#comment-16392994 ] ASF GitHub Bot commented on ARROW-2275: --- wesm closed pull request #1717: ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data() URL: https://github.com/apache/arrow/pull/1717 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 29e2c242a..e32e02c9f 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -70,6 +70,14 @@ Status Buffer::FromString(const std::string& data, std::shared_ptr* out) return FromString(data, default_memory_pool(), out); } +#ifndef NDEBUG +// DCHECK macros aren't allowed in public include files +uint8_t* Buffer::mutable_data() { + DCHECK(is_mutable()); + return mutable_data_; +} +#endif + PoolBuffer::PoolBuffer(MemoryPool* pool) : ResizableBuffer(nullptr, 0) { if (pool == nullptr) { pool = default_memory_pool(); diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index cf25ccd03..ad11ff943 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -54,7 +54,11 @@ class ARROW_EXPORT Buffer { /// /// \note The passed memory must be kept alive through some other means Buffer(const uint8_t* data, int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), +data_(data), +mutable_data_(NULLPTR), +size_(size), +capacity_(size) {} /// \brief Construct from std::string without copying memory /// @@ -113,7 +117,11 @@ class ARROW_EXPORT Buffer { int64_t capacity() const { return capacity_; } const uint8_t* data() const { return data_; } +#ifdef NDEBUG uint8_t* mutable_data() { return mutable_data_; } +#else + uint8_t* 
mutable_data(); +#endif int64_t size() const { return size_; } diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 4e4c6b8d5..699dc0393 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -71,7 +71,7 @@ class ARROW_EXPORT Tensor { std::shared_ptr data() const { return data_; } const uint8_t* raw_data() const { return data_->data(); } - uint8_t* raw_data() { return data_->mutable_data(); } + uint8_t* raw_mutable_data() { return data_->mutable_data(); } const std::vector& shape() const { return shape_; } const std::vector& strides() const { return strides_; } This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2275. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1717 [https://github.com/apache/arrow/pull/1717] > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
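The C++ fix above guards `mutable_data()` with a `DCHECK` that is compiled out in release (`NDEBUG`) builds. A rough Python analogue of the invariant — a toy class, not the pyarrow API — uses a plain `assert`, which is similarly stripped under `python -O`:

```python
class Buffer:
    """Toy illustration of the invariant ARROW-2275 enforces:
    an immutable buffer must never hand out a writable view."""

    def __init__(self, data, mutable=False):
        # Immutable buffers store bytes; mutable ones a bytearray.
        self._data = bytearray(data) if mutable else bytes(data)
        self._mutable = mutable

    def is_mutable(self):
        return self._mutable

    def data(self):
        return bytes(self._data)

    def mutable_data(self):
        # Mirrors DCHECK(is_mutable()): fail loudly in debug runs
        # instead of returning a bogus writable reference.
        assert self._mutable, "buffer is not mutable"
        return self._data
```

As with `DCHECK`, the check costs nothing in optimized runs but turns silent misuse into an immediate failure during development.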
[jira] [Commented] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392992#comment-16392992 ] ASF GitHub Bot commented on ARROW-2275: --- wesm commented on a change in pull request #1717: ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data() URL: https://github.com/apache/arrow/pull/1717#discussion_r173478919 ## File path: cpp/src/arrow/buffer.h ## @@ -54,7 +54,11 @@ class ARROW_EXPORT Buffer { /// /// \note The passed memory must be kept alive through some other means Buffer(const uint8_t* data, int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), +data_(data), +mutable_data_(nullptr), Review comment: We need to use the NULLPTR macro here This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2275) [C++] Buffer::mutable_data_ member uninitialized
[ https://issues.apache.org/jira/browse/ARROW-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392991#comment-16392991 ] ASF GitHub Bot commented on ARROW-2275: --- wesm commented on a change in pull request #1717: ARROW-2275: [C++] Guard against bad use of Buffer.mutable_data() URL: https://github.com/apache/arrow/pull/1717#discussion_r173478791 ## File path: cpp/src/arrow/buffer.cc ## @@ -70,6 +70,14 @@ Status Buffer::FromString(const std::string& data, std::shared_ptr* out) return FromString(data, default_memory_pool(), out); } +#ifndef NDEBUG +// DCHECK macros aren't allowed in public include files +uint8_t* Buffer::mutable_data() { + DCHECK(is_mutable()); + return mutable_data_; +} +#endif Review comment: This is a good idea, thank you This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Buffer::mutable_data_ member uninitialized > > > Key: ARROW-2275 > URL: https://issues.apache.org/jira/browse/ARROW-2275 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > > For immutable buffers (i.e. most of them), the {{mutable_data_}} member is > uninitialized. If the user calls {{mutable_data()}} by mistake on such a > buffer, they will get a bogus pointer back. > This is exacerbated by the Tensor API whose const and non-const > {{raw_data()}} methods return different things... > (also an idea: add a DCHECK for mutability before returning from > {{mutable_data()}}?) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2284) [Python] test_plasma error on plasma_store error
[ https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392989#comment-16392989 ] ASF GitHub Bot commented on ARROW-2284: --- wesm closed pull request #1724: ARROW-2284: [Python] Fix error display on test_plasma error URL: https://github.com/apache/arrow/pull/1724 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/pyarrow/tests/test_plasma.py b/python/pyarrow/tests/test_plasma.py index b4e864941..1df213dec 100644 --- a/python/pyarrow/tests/test_plasma.py +++ b/python/pyarrow/tests/test_plasma.py @@ -165,10 +165,8 @@ def start_plasma_store(plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY, time.sleep(0.1) rc = proc.poll() if rc is not None: -err = proc.stderr.read().decode() raise RuntimeError("plasma_store exited unexpectedly with " - "code %d. Error output follows:\n%s\n" - % (rc, err)) + "code %d" % (rc,)) yield plasma_store_name, proc finally: This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] test_plasma error on plasma_store error > > > Key: ARROW-2284 > URL: https://issues.apache.org/jira/browse/ARROW-2284 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > > This appears caused by my latest changes: > {code:python} > Traceback (most recent call last): > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, > in setup_method > plasma_store_name, self.p = self.plasma_store_ctx.__enter__() > File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", > line 81, in __enter__ > return next(self.gen) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, > in start_plasma_store > err = proc.stderr.read().decode() > AttributeError: 'NoneType' object has no attribute 'read' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2284) [Python] test_plasma error on plasma_store error
[ https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2284. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1724 [https://github.com/apache/arrow/pull/1724] > [Python] test_plasma error on plasma_store error > > > Key: ARROW-2284 > URL: https://issues.apache.org/jira/browse/ARROW-2284 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Trivial > Labels: pull-request-available > Fix For: 0.9.0 > > > This appears caused by my latest changes: > {code:python} > Traceback (most recent call last): > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, > in setup_method > plasma_store_name, self.p = self.plasma_store_ctx.__enter__() > File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", > line 81, in __enter__ > return next(self.gen) > File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, > in start_plasma_store > err = proc.stderr.read().decode() > AttributeError: 'NoneType' object has no attribute 'read' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
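The bug arose because the test fixture tried to read a `stderr` pipe it had never opened. A self-contained sketch of the corrected pattern — a generic `start_service` helper, not the actual `start_plasma_store` fixture — launches a process, fails fast if it dies during startup, and (as in the fix) reports only the exit code since stderr is not captured:

```python
import subprocess
import time
from contextlib import contextmanager

@contextmanager
def start_service(argv, startup_wait=0.1):
    """Launch a helper process and fail fast if it exits during startup."""
    proc = subprocess.Popen(argv)
    try:
        time.sleep(startup_wait)
        rc = proc.poll()
        if rc is not None:
            # stderr was not redirected, so there is nothing to read from it;
            # report the exit code only (the ARROW-2284 fix).
            raise RuntimeError("process exited unexpectedly with code %d" % rc)
        yield proc
    finally:
        if proc.poll() is None:
            proc.terminate()
        proc.wait()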
[jira] [Commented] (ARROW-2150) [Python] array equality defaults to identity
[ https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392984#comment-16392984 ] ASF GitHub Bot commented on ARROW-2150: --- wesm closed pull request #1729: ARROW-2150: [Python] Raise NotImplementedError when comparing with pyarrow.Array for now URL: https://github.com/apache/arrow/pull/1729 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi index e785c0ec5..f05806cfa 100644 --- a/python/pyarrow/array.pxi +++ b/python/pyarrow/array.pxi @@ -267,6 +267,10 @@ cdef class Array: self.ap = sp_array.get() self.type = pyarrow_wrap_data_type(self.sp_array.get().type()) +def __richcmp__(Array self, object other, int op): +raise NotImplementedError('Comparisons with pyarrow.Array are not ' + 'implemented') + def _debug_print(self): with nogil: check_status(DebugPrint(deref(self.ap), 0)) diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index f034d78b3..4c14c1c61 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -158,6 +158,16 @@ def test_array_ref_to_ndarray_base(): assert sys.getrefcount(arr) == (refcount + 1) +def test_array_eq_raises(): +# ARROW-2150: we are raising when comparing arrays until we define the +# behavior to either be elementwise comparisons or data equality +arr1 = pa.array([1, 2, 3], type=pa.int32()) +arr2 = pa.array([1, 2, 3], type=pa.int32()) + +with pytest.raises(NotImplementedError): +arr1 == arr2 + + def test_dictionary_from_numpy(): indices = np.repeat([0, 1, 2], 2) dictionary = np.array(['foo', 'bar', 'baz'], dtype=object) This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] array equality defaults to identity > > > Key: ARROW-2150 > URL: https://issues.apache.org/jira/browse/ARROW-2150 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure this is deliberate, but it doesn't look very desirable to me: > {code} > >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) > False > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2150) [Python] array equality defaults to identity
[ https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2150. - Resolution: Fixed Issue resolved by pull request 1729 [https://github.com/apache/arrow/pull/1729] > [Python] array equality defaults to identity > > > Key: ARROW-2150 > URL: https://issues.apache.org/jira/browse/ARROW-2150 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure this is deliberate, but it doesn't look very desirable to me: > {code} > >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) > False > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
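The interim fix raises from the comparison hook rather than silently inheriting identity-based equality. A standalone Python sketch of the same design choice (a toy class, not `pyarrow.Array`):

```python
class ArrayLike:
    """Sketch of the ARROW-2150 approach: refuse comparison outright
    until elementwise-vs-whole-array equality semantics are decided."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Raising (rather than returning the NotImplemented singleton,
        # which would fall back to identity comparison and yield False
        # for equal-content arrays) makes the missing semantics explicit.
        raise NotImplementedError(
            "Comparisons with ArrayLike are not implemented")

    __ne__ = __eq__
```

This leaves room to later choose either NumPy-style elementwise comparison for `==` or a dedicated `equals()` method for whole-array data equality, without users having relied on the misleading default in the meantime.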
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392841#comment-16392841 ] ASF GitHub Bot commented on ARROW-1974: --- cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173441381 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: I meant that before your change, the index returned was always valid? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. 
> {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392687#comment-16392687 ] ASF GitHub Bot commented on ARROW-1974: --- pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173416072 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: As a side note, if @wesm wants to include this in the release, we can defer API improvements to a later PR. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. 
> {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392653#comment-16392653 ]

ASF GitHub Bot commented on ARROW-1974:
---

pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173404125

## File path: src/parquet/schema.h
@@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node {
   bool Equals(const Node* other) const override;
   NodePtr field(int i) const { return fields_[i]; }
 + // Get the index of a field by its name, or negative value if not found
 + // If several fields share the same name, the smallest index is returned

Review comment: Why? We're still returning a valid index.
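The lookup semantics documented in the hunk under review (smallest index wins for duplicated names, a negative value when the name is absent) can be modeled in a few lines of Python. This is an illustrative stand-in for the behavior being discussed, not the parquet-cpp implementation:

```python
def field_index(field_names, name):
    # Smallest index of `name` in `field_names`, or -1 if not found.
    for i, candidate in enumerate(field_names):
        if candidate == name:
            return i
    return -1
```

With duplicated names such as `['a', 'b', 'a']`, `field_index(..., 'a')` returns 0, which is still a valid index into the field list — the point made in the "Why?" reply.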
[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer
[ https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392652#comment-16392652 ]

Antoine Pitrou commented on ARROW-2292:
---

{{py_buffer}} sounds reasonable to me (it's in line with {{to_pylist}}, {{to_pybytes}}, etc.).

> [Python] More consistent / intuitive name for pyarrow.frombuffer
> ---
>
> Key: ARROW-2292
> URL: https://issues.apache.org/jira/browse/ARROW-2292
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.9.0
>
> Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could
> call {{frombuffer}} something like {{py_buffer}} instead?
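The kind of operation being renamed — wrapping an existing Python buffer without copying it — can be illustrated with the standard library's {{memoryview}}, which does the same zero-copy wrapping of a bytes-like object. This is a stdlib analogy only, not pyarrow code:

```python
data = b"arrow"
view = memoryview(data)   # zero-copy view over a Python bytes-like object

# The view exposes size and contents without duplicating the bytes.
assert view.nbytes == 5
assert view.tobytes() == b"arrow"
```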