[jira] [Assigned] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou reassigned ARROW-2454:
-------------------------------------

    Assignee: Antoine Pitrou

> [Python] Empty chunked array slice crashes
> ------------------------------------------
>
>                 Key: ARROW-2454
>                 URL: https://issues.apache.org/jira/browse/ARROW-2454
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Antoine Pitrou
>            Assignee: Antoine Pitrou
>            Priority: Major
>
> {code:python}
> >>> col = pa.Column.from_array('ints', pa.array([1,2,3]))
> >>> col
>
> chunk 0:
> [
>   1,
>   2,
>   3
> ]
> >>> col.data
>
> >>> col.data[:1]
>
> >>> col.data[:0]
> Segmentation fault (core dumped)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2422) Support more filter operators on Hive partitioned Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439186#comment-16439186 ] ASF GitHub Bot commented on ARROW-2422: --- jneuff commented on a change in pull request #1861: ARROW-2422 Support more operators for partition filtering URL: https://github.com/apache/arrow/pull/1861#discussion_r181675580 ## File path: python/pyarrow/tests/test_parquet.py ## @@ -997,40 +997,159 @@ def test_read_partitioned_directory(tmpdir): @parquet -def test_read_partitioned_directory_filtered(tmpdir): -fs = LocalFileSystem.get_instance() -base_path = str(tmpdir) - -import pyarrow.parquet as pq - -foo_keys = [0, 1] -bar_keys = ['a', 'b', 'c'] -partition_spec = [ -['foo', foo_keys], -['bar', bar_keys] -] -N = 30 - -df = pd.DataFrame({ -'index': np.arange(N), -'foo': np.array(foo_keys, dtype='i4').repeat(15), -'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), -'values': np.random.randn(N) -}, columns=['index', 'foo', 'bar', 'values']) - -_generate_partition_directories(fs, base_path, partition_spec, df) - -dataset = pq.ParquetDataset( -base_path, filesystem=fs, -filters=[('foo', '=', 1), ('bar', '!=', 'b')] +class TestParquetFilter: + +def test_equivalency(tmpdir): +fs = LocalFileSystem.get_instance() +base_path = str(tmpdir) + +import pyarrow.parquet as pq + +integer_keys = [0, 1] +string_keys = ['a', 'b', 'c'] +boolean_keys = [True, False] +partition_spec = [ +['integer', integer_keys], +['string', string_keys], +['boolean', boolean_keys] +] +N = 30 + +df = pd.DataFrame({ +'index': np.arange(N), +'integer': np.array(integer_keys, dtype='i4').repeat(15), +'string': np.tile(np.tile(np.array(string_keys, dtype=object), 5), 2), +'boolean': np.tile(np.tile(np.array(boolean_keys, dtype='bool'), 5), 3), +}, columns=['index', 'integer', 'string', 'boolean']) + +_generate_partition_directories(fs, base_path, partition_spec, df) + +dataset = pq.ParquetDataset( +base_path, filesystem=fs, 
+filters=[('integer', '=', 1), ('string', '!=', 'b'), ('boolean', '==', True)] +) +table = dataset.read() +result_df = (table.to_pandas() + .sort_values(by='index') + .reset_index(drop=True)) + +assert 0 not in result_df['integer'].values +assert 'b' not in result_df['string'].values +assert False not in result_df['boolean'].values + +def test_cutoff_exclusive_integer(tmpdir): +fs = LocalFileSystem.get_instance() +base_path = str(tmpdir) + +import pyarrow.parquet as pq + +integer_keys = [0, 1, 2, 3, 4] +partition_spec = [ +['integers', integer_keys], +] +N = 5 + +df = pd.DataFrame({ +'index': np.arange(N), +'integers': np.array(integer_keys, dtype='i4'), +}, columns=['index', 'integers']) + +_generate_partition_directories(fs, base_path, partition_spec, df) + +dataset = pq.ParquetDataset( +base_path, filesystem=fs, +filters=[ +('integers', '<', 4), +('integers', '>', 1), +] +) +table = dataset.read() +result_df = (table.to_pandas() + .sort_values(by='index') + .reset_index(drop=True)) + +result_list = [x for x in map(int, result_df['integers'].values)] +assert result_list == [2, 3] + +@pytest.mark.xfail( +raises=TypeError, reason='We suspect loss of type information in creation of categoricals.' ) Review comment: @xhochy This is the behavior we just told you about offline. `result_df['dates'].values` seems to be of type `object` instead of `datetime64`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support more filter operators on Hive partitioned Parquet files > --- > > Key: ARROW-2422 > URL: https://issues.apache.org/jira/browse/ARROW-2422 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Julius Neuffer >Priority: Minor > Labels: features, pull-request-available > > After implementing basic filters ('=', '!=') on Hive partitioned Parquet > f
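The filter tuples exercised by these tests (e.g. `('integer', '=', 1)`) are AND-ed together and compared against each Hive partition's key values. As an illustration of that semantics only — not pyarrow's actual implementation, and with hypothetical names `keep_partition` and `_OPS` — the evaluation can be sketched as:

```python
import operator

# Map the comparison strings used in filter tuples to Python operators.
# '=' and '==' are treated as synonyms, mirroring the tests above.
# (Illustrative sketch, not pyarrow's code.)
_OPS = {
    '=': operator.eq,
    '==': operator.eq,
    '!=': operator.ne,
    '<': operator.lt,
    '>': operator.gt,
    '<=': operator.le,
    '>=': operator.ge,
}

def keep_partition(partition_keys, filters):
    """Return True if a partition with the given key values satisfies
    every (column, op, value) filter tuple (filters are AND-ed)."""
    return all(_OPS[op](partition_keys[col], val) for col, op, val in filters)

# A partition like .../integer=1/string=a/boolean=True/...
part = {'integer': 1, 'string': 'a', 'boolean': True}
print(keep_partition(part, [('integer', '=', 1),
                            ('string', '!=', 'b'),
                            ('boolean', '==', True)]))  # True
```

With this model, the `test_cutoff_exclusive_integer` case above keeps exactly the partitions whose key satisfies both `< 4` and `> 1`.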
[jira] [Commented] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439211#comment-16439211 ]

ASF GitHub Bot commented on ARROW-2454:
---------------------------------------

pitrou opened a new pull request #1897: ARROW-2454: [C++] Allow zero-array chunked arrays
URL: https://github.com/apache/arrow/pull/1897

    This allows code to be more regular and less fragile. Also fix the chunked array slicing logic.
[jira] [Updated] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2454:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439224#comment-16439224 ] ASF GitHub Bot commented on ARROW-2101: --- pitrou closed pull request #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc b/cpp/src/arrow/python/numpy_to_arrow.cc index e37013c7e..dcb96a48a 100644 --- a/cpp/src/arrow/python/numpy_to_arrow.cc +++ b/cpp/src/arrow/python/numpy_to_arrow.cc @@ -228,12 +228,15 @@ static Status AppendObjectBinaries(PyArrayObject* arr, PyArrayObject* mask, /// can fit /// /// \param[in] offset starting offset for appending +/// \param[in] check_valid if set to true and the input array +/// contains values that cannot be converted to unicode, returns +/// a Status code containing a Python exception message /// \param[out] end_offset ending offset where we stopped appending. 
Will /// be length of arr if fully consumed /// \param[out] have_bytes true if we encountered any PyBytes object static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64_t offset, - StringBuilder* builder, int64_t* end_offset, - bool* have_bytes) { + bool check_valid, StringBuilder* builder, + int64_t* end_offset, bool* have_bytes) { PyObject* obj; Ndarray1DIndexer objects(arr); @@ -256,8 +259,7 @@ static Status AppendObjectStrings(PyArrayObject* arr, PyArrayObject* mask, int64 *have_bytes = true; } bool is_full; -RETURN_NOT_OK( -internal::BuilderAppend(builder, obj, false /* check_valid */, &is_full)); +RETURN_NOT_OK(internal::BuilderAppend(builder, obj, check_valid, &is_full)); if (is_full) { break; } @@ -844,6 +846,13 @@ Status NumPyConverter::ConvertObjectStrings() { StringBuilder builder(pool_); RETURN_NOT_OK(builder.Resize(length_)); + // If the creator of this NumPyConverter specified a type, + // then we want to force the output type to be utf8. If + // the input data is PyBytes and not PyUnicode and + // not convertible to utf8, the call to AppendObjectStrings + // below will fail because we pass force_string as the + // value for check_valid. 
+ bool force_string = type_ != nullptr && type_->Equals(utf8()); bool global_have_bytes = false; if (length_ == 0) { // Produce an empty chunk @@ -854,8 +863,10 @@ Status NumPyConverter::ConvertObjectStrings() { int64_t offset = 0; while (offset < length_) { bool chunk_have_bytes = false; - RETURN_NOT_OK( - AppendObjectStrings(arr_, mask_, offset, &builder, &offset, &chunk_have_bytes)); + // Always set check_valid to true when force_string is true + RETURN_NOT_OK(AppendObjectStrings(arr_, mask_, offset, +force_string /* check_valid */, &builder, &offset, +&chunk_have_bytes)); global_have_bytes = global_have_bytes | chunk_have_bytes; std::shared_ptr chunk; @@ -864,8 +875,13 @@ Status NumPyConverter::ConvertObjectStrings() { } } - // If we saw PyBytes, convert everything to BinaryArray - if (global_have_bytes) { + // If we saw bytes, convert it to a binary array. If + // force_string was set to true, the input data could + // have been bytes but we've checked to make sure that + // it can be converted to utf-8 in the call to + // AppendObjectStrings. In that case, we can safely leave + // it as a utf8 type. + if (!force_string && global_have_bytes) { for (size_t i = 0; i < out_arrays_.size(); ++i) { auto binary_data = out_arrays_[i]->data()->Copy(); binary_data->type = ::arrow::binary(); @@ -1393,8 +1409,12 @@ inline Status NumPyConverter::ConvertTypedLists( RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); int64_t offset = 0; - RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, value_builder, &offset, -&have_bytes)); + // If a type was specified and it was utf8, then we set + // check_valid to true. If any of the input cannot be + // converted, then we will exit early here. + bool check_valid = type_ != nullptr && type_->Equals(::arrow::utf8()); + RETURN_NOT_OK(AppendObjectStrings(numpy_array, nullptr, 0, check_valid, +value_builder,
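The patch threads a `check_valid` flag through `AppendObjectStrings`: when the caller requested an Arrow `utf8()` type, any `bytes` input must decode as UTF-8 or the conversion errors out, and the converter records whether bytes were seen at all. A rough Python analogue of that control flow (a sketch with a hypothetical `append_object_strings` helper, not the C++ code):

```python
def append_object_strings(values, check_valid):
    """Sketch of the AppendObjectStrings logic from the patch: collect
    str/bytes values, note whether any bytes were seen, and - when
    check_valid is set - reject bytes that are not valid UTF-8."""
    out, have_bytes = [], False
    for obj in values:
        if isinstance(obj, bytes):
            have_bytes = True
            if check_valid:
                # Mirrors BuilderAppend(..., check_valid=true): invalid
                # UTF-8 input turns into an error.
                obj.decode('utf-8')  # raises UnicodeDecodeError if invalid
        out.append(obj)
    return out, have_bytes

# With check_valid=False (no type requested), raw bytes pass through and
# the result would later be re-tagged as binary rather than string.
_, saw_bytes = append_object_strings([b'a', 'b'], check_valid=False)
print(saw_bytes)  # True
```

When `force_string` is true, the bytes were already validated as UTF-8 during the append, so the output can safely stay typed as `utf8` instead of being downgraded to binary.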
[jira] [Resolved] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2101.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1886
[https://github.com/apache/arrow/pull/1886]

> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2101
>                 URL: https://issues.apache.org/jira/browse/ARROW-2101
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>            Reporter: Bryan Cutler
>            Assignee: Bryan Cutler
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
> Using Python 2, converting Pandas data of 'str' type to Arrow results in Arrow
> data of binary type, even if the user supplies type information. Conversion of
> 'unicode' type correctly creates Arrow data of string type. For example:
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439227#comment-16439227 ]

ASF GitHub Bot commented on ARROW-2101:
---------------------------------------

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381550267

    Thank you @joshuastorck !
[jira] [Assigned] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn reassigned ARROW-2463: -- Assignee: Uwe L. Korn > [C++] Update flatbuffers to 1.9.0 > - > > Key: ARROW-2463 > URL: https://issues.apache.org/jira/browse/ARROW-2463 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.10.0 > > > This will update externalproject and manylinux1 installations of Flatbuffers. > The conda-forge update is at > https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
Uwe L. Korn created ARROW-2463: -- Summary: [C++] Update flatbuffers to 1.9.0 Key: ARROW-2463 URL: https://issues.apache.org/jira/browse/ARROW-2463 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Uwe L. Korn Fix For: 0.10.0 This will update externalproject and manylinux1 installations of Flatbuffers. The conda-forge update is at https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2464) [Python] Use a python_version marker instead of a condition
Omer Katz created ARROW-2464: Summary: [Python] Use a python_version marker instead of a condition Key: ARROW-2464 URL: https://issues.apache.org/jira/browse/ARROW-2464 Project: Apache Arrow Issue Type: Task Components: Packaging, Python Affects Versions: 0.9.0 Reporter: Omer Katz When installing pyarrow 0.9.0 pipenv complains that futures has no matching versions. While that may be a bug in pipenv it does not matter. The standard way to specify a conditional dependency is using a marker. We should use the python_version marker to tell pip if it should install futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
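The marker-based approach the ticket proposes attaches a PEP 508 environment marker to the requirement string itself, letting pip decide at install time whether `futures` is needed, instead of branching on `sys.version_info` in `setup.py`. A minimal sketch of what the declaration could look like (illustrative; the exact spelling is settled in the associated pull request):

```python
# Instead of conditionally appending 'futures' when running under
# Python 2, a PEP 508 environment marker makes the condition part of
# the requirement string, evaluated by pip against the target
# interpreter at install time.
install_requires = (
    'numpy >= 1.10',
    'six >= 1.0.0',
    # Only installed on interpreters that lack concurrent.futures:
    'futures; python_version < "3.2"',
)

print(any(r.startswith('futures') for r in install_requires))  # True
```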
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439407#comment-16439407 ] ASF GitHub Bot commented on ARROW-2464: --- thedrow commented on issue #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879#issuecomment-381591253 I opened a ticket. I can't change the branch name without opening a new PR. Is this sufficient? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2464:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439461#comment-16439461 ] ASF GitHub Bot commented on ARROW-2463: --- xhochy opened a new pull request #1898: ARROW-2463: [C++] Update flatbuffers to 1.9.0 URL: https://github.com/apache/arrow/pull/1898 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Update flatbuffers to 1.9.0 > - > > Key: ARROW-2463 > URL: https://issues.apache.org/jira/browse/ARROW-2463 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > This will update externalproject and manylinux1 installations of Flatbuffers. > The conda-forge update is at > https://github.com/conda-forge/flatbuffers-feedstock/pull/9 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2463) [C++] Update flatbuffers to 1.9.0
[ https://issues.apache.org/jira/browse/ARROW-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2463:
----------------------------------

    Labels: pull-request-available  (was: )
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439590#comment-16439590 ] ASF GitHub Bot commented on ARROW-2464: --- pitrou commented on issue #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879#issuecomment-381646788 Yes, it should be ok. Also thanks for explaining the bug on the JIRA issue. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2464.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1879
[https://github.com/apache/arrow/pull/1879]
[jira] [Commented] (ARROW-2464) [Python] Use a python_version marker instead of a condition
[ https://issues.apache.org/jira/browse/ARROW-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439591#comment-16439591 ] ASF GitHub Bot commented on ARROW-2464: --- pitrou closed pull request #1879: ARROW-2464: [Python] Use a python_version marker instead of a condition URL: https://github.com/apache/arrow/pull/1879 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/setup.py b/python/setup.py index 20b2416da4..8d26e092bc 100644 --- a/python/setup.py +++ b/python/setup.py @@ -447,10 +447,11 @@ def has_ext_modules(foo): return True -install_requires = ['numpy >= 1.10', 'six >= 1.0.0'] - -if sys.version_info.major == 2: -install_requires.append('futures') +install_requires = ( +'numpy >= 1.10', +'six >= 1.0.0', +'futures;python_version<"3.2"' +) def parse_version(root): This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Use a python_version marker instead of a condition > --- > > Key: ARROW-2464 > URL: https://issues.apache.org/jira/browse/ARROW-2464 > Project: Apache Arrow > Issue Type: Task > Components: Packaging, Python >Affects Versions: 0.9.0 >Reporter: Omer Katz >Priority: Minor > Labels: pull-request-available > Fix For: 0.10.0 > > > When installing pyarrow 0.9.0 pipenv complains that futures has no matching > versions. > While that may be a bug in pipenv it does not matter. The standard way to > specify a conditional dependency is using a marker. > We should use the python_version marker to tell pip if it should install > futures or not. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoine Pitrou resolved ARROW-2454.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.10.0

Issue resolved by pull request 1897
[https://github.com/apache/arrow/pull/1897]
[jira] [Commented] (ARROW-2454) [Python] Empty chunked array slice crashes
[ https://issues.apache.org/jira/browse/ARROW-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439595#comment-16439595 ] ASF GitHub Bot commented on ARROW-2454: --- pitrou closed pull request #1897: ARROW-2454: [C++] Allow zero-array chunked arrays URL: https://github.com/apache/arrow/pull/1897 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index b1cf6e59a2..0b9f75df19 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -43,7 +43,9 @@ class TestChunkedArray : public TestBase { protected: virtual void Construct() { one_ = std::make_shared(arrays_one_); -another_ = std::make_shared(arrays_another_); +if (!arrays_another_.empty()) { + another_ = std::make_shared(arrays_another_); +} } ArrayVector arrays_one_; @@ -121,6 +123,23 @@ TEST_F(TestChunkedArray, SliceEquals) { std::shared_ptr slice2 = one_->Slice(75)->Slice(25)->Slice(25, 50); ASSERT_EQ(slice2->length(), 50); test::AssertChunkedEqual(*slice, *slice2); + + // Making empty slices of a ChunkedArray + std::shared_ptr slice3 = one_->Slice(one_->length(), 99); + ASSERT_EQ(slice3->length(), 0); + ASSERT_EQ(slice3->num_chunks(), 0); + ASSERT_TRUE(slice3->type()->Equals(one_->type())); + + std::shared_ptr slice4 = one_->Slice(10, 0); + ASSERT_EQ(slice4->length(), 0); + ASSERT_EQ(slice4->num_chunks(), 0); + ASSERT_TRUE(slice4->type()->Equals(one_->type())); + + // Slicing an empty ChunkedArray + std::shared_ptr slice5 = slice4->Slice(0, 10); + ASSERT_EQ(slice5->length(), 0); + ASSERT_EQ(slice5->num_chunks(), 0); + ASSERT_TRUE(slice5->type()->Equals(one_->type())); } class TestColumn : public TestChunkedArray { diff --git a/cpp/src/arrow/table.cc 
b/cpp/src/arrow/table.cc index f6ac6dd3b1..8af47ea220 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -39,13 +39,25 @@ namespace arrow { ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { length_ = 0; null_count_ = 0; + DCHECK_GT(chunks.size(), 0) + << "cannot construct ChunkedArray from empty vector and omitted type"; + type_ = chunks[0]->type(); for (const std::shared_ptr& chunk : chunks) { length_ += chunk->length(); null_count_ += chunk->null_count(); } } -std::shared_ptr ChunkedArray::type() const { return chunks_[0]->type(); } +ChunkedArray::ChunkedArray(const ArrayVector& chunks, + const std::shared_ptr& type) +: chunks_(chunks), type_(type) { + length_ = 0; + null_count_ = 0; + for (const std::shared_ptr& chunk : chunks) { +length_ += chunk->length(); +null_count_ += chunk->null_count(); + } +} bool ChunkedArray::Equals(const ChunkedArray& other) const { if (length_ != other.length()) { @@ -107,20 +119,20 @@ std::shared_ptr ChunkedArray::Slice(int64_t offset, int64_t length DCHECK_LE(offset, length_); int curr_chunk = 0; - while (offset >= chunk(curr_chunk)->length()) { + while (curr_chunk < num_chunks() && offset >= chunk(curr_chunk)->length()) { offset -= chunk(curr_chunk)->length(); curr_chunk++; } ArrayVector new_chunks; - while (length > 0 && curr_chunk < num_chunks()) { + while (curr_chunk < num_chunks() && length > 0) { new_chunks.push_back(chunk(curr_chunk)->Slice(offset, length)); length -= chunk(curr_chunk)->length() - offset; offset = 0; curr_chunk++; } - return std::make_shared(new_chunks); + return std::make_shared(new_chunks, type_); } std::shared_ptr ChunkedArray::Slice(int64_t offset) const { @@ -129,15 +141,15 @@ std::shared_ptr ChunkedArray::Slice(int64_t offset) const { Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : field_(field) { - data_ = std::make_shared(chunks); + data_ = std::make_shared(chunks, field->type()); } Column::Column(const std::shared_ptr& field, const 
std::shared_ptr& data) : field_(field) { if (!data) { -data_ = std::make_shared(ArrayVector({})); +data_ = std::make_shared(ArrayVector({}), field->type()); } else { -data_ = std::make_shared(ArrayVector({data})); +data_ = std::make_shared(ArrayVector({data}), field->type()); } } diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 20d027d6a5..32af224ff4 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -40,6 +40,7 @@ class Status; class ARROW_EXPORT ChunkedArray { public: explicit ChunkedArray(const ArrayVector& chunks); + ChunkedArray(const ArrayVector& chunks, const std::shared_ptr& type); /// \return the total length of the chunked array; computed on co
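The guards added to both loops (`curr_chunk < num_chunks()`) are what make empty and past-the-end slices safe: before the patch, an offset at or beyond the total length walked off the end of the chunk vector. The patched slicing logic can be sketched in pure Python over plain lists standing in for chunks (a hypothetical `slice_chunks` helper, not Arrow's API):

```python
def slice_chunks(chunks, offset, length):
    """Sketch of the patched ChunkedArray::Slice: chunks is a list of
    lists; return the new chunk list covering [offset, offset+length).
    The `curr < len(chunks)` guards are the essence of the fix."""
    curr = 0
    # Skip whole chunks that lie entirely before the slice start.
    while curr < len(chunks) and offset >= len(chunks[curr]):
        offset -= len(chunks[curr])
        curr += 1
    out = []
    # Take pieces of chunks until the requested length is exhausted.
    while curr < len(chunks) and length > 0:
        out.append(chunks[curr][offset:offset + length])
        length -= len(chunks[curr]) - offset
        offset = 0
        curr += 1
    return out

print(slice_chunks([[1, 2, 3]], 3, 99))  # [] - empty slice, no crash
```

This matches the new `SliceEquals` test cases above: slicing at the end, slicing with length 0, and slicing an already-empty result all yield a zero-chunk array of the original type.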
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439603#comment-16439603 ] Antoine Pitrou commented on ARROW-2372: --- This may have been fixed with ARROW-2369. Is there a possibility for you to test with Arrow git master? > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
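As an aside, the conversion loop in the report can be hardened so that `writer.close()` runs even if a chunk fails mid-conversion; an unclosed Parquet writer leaves the file footer unwritten, which by itself produces unreadable files. A sketch of the pattern, with a stand-in `Writer` class (hypothetical) in place of `pq.ParquetWriter`:

```python
# Chunked-write loop with a guaranteed close via contextlib.closing.
# Writer stands in for pq.ParquetWriter (illustrative only); the loop
# shape matches the script in the report.
from contextlib import closing

class Writer:
    def __init__(self):
        self.blocks = []
        self.closed = False

    def write_table(self, block):
        self.blocks.append(block)

    def close(self):
        self.closed = True

def write_in_chunks(chunks):
    writer = Writer()
    with closing(writer):  # close() runs even if write_table() raises
        for block in chunks:
            writer.write_table(block)
    return writer
```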
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439646#comment-16439646 ] Kyle Barron commented on ARROW-2372: Sorry, I couldn't figure out how to build Arrow and Parquet. I tried to follow [https://github.com/apache/arrow/blob/master/python/doc/source/development.rst] with Conda exactly, but I get errors. Specifically, I think it's trying to use gcc 7.2.0 instead of 4.9. I might just have to wait for 0.9.1. > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439735#comment-16439735 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381681329 Thanks for the clarification of Python 2 behaviour @xhochy , and thanks for the fix @joshuastorck ! This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
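The intended behaviour — bytes-like input infers Arrow binary, but an explicitly requested string type should win rather than being silently ignored — can be sketched as a small conversion rule. This is an illustrative model in Python 3, where `bytes` plays the role of Python 2 `str`; it is not pyarrow's actual conversion path:

```python
# Model of the conversion rule ARROW-2101 asks for: infer binary from
# bytes, but honor an explicit requested type of "string" by decoding.
# Illustrative only; type names are simplified stand-ins.

def infer_arrow_type(value):
    return "binary" if isinstance(value, bytes) else "string"

def convert_values(values, requested_type=None):
    inferred = infer_arrow_type(values[0])
    out_type = requested_type or inferred
    if out_type == "string":
        # An explicit string request decodes bytes instead of ignoring
        # the requested type (the bug being fixed here).
        values = [v.decode("utf8") if isinstance(v, bytes) else v
                  for v in values]
    return out_type, values
```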
[jira] [Assigned] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-2101: --- Assignee: (was: Bryan Cutler) > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439743#comment-16439743 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381682065 @joshuastorck , what is your JIRA username so I can assign the issue to you? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation
[ https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439807#comment-16439807 ] ASF GitHub Bot commented on ARROW-2430: --- kszucs commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based packaging automation URL: https://github.com/apache/arrow/pull/1869#issuecomment-381697835 @wesm [Updated.](https://github.com/kszucs/arrow/blob/6a2b126bcf99b051c5a852afaece01c60586f815/cd/crossbow.py) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > MVP for branch based packaging automation > - > > Key: ARROW-2430 > URL: https://issues.apache.org/jira/browse/ARROW-2430 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > Described in > https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
Antoine Pitrou created ARROW-2465: - Summary: [Plasma] plasma_store fails to find libarrow_gpu.so Key: ARROW-2465 URL: https://issues.apache.org/jira/browse/ARROW-2465 Project: Apache Arrow Issue Type: Bug Components: GPU, Plasma (C++) Affects Versions: 0.9.0 Reporter: Antoine Pitrou After install, I get the following: {code:bash} $ which plasma_store /home/antoine/miniconda3/envs/pyarrow/bin/plasma_store $ plasma_store plasma_store: error while loading shared libraries: libarrow_gpu.so.0: cannot open shared object file: No such file or directory $ ldd `which plasma_store` linux-vdso.so.1 => (0x7ffe7bdf) libarrow_gpu.so.0 => not found libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x7f5d81676000) libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x7f5d812ee000) libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f5d80fe5000) libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x7f5d80dce000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f5d80a04000) /lib64/ld-linux-x86-64.so.2 (0x7f5d81893000) {code} Note that {{libarrow_gpu.so}} is installed in {{/home/antoine/miniconda3/envs/pyarrow/lib/}} There are probably two solutions: * link statically with the Arrow GPU libs (I wonder why this isn't done like it is for the Arrow libs) * or make the rpath correct -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439808#comment-16439808 ] Antoine Pitrou commented on ARROW-2465: --- [~wapaul] > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439823#comment-16439823 ] ASF GitHub Bot commented on ARROW-2442: --- pitrou opened a new pull request #1900: ARROW-2442: [C++] Disambiguate builder Append() overloads URL: https://github.com/apache/arrow/pull/1900 Vector-style Append() methods are renamed AppendValues(). The original methods are marked deprecated. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available > > See discussion in > [https://github.com/apache/arrow/pull/1852#discussion_r179919627] > There are various {{Append()}} overloads in Builder and subclasses, some of > which append one value, some of which append multiple values at once. > The API might be clearer and less error-prone if multiple-append variants > were named differently, for example {{AppendValues()}}. Especially with the > pointer-taking variants, it's probably easy to call the wrong overload by > mistake. > The existing methods would have to go through a deprecation cycle. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
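The rename in this PR — single-value Append() versus vector-style AppendValues(), with the old spellings kept through a deprecation cycle — can be sketched in Python with a toy builder (illustrative only; the real change is to the Arrow C++ builder classes):

```python
# Toy builder showing the single-value / multi-value split and a
# deprecated alias for the old vector-style spelling. Not the Arrow
# C++ API; names are Pythonized for illustration.
import warnings

class Int64Builder:
    def __init__(self):
        self.values = []

    def append(self, value):
        # Unambiguous: always appends exactly one value.
        self.values.append(value)

    def append_values(self, values):
        # Unambiguous: always appends a sequence of values.
        self.values.extend(values)

    def append_many(self, values):
        # Old vector-style spelling, kept for one deprecation cycle.
        warnings.warn("use append_values() instead", DeprecationWarning)
        self.append_values(values)
```

The point of the split is that a caller can no longer pick a multi-value overload by accident when passing a pointer-like argument, which is the error-prone case the JIRA describes.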
[jira] [Updated] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2442: -- Labels: beginner pull-request-available (was: beginner) > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439825#comment-16439825 ] ASF GitHub Bot commented on ARROW-2465: --- pitrou opened a new pull request #1901: ARROW-2465: [Plasma/GPU] Preserve plasma_store rpath URL: https://github.com/apache/arrow/pull/1901 This allows it to find libarrow_gpu.so when installed This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2442) [C++] Disambiguate Builder::Append overloads
[ https://issues.apache.org/jira/browse/ARROW-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439826#comment-16439826 ] ASF GitHub Bot commented on ARROW-2442: --- pitrou commented on issue #1900: ARROW-2442: [C++] Disambiguate builder Append() overloads URL: https://github.com/apache/arrow/pull/1900#issuecomment-381702739 Is it worth adding deprecation pragmas so that users of those functions get a compiler warning? See https://stackoverflow.com/questions/295120/c-mark-as-deprecated This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Disambiguate Builder::Append overloads > > > Key: ARROW-2442 > URL: https://issues.apache.org/jira/browse/ARROW-2442 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: beginner, pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2465) [Plasma] plasma_store fails to find libarrow_gpu.so
[ https://issues.apache.org/jira/browse/ARROW-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2465: -- Labels: pull-request-available (was: ) > [Plasma] plasma_store fails to find libarrow_gpu.so > --- > > Key: ARROW-2465 > URL: https://issues.apache.org/jira/browse/ARROW-2465 > Project: Apache Arrow > Issue Type: Bug > Components: GPU, Plasma (C++) >Affects Versions: 0.9.0 >Reporter: Antoine Pitrou >Priority: Major > Labels: pull-request-available -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1993) [Python] Add function for determining implied Arrow schema from pandas.DataFrame
[ https://issues.apache.org/jira/browse/ARROW-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1993: --- Labels: beginner (was: ) > [Python] Add function for determining implied Arrow schema from > pandas.DataFrame > > > Key: ARROW-1993 > URL: https://issues.apache.org/jira/browse/ARROW-1993 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently the only option is to use {{Table/Array.from_pandas}} which does > significant unnecessary work and allocates memory. If only the schema is of > interest, then we could do less work and not allocate memory -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1993) [Python] Add function for determining implied Arrow schema from pandas.DataFrame
[ https://issues.apache.org/jira/browse/ARROW-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1993: --- Description: Currently the only option is to use {{Table/Array.from_pandas}} which does significant unnecessary work and allocates memory. If only the schema is of interest, then we could do less work and not allocate memory. We should provide the user a function {{pyarrow.Schema.from_pandas}} which takes a DataFrame as an input and returns the respective Arrow schema. was: Currently the only option is to use {{Table/Array.from_pandas}} which does significant unnecessary work and allocates memory. If only the schema is of interest, then we could do less work and not allocate memory > [Python] Add function for determining implied Arrow schema from > pandas.DataFrame > > > Key: ARROW-1993 > URL: https://issues.apache.org/jira/browse/ARROW-1993 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
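The key property of the proposed function is that it only needs to look at column dtypes, never at the values, so no Arrow memory is allocated. A sketch of that idea over a plain dtype-name mapping (the mapping below is a simplified assumption, not pyarrow's full inference rules):

```python
# Schema inference from dtype names alone, without touching any data.
# The dtype-to-Arrow mapping is deliberately tiny and illustrative.
DTYPE_TO_ARROW = {
    "int64": "int64",
    "float64": "double",
    "bool": "bool",
    "object": "string",  # simplification: assume object columns hold text
}

def schema_from_dtypes(dtypes):
    """dtypes: mapping of column name -> numpy-style dtype name.
    Returns (name, arrow_type) pairs; raises on unsupported dtypes."""
    unknown = [d for d in dtypes.values() if d not in DTYPE_TO_ARROW]
    if unknown:
        raise TypeError("unsupported dtypes: %s" % unknown)
    return [(name, DTYPE_TO_ARROW[d]) for name, d in dtypes.items()]
```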
[jira] [Updated] (ARROW-1894) [Python] Treat CPython memoryview or buffer objects equivalently to pyarrow.Buffer in pyarrow.serialize
[ https://issues.apache.org/jira/browse/ARROW-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1894: --- Labels: beginner (was: ) > [Python] Treat CPython memoryview or buffer objects equivalently to > pyarrow.Buffer in pyarrow.serialize > --- > > Key: ARROW-1894 > URL: https://issues.apache.org/jira/browse/ARROW-1894 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > These should be treated as Buffer-like on serialize. We should consider how > to "box" the buffers as the appropriate kind of object (Buffer, memoryview, > etc.) when being deserialized -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1983: --- Description: Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file (mostly just schema information). It would be useful to add the ability to write a {{_metadata}} file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand. This would require that the user is able to get the written RowGroups out of a {{pyarrow.parquet.write_table}} call and then give these objects as a list to a new function that passes them on as C++ objects to {{parquet-cpp}}, which generates the respective {{_metadata}} file. was: Currently `pyarrow.parquet` can only write the `_common_metadata` file (mostly just schema information). It would be useful to add the ability to write a `_metadata` file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand. > [Python] Add ability to write parquet `_metadata` file > -- > > Key: ARROW-1983 > URL: https://issues.apache.org/jira/browse/ARROW-1983 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Major > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
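The kind of per-row-group summary a _metadata file would carry, and how a reader could use it to skip row groups, can be sketched as follows (illustrative only; the real _metadata file holds Thrift-encoded Parquet metadata, not Python dicts):

```python
def summarize_row_groups(row_groups):
    """row_groups: list of dicts mapping column name -> list of values.
    Returns per-row-group min/max statistics, the kind of summary a
    _metadata file would hold so readers can prune row groups without
    opening each data file."""
    summary = []
    for rg in row_groups:
        stats = {col: {"min": min(vals),
                       "max": max(vals),
                       "num_values": len(vals)}
                 for col, vals in rg.items()}
        summary.append(stats)
    return summary

def prune(summary, column, predicate_min):
    # Keep only row groups whose max could satisfy value >= predicate_min;
    # everything else is skipped without any file I/O.
    return [i for i, stats in enumerate(summary)
            if stats[column]["max"] >= predicate_min]
```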
[jira] [Updated] (ARROW-1731) [Python] Provide for selecting a subset of columns to convert in RecordBatch/Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1731: --- Labels: beginner (was: ) > [Python] Provide for selecting a subset of columns to convert in > RecordBatch/Table.from_pandas > -- > > Key: ARROW-1731 > URL: https://issues.apache.org/jira/browse/ARROW-1731 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently it's all-or-nothing, and to do the subsetting in pandas incurs a > data copy. This would enable columns (by name or index) to be selected out > without additional data copying > cc [~cpcloud] [~jreback] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2082: -- Labels: pull-request-available (was: ) > [Python] SegFault in pyarrow.parquet.write_table with specific options > -- > > Key: ARROW-2082 > URL: https://issues.apache.org/jira/browse/ARROW-2082 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu > Xenial (Python 3.5) >Reporter: Clément Bouscasse >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > I originally filed an issue in the pandas project but we've tracked it down > to arrow itself, when called via pandas in specific circumstances: > [https://github.com/pandas-dev/pandas/issues/19493] > basically using > {code:java} > df.to_parquet('filename.parquet', flavor='spark'){code} > gives a seg fault if `df` contains a datetime column. > Under the covers, pandas translates this to the following call: > {code:java} > pq.write_table(table, 'output.parquet', flavor='spark', compression='snappy', > coerce_timestamps='ms') > {code} > which gives me an instant crash. > There is a repro on the github ticket. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2082) [Python] SegFault in pyarrow.parquet.write_table with specific options
[ https://issues.apache.org/jira/browse/ARROW-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439852#comment-16439852 ] ASF GitHub Bot commented on ARROW-2082: --- joshuastorck opened a new pull request #456: ARROW-2082: Prevent segfault that was occurring when writing a nanosecond timestamp with arrow writer properties set to coerce timestamps and support deprecated int96 timestamps. URL: https://github.com/apache/parquet-cpp/pull/456 The bug was due to the fact that the physical type was int64 but the WriteTimestamps function was taking a path that assumed the physical type was int96. This caused memory corruption because it was writing past the end of the array. The bug was fixed by checking that coerce timestamps is disabled when writing int96. A unit test was added for the regression. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] SegFault in pyarrow.parquet.write_table with specific options > -- > > Key: ARROW-2082 > URL: https://issues.apache.org/jira/browse/ARROW-2082 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: tested on MacOS High Sierra with python 3.6 and Ubuntu > Xenial (Python 3.5) >Reporter: Clément Bouscasse >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting
[ https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-799: -- Labels: beginner (was: ) > [Java] Provide guidance in documentation for using Arrow in an uberjar > setting > --- > > Key: ARROW-799 > URL: https://issues.apache.org/jira/browse/ARROW-799 > Project: Apache Arrow > Issue Type: Task >Reporter: Jingyuan Wang >Assignee: Li Jin >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently, ArrowBuf class directly access the package-private fields of > AbstractByteBuf class which makes shading Apache Arrow problematic. If we > relocate io.netty namespace excluding io.netty.buffer.ArrowBuf, it would > throw out IllegalAccessException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-889) [Python] Add nicer __repr__ for Column
[ https://issues.apache.org/jira/browse/ARROW-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-889: -- Labels: beginner (was: ) > [Python] Add nicer __repr__ for Column > -- > > Key: ARROW-889 > URL: https://issues.apache.org/jira/browse/ARROW-889 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1715) [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, Table
[ https://issues.apache.org/jira/browse/ARROW-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1715: --- Labels: beginner (was: ) > [Python] Implement pickling for Array, Column, ChunkedArray, RecordBatch, > Table > --- > > Key: ARROW-1715 > URL: https://issues.apache.org/jira/browse/ARROW-1715 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-799) [Java] Provide guidance in documentation for using Arrow in an uberjar setting
[ https://issues.apache.org/jira/browse/ARROW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-799: -- Issue Type: Improvement (was: Task) > [Java] Provide guidance in documentation for using Arrow in an uberjar > setting > --- > > Key: ARROW-799 > URL: https://issues.apache.org/jira/browse/ARROW-799 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Jingyuan Wang >Assignee: Li Jin >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > Currently, ArrowBuf class directly access the package-private fields of > AbstractByteBuf class which makes shading Apache Arrow problematic. If we > relocate io.netty namespace excluding io.netty.buffer.ArrowBuf, it would > throw out IllegalAccessException. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1388) [Python] Add Table.drop method for removing columns
[ https://issues.apache.org/jira/browse/ARROW-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-1388: --- Description: See ARROW-1374 for a use case. This function should take as an input a list of columns and return a new Table instance without them. (was: See ARROW-1374 for a use case) > [Python] Add Table.drop method for removing columns > --- > > Key: ARROW-1388 > URL: https://issues.apache.org/jira/browse/ARROW-1388 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > See ARROW-1374 for a use case. This function should take as an input a list > of columns and return a new Table instance without them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439856#comment-16439856 ] ASF GitHub Bot commented on ARROW-2101: --- joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381708268 @BryanCutler, my JIRA username is joshuastorck This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2057) [Python] Configure size of data pages in pyarrow.parquet.write_table
[ https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe L. Korn updated ARROW-2057: --- Description: It would be useful to be able to set the size of data pages (within Parquet column chunks) from Python. The current default is set to 1MiB at https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. It might be useful in some situations to lower this for more granular access. We should provide this value as a parameter to {{pyarrow.parquet.write_table}}. was:It would be useful to be able to set the size of data pages (within Parquet column chunks) from Python > [Python] Configure size of data pages in pyarrow.parquet.write_table > > > Key: ARROW-2057 > URL: https://issues.apache.org/jira/browse/ARROW-2057 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: beginner > Fix For: 0.10.0 > > > It would be useful to be able to set the size of data pages (within Parquet > column chunks) from Python. The current default is set to 1MiB at > https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81. > It might be useful in some situations to lower this for more granular access. > We should provide this value as a parameter to > {{pyarrow.parquet.write_table}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439884#comment-16439884 ] Antoine Pitrou commented on ARROW-2372: --- Ok, I have downloaded the dataset and confirm that it works on git master. > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439885#comment-16439885 ] Kyle Barron commented on ARROW-2372: Awesome thanks! > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. 
To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. > > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-2372) ArrowIOError: Invalid argument
[ https://issues.apache.org/jira/browse/ARROW-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kyle Barron closed ARROW-2372. -- Resolution: Fixed > ArrowIOError: Invalid argument > -- > > Key: ARROW-2372 > URL: https://issues.apache.org/jira/browse/ARROW-2372 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Ubuntu 16.04 >Reporter: Kyle Barron >Priority: Major > Fix For: 0.9.1 > > > I get an ArrowIOError when reading a specific file that was also written by > pyarrow. Specifically, the traceback is: > {code:python} > >>> import pyarrow.parquet as pq > >>> pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > --- > ArrowIOError Traceback (most recent call last) > in () > > 1 pf = pq.ParquetFile('gaz2016zcta5distancemiles.parquet') > ~/local/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py in > _init_(self, source, metadata, common_metadata) > 62 self.reader = ParquetReader() > 63 source = _ensure_file(source) > ---> 64 self.reader.open(source, metadata=metadata) > 65 self.common_metadata = common_metadata > 66 self._nested_paths_by_prefix = self._build_nested_paths() > _parquet.pyx in pyarrow._parquet.ParquetReader.open() > error.pxi in pyarrow.lib.check_status() > ArrowIOError: Arrow error: IOError: [Errno 22] Invalid argument > {code} > Here's a reproducible example with the specific file I'm working with. I'm > converting a 34 GB csv file to parquet in chunks of roughly 2GB each. To get > the source data: > {code:bash} > wget > https://www.nber.org/distance/2016/gaz/zcta5/gaz2016zcta5distancemiles.csv.zip > unzip gaz2016zcta5distancemiles.csv.zip{code} > Then the basic idea from the [pyarrow Parquet > documentation|https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing] > is instantiating the writer class; looping over chunks of the csv and > writing them to parquet; then closing the writer object. 
> > {code:python} > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > from pathlib import Path > zcta_file = Path('gaz2016zcta5distancemiles.csv') > itr = pd.read_csv( > zcta_file, > header=0, > dtype={'zip1': str, 'zip2': str, 'mi_to_zcta5': np.float64}, > engine='c', > chunksize=64617153) > schema = pa.schema([ > pa.field('zip1', pa.string()), > pa.field('zip2', pa.string()), > pa.field('mi_to_zcta5', pa.float64())]) > writer = pq.ParquetWriter('gaz2016zcta5distancemiles.parquet', schema=schema) > print(f'Starting conversion') > i = 0 > for df in itr: > i += 1 > print(f'Finished reading csv block {i}') > table = pa.Table.from_pandas(df, preserve_index=False, nthreads=3) > writer.write_table(table) > print(f'Finished writing parquet block {i}') > writer.close() > {code} > Then running this python script produces the file > {code:java} > gaz2016zcta5distancemiles.parquet{code} > , but just attempting to read the metadata with `pq.ParquetFile()` produces > the above exception. > I tested this with pyarrow 0.8 and pyarrow 0.9. I assume that pandas would > complain on import of the csv if the columns in the data were not `string`, > `string`, and `float64`, so I think creating the Parquet schema in that way > should be fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439892#comment-16439892 ] Joshua Storck commented on ARROW-2429: -- If you invoke the write_table function as follows, the type will not change: {code:python} pq.write_table(table, 'foo.parquet', use_deprecated_int96_timestamps=True) {code} > [Python] Timestamp unit in schema changes when writing to Parquet file then > reading back > > > Key: ARROW-2429 > URL: https://issues.apache.org/jira/browse/ARROW-2429 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Mac OS High Sierra > PyArrow 0.9.0 (py36_1) > Python >Reporter: Dave Challis >Priority: Minor > > When creating an Arrow table from a Pandas DataFrame, the table schema > contains a field of type `timestamp[ns]`. > When serialising that table to a parquet file and then immediately reading it > back, the schema of the table read instead contains a field with type > `timestamp[us]`. > Minimal example: > > {code:python} > #!/usr/bin/env python > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > # create DataFrame with a datetime column > df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']}) > df['created'] = pd.to_datetime(df['created']) > # create Arrow table from DataFrame > table = pa.Table.from_pandas(df, preserve_index=False) > # write the table as a parquet file, then read it back again > pq.write_table(table, 'foo.parquet') > table2 = pq.read_table('foo.parquet') > print(table.schema[0]) # pyarrow.Field (nanosecond > units) > print(table2.schema[0]) # pyarrow.Field (microsecond > units) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439912#comment-16439912 ] ASF GitHub Bot commented on ARROW-2101: --- BryanCutler commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381719872 It looks like you need to be given rights to have issues assigned, and I guess I'm not able to do that. @pitrou or @xhochy , would you mind doing this? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
[ https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439914#comment-16439914 ] ASF GitHub Bot commented on ARROW-2101: --- pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert numpy arrays of bytes to arrow arrays of strings when user specifies arrow type of string URL: https://github.com/apache/arrow/pull/1886#issuecomment-381720123 I'm not able to do it either, but I think @xhochy is :-) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] from_pandas reads 'str' type as binary Arrow data with Python 2 > > > Key: ARROW-2101 > URL: https://issues.apache.org/jira/browse/ARROW-2101 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Bryan Cutler >Priority: Major > Labels: pull-request-available > Fix For: 0.10.0 > > > Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow > data of binary type, even if the user supplies type information. conversion > of 'unicode' type works to create Arrow data of string types. For example > {code} > In [25]: pa.Array.from_pandas(pd.Series(['a'])).type > Out[25]: DataType(binary) > In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type > Out[26]: DataType(binary) > In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type > Out[27]: DataType(string) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2393) [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK
[ https://issues.apache.org/jira/browse/ARROW-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439950#comment-16439950 ] Joshua Storck commented on ARROW-2393: -- I don't think the ARROW_CHECK_OK and ARROW_CHECK_OK_PREPEND macros should be in status.h. They use the logging facilities and should probably be in logging.h, which shouldn't be visible. The interesting thing is that the RETURN_NOT_OK macros don't work outside of the arrow namespace. I think they need to be updated to use ::arrow::Status in their bodies. [~wesmckinn], [~pitrou], or [~cpcloud], does that make sense? If so, I'll submit a PR. > [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK > -- > > Key: ARROW-2393 > URL: https://issues.apache.org/jira/browse/ARROW-2393 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: dennis lucero >Priority: Trivial > > test.cpp > {code:c++} > #include > int main(void) { > arrow::Int64Builder i64builder; > std::shared_ptr i64array; > ARROW_CHECK_OK(i64builder.Finish(&i64array)); > return EXIT_SUCCESS; > } > {code} > Attempt to build: > {code:bash} > $CXX test.cpp -std=c++11 -larrow > {code} > Error: > {code} > test.cpp:6:2: error: use of undeclared identifier 'ARROW_CHECK' > ARROW_CHECK_OK(i64builder.Finish(&i64array)); ^ > xxx/include/arrow/status.h:49:27: note: expanded from macro 'ARROW_CHECK_OK' > #define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status") ^ > xxx/include/arrow/status.h:44:5: note: expanded from macro > 'ARROW_CHECK_OK_PREPEND' ARROW_CHECK(_s.ok()) << (msg) << ": " << > _s.ToString(); \ ^ 1 error generated. > {code} > I expect that ARROW_* macro are public API, and should work out of the box. 
> A naive attempt to fix it > {code} > diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h > index 84f55e41..6da4a773 100644 > --- a/cpp/src/arrow/status.h > +++ b/cpp/src/arrow/status.h > @@ -25,6 +25,7 @@ > #include "arrow/util/macros.h" > #include "arrow/util/visibility.h" > +#include "arrow/util/logging.h" > // Return the given status if it is not OK. > #define ARROW_RETURN_NOT_OK(s) \ > {code} > fails with > {code} > public-api-test.cc:21:2: error: "DCHECK should not be visible from Arrow > public headers." > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2393) [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK
[ https://issues.apache.org/jira/browse/ARROW-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439953#comment-16439953 ] Phillip Cloud commented on ARROW-2393: -- That sounds right to me. > [C++] arrow/status.h does not define ARROW_CHECK needed for ARROW_CHECK_OK > -- > > Key: ARROW-2393 > URL: https://issues.apache.org/jira/browse/ARROW-2393 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.9.0 >Reporter: dennis lucero >Priority: Trivial > > test.cpp > {code:c++} > #include > int main(void) { > arrow::Int64Builder i64builder; > std::shared_ptr i64array; > ARROW_CHECK_OK(i64builder.Finish(&i64array)); > return EXIT_SUCCESS; > } > {code} > Attempt to build: > {code:bash} > $CXX test.cpp -std=c++11 -larrow > {code} > Error: > {code} > test.cpp:6:2: error: use of undeclared identifier 'ARROW_CHECK' > ARROW_CHECK_OK(i64builder.Finish(&i64array)); ^ > xxx/include/arrow/status.h:49:27: note: expanded from macro 'ARROW_CHECK_OK' > #define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status") ^ > xxx/include/arrow/status.h:44:5: note: expanded from macro > 'ARROW_CHECK_OK_PREPEND' ARROW_CHECK(_s.ok()) << (msg) << ": " << > _s.ToString(); \ ^ 1 error generated. > {code} > I expect that ARROW_* macro are public API, and should work out of the box. > A naive attempt to fix it > {code} > diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h > index 84f55e41..6da4a773 100644 > --- a/cpp/src/arrow/status.h > +++ b/cpp/src/arrow/status.h > @@ -25,6 +25,7 @@ > #include "arrow/util/macros.h" > #include "arrow/util/visibility.h" > +#include "arrow/util/logging.h" > // Return the given status if it is not OK. > #define ARROW_RETURN_NOT_OK(s) \ > {code} > fails with > {code} > public-api-test.cc:21:2: error: "DCHECK should not be visible from Arrow > public headers." > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)