[jira] [Commented] (ARROW-2262) [Python] Support slicing on pyarrow.ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391042#comment-16391042 ]

ASF GitHub Bot commented on ARROW-2262:
---------------------------------------

pitrou commented on a change in pull request #1702: ARROW-2262: [Python] Support slicing on pyarrow.ChunkedArray
URL: https://github.com/apache/arrow/pull/1702#discussion_r173120083

## File path: python/pyarrow/table.pxi

@@ -77,6 +77,52 @@ cdef class ChunkedArray:
         self._check_nullptr()
         return self.chunked_array.null_count()

+    def __getitem__(self, key):
+        cdef int64_t item
+        cdef int i
+        self._check_nullptr()
+        if isinstance(key, slice):
+            return _normalize_slice(self, key)
+        elif isinstance(key, six.integer_types):
+            item = key
+            if item >= self.chunked_array.length() or item < 0:
+                raise IndexError("ChunkedArray selection out of bounds")

Review comment: If we allow negative slice bounds, I would expect us to also allow negative indices. Seems like it's time for a `_normalize_index` function?

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Support slicing on pyarrow.ChunkedArray
>
> Key: ARROW-2262
> URL: https://issues.apache.org/jira/browse/ARROW-2262
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Python
> Reporter: Uwe L. Korn
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-2288) [Python] slicing logic defective
Antoine Pitrou created ARROW-2288:
----------------------------------

Summary: [Python] slicing logic defective
Key: ARROW-2288
URL: https://issues.apache.org/jira/browse/ARROW-2288
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

The slicing logic tends to go too far when normalizing large negative bounds, which leads to results not in line with Python's slicing semantics:

{code}
>>> arr = pa.array([1, 2, 3, 4])
>>> arr[-99:100]
[
  2,
  3,
  4
]
>>> arr.to_pylist()[-99:100]
[1, 2, 3, 4]
>>> arr[-6:-5]
[
  3
]
>>> arr.to_pylist()[-6:-5]
[]
{code}

Also note this crash:

{code}
>>> arr[10:13]
/home/antoine/arrow/cpp/src/arrow/array.cc:105 Check failed: (offset) <= (data.length)
Abandon (core dumped)
{code}
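For reference, Python itself exposes the expected normalization through `slice.indices()`, which clamps out-of-range bounds rather than wrapping them a second time; each failing case above maps onto it directly:

```python
length = 4  # len of pa.array([1, 2, 3, 4])

# Large negative start clamps to 0, oversized stop clamps to length:
print(slice(-99, 100).indices(length))  # (0, 4, 1) -> the whole array
# Both bounds negative and past the front collapse to an empty slice:
print(slice(-6, -5).indices(length))    # (0, 0, 1) -> []
# A start past the end clamps to length instead of crashing:
print(slice(10, 13).indices(length))    # (4, 4, 1) -> []
```

Any slicing implementation that computes the same (start, stop, step) triples as `slice.indices()` automatically matches list semantics.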
[jira] [Commented] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391052#comment-16391052 ]

Antoine Pitrou commented on ARROW-2288:
---------------------------------------

As for the crash: since {{Array::Slice}} adjusts the length when too large, it would make sense for it to also adjust the offset instead of crashing, IMO.
[jira] [Commented] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391119#comment-16391119 ]

ASF GitHub Bot commented on ARROW-2288:
---------------------------------------

pitrou opened a new pull request #1723: ARROW-2288: [Python] Fix slicing logic
URL: https://github.com/apache/arrow/pull/1723
[jira] [Updated] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2288:
----------------------------------

Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-2284) [Python] test_plasma error on plasma_store error
[ https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391130#comment-16391130 ]

ASF GitHub Bot commented on ARROW-2284:
---------------------------------------

pitrou opened a new pull request #1724: ARROW-2284: [Python] Fix error display on test_plasma error
URL: https://github.com/apache/arrow/pull/1724

Just a trivial fix: stderr is captured by py.test, not by the subprocess call.

> [Python] test_plasma error on plasma_store error
>
> Key: ARROW-2284
> URL: https://issues.apache.org/jira/browse/ARROW-2284
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Antoine Pitrou
> Assignee: Antoine Pitrou
> Priority: Trivial
> Labels: pull-request-available
>
> This appears caused by my latest changes:
> {code:python}
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 192, in setup_method
>     plasma_store_name, self.p = self.plasma_store_ctx.__enter__()
>   File "/home/antoine/miniconda3/envs/pyarrow/lib/python3.6/contextlib.py", line 81, in __enter__
>     return next(self.gen)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 168, in start_plasma_store
>     err = proc.stderr.read().decode()
> AttributeError: 'NoneType' object has no attribute 'read'
> {code}
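The distinction is easy to see outside the test suite: `Popen.stderr` is only a pipe object when explicitly requested, otherwise it is `None` (a minimal standalone reproduction, not the actual test-suite code):

```python
import subprocess
import sys

# Without stderr=subprocess.PIPE, proc.stderr is None and reading it
# raises the AttributeError from the report. Under py.test the child's
# stderr ends up in the captured test output instead of on the Popen.
proc = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom')"],
    stderr=subprocess.PIPE,
)
_, err = proc.communicate()
print(err.decode())  # boom
```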
[jira] [Updated] (ARROW-2284) [Python] test_plasma error on plasma_store error
[ https://issues.apache.org/jira/browse/ARROW-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2284:
----------------------------------

Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-2288) [Python] slicing logic defective
[ https://issues.apache.org/jira/browse/ARROW-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391158#comment-16391158 ]

ASF GitHub Bot commented on ARROW-2288:
---------------------------------------

pitrou commented on issue #1723: ARROW-2288: [Python] Fix slicing logic
URL: https://github.com/apache/arrow/pull/1723#issuecomment-371470276

AppVeyor at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.173
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391179#comment-16391179 ]

ASF GitHub Bot commented on ARROW-2135:
---------------------------------------

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371474232

Rebased.

> [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
>
> Key: ARROW-2135
> URL: https://issues.apache.org/jira/browse/ARROW-2135
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Reporter: Matthew Gilbert
> Assignee: Antoine Pitrou
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> If you create a {{Table}} from a {{DataFrame}} of ints with a NaN value, the NaN is improperly cast. Since pandas casts these to floats, when converted to a table the NaN is interpreted as an integer. This seems like a bug, since a known limitation in pandas (the inability to hold null-valued integer data) is taking precedence over Arrow's ability to store these as an IntArray with nulls.
>
> {code}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({"a": [1, 2, pd.np.NaN]})
> schema = pa.schema([pa.field("a", pa.int64(), nullable=True)])
> table = pa.Table.from_pandas(df, schema=schema)
> table[0]
> chunk 0:
> [
>   1,
>   2,
>   -9223372036854775808
> ]
> {code}
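The underlying conversion hazard can be reproduced with plain NumPy, independent of pyarrow; masking the NaN entries before the cast (roughly what a null-aware conversion has to do) avoids the sentinel value. This is a sketch of the pitfall, not the PR's actual code:

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan])

# Casting NaN directly to int64 is undefined behaviour; on common
# platforms it produces the INT64_MIN sentinel seen in the report
# (-9223372036854775808) rather than raising an error.
mask = np.isnan(values)
safe = values[~mask].astype("int64")

print(mask.tolist())  # [False, False, True]
print(safe.tolist())  # [1, 2]
```

A null-aware conversion keeps `mask` as the validity bitmap and only casts the valid entries, which is how Arrow can represent the column as nullable int64.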
[jira] [Commented] (ARROW-2135) [Python] NaN values silently casted to int64 when passing explicit schema for conversion in Table.from_pandas
[ https://issues.apache.org/jira/browse/ARROW-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391241#comment-16391241 ]

ASF GitHub Bot commented on ARROW-2135:
---------------------------------------

pitrou commented on issue #1681: ARROW-2135: [Python] Fix NaN conversion when casting from Numpy array
URL: https://github.com/apache/arrow/pull/1681#issuecomment-371484306

AppVeyor at https://ci.appveyor.com/project/pitrou/arrow/build/1.0.175
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391292#comment-16391292 ]

Antoine Pitrou commented on ARROW-1974:
---------------------------------------

The problem here is that {{FileReader::Impl::ReadTable}} creates a {{Table}} with a schema that has one more field than the number of physical columns. The underlying cause seems to be that {{ColumnIndicesToFieldIndices}} uses {{Group::FieldIndex}}, which looks up the field by name... Also, {{Group::Equals}} has somewhat surprising semantics (why doesn't {{GroupNode::FieldIndex(const Node& node)}} simply look up the node by pointer equality?).

> [Python] Segfault when working with Arrow tables with duplicate columns
>
> Key: ARROW-1974
> URL: https://issues.apache.org/jira/browse/ARROW-1974
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.8.0
> Environment: Linux Mint 18.2, Anaconda Python distribution + pyarrow installed from the conda-forge channel
> Reporter: Alexey Strokach
> Assignee: Phillip Cloud
> Priority: Minor
> Fix For: 0.9.0
>
> I accidentally created a large number of Parquet files with two __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting the resulting tables to Pandas DataFrames or when saving the tables to Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet')  # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
[jira] [Commented] (ARROW-2267) Rust bindings
[ https://issues.apache.org/jira/browse/ARROW-2267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391302#comment-16391302 ]

Joshua Howard commented on ARROW-2267:
--------------------------------------

I spent some time looking into the C++ implementation, and it seems like the initial steps should be to port the following objects to Rust:
# MemoryPool
# Buffer
# Builder
# Array

The biggest divergence from C++ that I see is the implementation of the memory pool. Implementing MemoryPool would require unsafe code in Rust (which is obviously undesirable). There is an issue open to allow modifying the memory alignment of structs: https://github.com/rust-lang/rust/issues/33626. I think it would be well worth skipping the memory alignment until this development is finished.

> Rust bindings
>
> Key: ARROW-2267
> URL: https://issues.apache.org/jira/browse/ARROW-2267
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Rust
> Reporter: Joshua Howard
> Priority: Major
>
> Provide Rust bindings for Arrow.
[jira] [Updated] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-1974:
----------------------------------

Labels: pull-request-available (was: )
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391353#comment-16391353 ]

ASF GitHub Bot commented on ARROW-1974:
---------------------------------------

pitrou opened a new pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391355#comment-16391355 ]

Antoine Pitrou commented on ARROW-1974:
---------------------------------------

With https://github.com/apache/parquet-cpp/pull/447, the {{to_pandas()}} call will fail with the following error:

{code:python}
  File "table.pxi", line 1059, in pyarrow.lib.Table.to_pandas
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 611, in table_to_blockmanager
    columns = _flatten_single_level_multiindex(columns)
  File "/home/antoine/arrow/python/pyarrow/pandas_compat.py", line 673, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
{code}
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391413#comment-16391413 ]

ASF GitHub Bot commented on ARROW-1974:
---------------------------------------

cpcloud commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371525784

Thanks for doing this. Will review shortly.
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391418#comment-16391418 ]

ASF GitHub Bot commented on ARROW-1974:
---------------------------------------

pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371527191

Unfortunately this doesn't seem sufficient. If I add the following test, I get an error and a crash:

```diff
diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d4..eb5a8ec 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
   }
 }
 
+TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
+  using ::arrow::ArrayFromVector;
+
+  auto f0 = field("duplicate", ::arrow::int8());
+  auto f1 = field("duplicate", ::arrow::int16());
+  auto schema = ::arrow::schema({f0, f1});
+
+  std::vector<int8_t> a0_values = {1, 2, 3};
+  std::vector<int16_t> a1_values = {14, 15, 16};
+
+  std::shared_ptr<Array> a0, a1;
+
+  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
+  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
+
+  auto table = Table::Make(schema,
+                           {std::make_shared<Column>(f0->name(), a0),
+                            std::make_shared<Column>(f1->name(), a1)});
+  CheckSimpleRoundtrip(table, table->num_rows());
+}
+
 TEST(TestArrowWrite, CheckChunkSize) {
   const int num_columns = 2;
   const int num_rows = 128;
```
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391449#comment-16391449 ] ASF GitHub Bot commented on ARROW-1974: --- pitrou commented on issue #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-371534052 Ok, the reason for the error is that a similar pattern needs fixing in `SchemaDescriptor`. Updating shortly. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Phillip Cloud >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. 
> {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
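The workaround quoted above drops the offending column by index before converting. A caller-side guard can detect the condition up front; this is an illustrative sketch, not part of the pyarrow API, and the helper name is invented:

```python
from collections import Counter

def duplicate_column_names(names):
    """Return the column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]

# A schema with two __index_level_0__ columns, as in the report:
dupes = duplicate_column_names(['a', '__index_level_0__', 'b', '__index_level_0__'])
assert dupes == ['__index_level_0__']
```

Columns flagged this way can then be removed with table.remove_column(i), as in the reporter's workaround, before attempting to_pandas() or write_table().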
[jira] [Commented] (ARROW-2239) [C++] Update build docs for Windows
[ https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391505#comment-16391505 ] ASF GitHub Bot commented on ARROW-2239: --- wesm closed pull request #1722: ARROW-2239: [C++] Update Windows build docs URL: https://github.com/apache/arrow/pull/1722 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/apidoc/Windows.md b/cpp/apidoc/Windows.md index dae5040c2..965369521 100644 --- a/cpp/apidoc/Windows.md +++ b/cpp/apidoc/Windows.md @@ -44,9 +44,8 @@ Now, you can bootstrap a build environment conda create -n arrow-dev cmake git boost-cpp flatbuffers rapidjson cmake thrift-cpp snappy zlib brotli gflags lz4-c zstd -c conda-forge ``` -***Note:*** -> *Make sure to get the `conda-forge` build of `gflags` as the - naming of the library differs from that in the `defaults` channel* +> **Note:** Make sure to get the `conda-forge` build of `gflags` as the +> naming of the library differs from that in the `defaults` channel. Activate just created conda environment with pre-installed packages from previous step: @@ -116,52 +115,85 @@ zstd%ZSTD_SUFFIX%.lib. ### Visual Studio Microsoft provides the free Visual Studio Community edition. 
When doing -development, you must launch the developer command prompt using +development, you must launch the developer command prompt using: Visual Studio 2015 -```"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64``` +``` +"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" amd64 +``` Visual Studio 2017 -```"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64``` +``` +"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64 +``` It's easiest to configure a console emulator like [cmder][3] to automatically launch this when starting a new development console. +## Building with Ninja and clcache + +We recommend the [Ninja](https://ninja-build.org/) build system for better +build parallelization, and the optional +[clcache](https://github.com/frerich/clcache/) compiler cache which keeps +track of past compilations to avoid running them over and over again +(in a way similar to the Unix-specific "ccache"). + +Activate your conda build environment to first install those utilities: + +```shell +activate arrow-dev + +conda install -c conda-forge ninja +pip install git+https://github.com/frerich/clcache.git +``` + +Change working directory in cmd.exe to the root directory of Arrow and +do an out of source build by generating Ninja files: + +```shell +cd cpp +mkdir build +cd build +cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release .. +cmake --build . --config Release +``` + ## Building with NMake Activate your conda build environment: -``` +```shell activate arrow-dev ``` Change working directory in cmd.exe to the root directory of Arrow and do an out of source build using `nmake`: -``` +```shell cd cpp mkdir build cd build cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release .. +cmake --build . --config Release nmake ``` When using conda, only release builds are currently supported. 
-## Build using Visual Studio (MSVC) Solution Files +## Building using Visual Studio (MSVC) Solution Files Activate your conda build environment: -``` +```shell activate arrow-dev ``` Change working directory in cmd.exe to the root directory of Arrow and do an out of source build by generating a MSVC solution: -``` +```shell cd cpp mkdir build cd build @@ -171,10 +203,11 @@ cmake --build . --config Release ## Debug build -To build Debug version of Arrow you should have pre-insalled Debug version of -boost libs. +To build Debug version of Arrow you should have pre-installed a Debug version +of boost libs. -It's recommended to configure cmake build with following variables for Debug build: +It's recommended to configure cmake build with the following variables for +Debug build: `-DARROW_BOOST_USE_SHARED=OFF` - enables static linking with boost debug libs and simplifies run-time loading of 3rd parties. (Recommended) @@ -185,7 +218,7 @@ simplifies run-time loading of 3rd parties. (Recommended) Command line to build Arrow in Debug might look as following: -``` +```shell cd cpp mkdir build cd build This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [C++] Upd
[jira] [Resolved] (ARROW-2239) [C++] Update build docs for Windows
[ https://issues.apache.org/jira/browse/ARROW-2239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2239. - Resolution: Fixed Issue resolved by pull request 1722 [https://github.com/apache/arrow/pull/1722] > [C++] Update build docs for Windows > --- > > Key: ARROW-2239 > URL: https://issues.apache.org/jira/browse/ARROW-2239 > Project: Apache Arrow > Issue Type: Task > Components: C++, Documentation >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > We should update the C++ build docs for Windows to recommend use of Ninja and > clcache for faster builds. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2289: -- Labels: pull-request-available (was: ) > [GLib] Add Numeric, Integer and FloatingPoint data types > - > > Key: ARROW-2289 > URL: https://issues.apache.org/jira/browse/ARROW-2289 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Affects Versions: 0.8.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391540#comment-16391540 ] ASF GitHub Bot commented on ARROW-2289: --- kou opened a new pull request #1726: ARROW-2289: [GLib] Add Numeric, Integer, FloatingPoint data types URL: https://github.com/apache/arrow/pull/1726 They are useful to detect numeric data types. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [GLib] Add Numeric, Integer and FloatingPoint data types > - > > Key: ARROW-2289 > URL: https://issues.apache.org/jira/browse/ARROW-2289 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Affects Versions: 0.8.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
Kouhei Sutou created ARROW-2289: --- Summary: [GLib] Add Numeric, Integer and FloatingPoint data types Key: ARROW-2289 URL: https://issues.apache.org/jira/browse/ARROW-2289 Project: Apache Arrow Issue Type: Improvement Components: GLib Affects Versions: 0.8.0 Reporter: Kouhei Sutou Assignee: Kouhei Sutou Fix For: 0.9.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support
[ https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2038: Fix Version/s: (was: 0.9.0) 0.10.0 > [Python] Follow-up bug fixes for s3fs Parquet support > - > > Key: ARROW-2038 > URL: https://issues.apache.org/jira/browse/ARROW-2038 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > see discussion in > https://github.com/apache/arrow/pull/916#issuecomment-360558248 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1975) [C++] Add abi-compliance-checker to build process
[ https://issues.apache.org/jira/browse/ARROW-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1975: Fix Version/s: (was: 0.9.0) 0.10.0 > [C++] Add abi-compliance-checker to build process > - > > Key: ARROW-1975 > URL: https://issues.apache.org/jira/browse/ARROW-1975 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Fix For: 0.10.0 > > > I would like to check our baseline modules with > https://lvc.github.io/abi-compliance-checker/ to ensure that version upgrades > are much smoother and that we don‘t break the ABI in patch releases. > As we‘re pre-1.0 yet, I accept that there will be breakage but I would like > to keep them to a minimum. Currently the biggest pain with Arrow is you need > to pin it in Python always with {{==0.x.y}}, otherwise segfaults are > inevitable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types
[ https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1988: Fix Version/s: (was: 0.9.0) 0.10.0 > [Python] Extend flavor=spark in Parquet writing to handle INT types > --- > > Key: ARROW-1988 > URL: https://issues.apache.org/jira/browse/ARROW-1988 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Fix For: 0.10.0 > > > See the relevant code sections at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139 > We should cater for them in the {{pyarrow}} code and also reach out to Spark > developers so that they are supported there in the longterm. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2014) [Python] Document read_pandas method in pyarrow.parquet
[ https://issues.apache.org/jira/browse/ARROW-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2014: Fix Version/s: (was: 0.9.0) 0.10.0 > [Python] Document read_pandas method in pyarrow.parquet > --- > > Key: ARROW-2014 > URL: https://issues.apache.org/jira/browse/ARROW-2014 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Phillip Cloud >Priority: Major > Fix For: 0.10.0 > > > see discussion in https://github.com/apache/arrow/issues/1302 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1454) [Python] More informative error message when attempting to write an unsupported Arrow type to Parquet format
[ https://issues.apache.org/jira/browse/ARROW-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1454: Fix Version/s: (was: 0.9.0) 0.10.0 > [Python] More informative error message when attempting to write an > unsupported Arrow type to Parquet format > > > Key: ARROW-1454 > URL: https://issues.apache.org/jira/browse/ARROW-1454 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > See https://github.com/pandas-dev/pandas/issues/17102#issuecomment-326746184 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1974: --- Assignee: Antoine Pitrou (was: Phillip Cloud) > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2256) [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos
[ https://issues.apache.org/jira/browse/ARROW-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2256: Fix Version/s: (was: 0.9.0) 0.10.0 > [C++] Fuzzer builds fail out of the box on Ubuntu 16.04 using LLVM apt repos > > > Key: ARROW-2256 > URL: https://issues.apache.org/jira/browse/ARROW-2256 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > I did a clean upgrade to 16.04 on one of my machine and ran into the problem > described here: > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866087 > I think this can be resolved temporarily by symlinking the static library, > but we should document the problem so other devs know what to do when it > happens -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2263: Fix Version/s: (was: 0.9.0) 0.10.0 > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0 > rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg > collected 1 item > > pyarrow/tests/test_cython.py F > [100%] > == FAILURES > === > ___ test_cython_api > ___ > tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0') > @pytest.mark.skipif( > 'ARROW_HOME' not in os.environ, > reason='ARROW_HOME environment variable not defined') > def test_cython_api(tmpdir): > """ > Basic test for the Cython API. 
> """ > pytest.importorskip('Cython') > > ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib') > > test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default) > > with tmpdir.as_cwd(): > # Set up temporary workspace > pyx_file = 'pyarrow_cython_example.pyx' > shutil.copyfile(os.path.join(here, pyx_file), > os.path.join(str(tmpdir), pyx_file)) > # Create setup.py file > if os.name == 'posix': > compiler_opts = ['-std=c++11'] > else: > compiler_opts = [] > setup_code = setup_template.format(pyx_file=pyx_file, >compiler_opts=compiler_opts, >test_ld_path=test_ld_path) > with open('setup.py', 'w') as f: > f.write(setup_code) > > # Compile extension module > subprocess.check_call([sys.executable, 'setup.py', > > 'build_ext', '--inplace']) > pyarrow/tests/test_cython.py:90: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ > popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'],) > kwargs = {}, retcode = 1 > cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'] > def check_call(*popenargs, **kwargs): > """Run command with arguments. Wait for command to complete. If > the exit code was zero then return, otherwise raise > CalledProcessError. The CalledProcessError object will have the > return code in the returncode attribute. > > The arguments are the same as for the call function. Example: > > check_call(["ls", "-l"]) > """ > retcode = call(*popenargs, **kwargs) > if retcode: > cmd = kwargs.get("args") > if cmd is None: > cmd = popenargs[0] > > raise CalledProcessError(retcode, cmd) > E subprocess.CalledProcessError: Command > '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', > '--inplace']' returned non-zero exit status 1. 
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: > CalledProcessError > Captured stderr call > - > Traceback (most recent call last): > File "setup.py", line 7, in > import pyarrow as pa > ModuleNotFoundError: No module named 'pyarrow' > == 1 failed in 0.23 seconds > === > {code} > I encountered this bit of brittleness in a fresh install where I had not run > {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area -- This message was sent by Atlassian JIRA (v7.6.3#76005)
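The `ModuleNotFoundError` above comes from the `setup.py` subprocess not inheriting an import path that contains the in-place pyarrow build. One way to harden such a test is to hand the child an explicit `PYTHONPATH`; the helper below is a sketch (its name is illustrative, not part of the test suite):

```python
import os

def env_with_pythonpath(extra_path, base_env=None):
    """Copy of an environment mapping with extra_path prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get('PYTHONPATH', '')
    env['PYTHONPATH'] = extra_path + (os.pathsep + existing if existing else '')
    return env

# Intended use (not executed here):
# subprocess.check_call([sys.executable, 'setup.py', 'build_ext', '--inplace'],
#                       env=env_with_pythonpath('/path/to/arrow/python'))
```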
[jira] [Commented] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support
[ https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391805#comment-16391805 ] Wes McKinney commented on ARROW-2038: - Moving this to 0.10.0, but please feel free to look sooner > [Python] Follow-up bug fixes for s3fs Parquet support > - > > Key: ARROW-2038 > URL: https://issues.apache.org/jira/browse/ARROW-2038 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > see discussion in > https://github.com/apache/arrow/pull/916#issuecomment-360558248 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1974: --- Assignee: Antoine Pitrou (was: Wes McKinney) > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1974: --- Assignee: Wes McKinney (was: Antoine Pitrou) > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1425: --- Assignee: Wes McKinney (was: Li Jin) > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
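The hazard described in ARROW-1425 — one naive timestamp read as session-local by Spark but as UTC-like by a consumer of toPandas() — can be made concrete with the standard library alone. The session offset below is an arbitrary example, not anything Spark or Arrow prescribes:

```python
from datetime import datetime, timezone, timedelta

# A naive timestamp that a Spark session at UTC-5 would treat as local time.
naive = datetime(2018, 3, 8, 12, 0, 0)
session_tz = timezone(timedelta(hours=-5))

# Spark's reading (session-local), normalized to UTC:
as_session_local = naive.replace(tzinfo=session_tz).astimezone(timezone.utc)
# A naive reader's assumption (fields taken as UTC):
as_utc = naive.replace(tzinfo=timezone.utc)

# The two interpretations of the same stored value differ by the session offset.
assert as_utc - as_session_local == timedelta(hours=-5)
```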
[jira] [Assigned] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2289: --- Assignee: Wes McKinney (was: Kouhei Sutou) > [GLib] Add Numeric, Integer and FloatingPoint data types > - > > Key: ARROW-2289 > URL: https://issues.apache.org/jira/browse/ARROW-2289 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Affects Versions: 0.8.0 >Reporter: Kouhei Sutou >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1425: --- Assignee: Li Jin (was: Wes McKinney) > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Li Jin >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1996) [Python] pyarrow.read_serialized cannot read concatenated records
[ https://issues.apache.org/jira/browse/ARROW-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1996: --- Assignee: Antoine Pitrou (was: Wes McKinney) > [Python] pyarrow.read_serialized cannot read concatenated records > - > > Key: ARROW-1996 > URL: https://issues.apache.org/jira/browse/ARROW-1996 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux >Reporter: Richard Shin >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The following code > {quote}import pyarrow as pa > f = pa.OSFile('arrow_test', 'w') > pa.serialize_to(12, f) > pa.serialize_to(23, f) > f.close() > f = pa.OSFile('arrow_test', 'r') > print(pa.read_serialized(f).deserialize()) > print(pa.read_serialized(f).deserialize()) > f.close() > {quote} > gives the following result: > {quote}$ python pyarrow_test.py > First: 12 > Traceback (most recent call last): > File "pyarrow_test.py", line 10, in > print('Second: {}'.format(pa.read_serialized(f).deserialize())) > File "pyarrow/serialization.pxi", line 347, in pyarrow.lib.read_serialized > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:79159) > File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8270) > pyarrow.lib.ArrowInvalid: Expected schema message in stream, was null or > length 0 > {quote} > I would have expected read_serialized to successfully read the second value. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1996) [Python] pyarrow.read_serialized cannot read concatenated records
[ https://issues.apache.org/jira/browse/ARROW-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1996: --- Assignee: Wes McKinney (was: Antoine Pitrou) > [Python] pyarrow.read_serialized cannot read concatenated records > - > > Key: ARROW-1996 > URL: https://issues.apache.org/jira/browse/ARROW-1996 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Linux >Reporter: Richard Shin >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The following code > {quote}import pyarrow as pa > f = pa.OSFile('arrow_test', 'w') > pa.serialize_to(12, f) > pa.serialize_to(23, f) > f.close() > f = pa.OSFile('arrow_test', 'r') > print(pa.read_serialized(f).deserialize()) > print(pa.read_serialized(f).deserialize()) > f.close() > {quote} > gives the following result: > {quote}$ python pyarrow_test.py > First: 12 > Traceback (most recent call last): > File "pyarrow_test.py", line 10, in > print('Second: {}'.format(pa.read_serialized(f).deserialize())) > File "pyarrow/serialization.pxi", line 347, in pyarrow.lib.read_serialized > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:79159) > File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status > (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:8270) > pyarrow.lib.ArrowInvalid: Expected schema message in stream, was null or > length 0 > {quote} > I would have expected read_serialized to successfully read the second value. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
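The ARROW-1996 reporter expected back-to-back serialize_to() payloads to be readable in sequence. Until that works natively, a generic caller-side workaround is to frame each payload with an explicit length prefix. The sketch below is independent of pyarrow's wire format, and the function names are illustrative:

```python
import io
import struct

def write_framed(f, payload: bytes) -> None:
    """Write payload preceded by an 8-byte little-endian length prefix."""
    f.write(struct.pack('<Q', len(payload)))
    f.write(payload)

def read_framed(f):
    """Read back one length-prefixed payload; returns None at end of stream."""
    header = f.read(8)
    if len(header) < 8:
        return None
    (length,) = struct.unpack('<Q', header)
    return f.read(length)

buf = io.BytesIO()
write_framed(buf, b'record one')
write_framed(buf, b'record two')
buf.seek(0)
assert read_framed(buf) == b'record one'
assert read_framed(buf) == b'record two'
assert read_framed(buf) is None
```

With pyarrow, each payload would be the bytes of one serialized object (for example via pa.serialize(obj).to_buffer()), deserialized again on the reading side — so each record's boundary no longer depends on the stream format supporting concatenation.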
[jira] [Assigned] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2289: --- Assignee: Kouhei Sutou (was: Wes McKinney) > [GLib] Add Numeric, Integer and FloatingPoint data types > - > > Key: ARROW-2289 > URL: https://issues.apache.org/jira/browse/ARROW-2289 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Affects Versions: 0.8.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps
[ https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-1425: --- Assignee: Li Jin (was: Heimir Thor Sverrisson) > [Python] Document semantic differences between Spark timestamps and Arrow > timestamps > > > Key: ARROW-1425 > URL: https://issues.apache.org/jira/browse/ARROW-1425 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Li Jin >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The way that Spark treats non-timezone-aware timestamps as session local can > be problematic when using pyarrow which may view the data coming from > toPandas() as time zone naive (but with fields as though it were UTC, not > session local). We should document carefully how to properly handle the data > coming from Spark to avoid problems. > cc [~bryanc] [~holdenkarau] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391917#comment-16391917 ] ASF GitHub Bot commented on ARROW-1974: --- cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173292620 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: Couple of questions: * I see [this language regarding the iteration order](http://en.cppreference.com/w/cpp/container/unordered_multimap) of the values for a particular key in the multimap: > every group of elements whose keys compare equivalent (compare equal with key_eq() as the comparator) forms a contiguous subrange in the iteration order Does the `iteration order` here mean that the values are iterated over in the order in which they were inserted? * Why did you choose to return the first one instead of returning `-1` (or maybe `-2`) for the `std::string` overload? Do we not want to provide a way to indicate that column indexes and column names are not 1:1 in the C++ API? Maybe that already exists. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I accidentally created a large number of Parquet files with two > __index_level_0__ columns (through a Spark SQL query). > PyArrow can read these files into tables, but it segfaults when converting > the resulting tables to Pandas DataFrames or when saving the tables to > Parquet files. > {code:none} > # Duplicate columns cause segmentation faults > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.to_pandas() # Segmentation fault > pq.write_table(table, '/some/output.parquet') # Segmentation fault > {code} > If I remove the duplicate column using table.remove_column(...) everything > works without segfaults. > {code:none} > # After removing duplicate columns, everything works fine > table = pq.read_table('/path/to/duplicate_column_file.parquet') > table.remove_column(34) > table.to_pandas() # OK > pq.write_table(table, '/some/output.parquet') # OK > {code} > For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` > here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
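Until the fix lands, a defensive check along these lines (plain Python, not part of the pyarrow API) can catch the duplicate-column situation described in the issue before a table is handed to `to_pandas()`:

```python
from collections import Counter

def duplicate_columns(names):
    """Return the column names that occur more than once."""
    return sorted(n for n, c in Counter(names).items() if c > 1)

# e.g. a schema that ended up with two __index_level_0__ columns
names = ["a", "__index_level_0__", "b", "__index_level_0__"]
assert duplicate_columns(names) == ["__index_level_0__"]
```

Any flagged column can then be dropped with `table.remove_column(...)` as shown in the issue description.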
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391924#comment-16391924 ] ASF GitHub Bot commented on ARROW-1974: --- pitrou commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173294153 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: 1) That's a good point. The fact that the container is unordered means it isn't guaranteed to retain insertion order, even for values which map to the same key (I would expect a straightforward implementation to maintain that order, though). I should probably remove the sentence above. 2) Because doing otherwise seems like it could break compatibility. Not sure how strongly you feel about it. The `std::string` overloads aren't used anymore in the parquet-cpp codebase, AFAICT. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1974) [Python] Segfault when working with Arrow tables with duplicate columns
[ https://issues.apache.org/jira/browse/ARROW-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391948#comment-16391948 ] ASF GitHub Bot commented on ARROW-1974: --- cpcloud commented on a change in pull request #447: ARROW-1974: Fix creating Arrow table with duplicate column names URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173300635 ## File path: src/parquet/schema.h ## @@ -264,8 +264,11 @@ class PARQUET_EXPORT GroupNode : public Node { bool Equals(const Node* other) const override; NodePtr field(int i) const { return fields_[i]; } + // Get the index of a field by its name, or negative value if not found + // If several fields share the same name, the smallest index is returned Review comment: > it could break compatibility True, though IIUC wouldn't this potentially segfault if you tried to use the result to index into something? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Segfault when working with Arrow tables with duplicate columns > --- > > Key: ARROW-1974 > URL: https://issues.apache.org/jira/browse/ARROW-1974 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 0.8.0 > Environment: Linux Mint 18.2 > Anaconda Python distribution + pyarrow installed from the conda-forge channel >Reporter: Alexey Strokach >Assignee: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
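The "smallest index wins" lookup under discussion for the `GroupNode` name overload can be sketched in Python (the real code is C++, and the return-value convention is still being debated in the review above):

```python
def field_index(names, name):
    """Index of the first field called `name`, or -1 if absent.

    With duplicate names, the smallest index is returned, matching
    the behavior proposed for the C++ name-based overload.
    """
    for i, n in enumerate(names):
        if n == name:
            return i
    return -1

names = ["x", "y", "x"]
assert field_index(names, "x") == 0  # first of the duplicates
assert field_index(names, "z") == -1
```

A linear scan sidesteps the `unordered_multimap` iteration-order question entirely, at the cost of O(n) lookups.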
[jira] [Created] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec
Wes McKinney created ARROW-2290: --- Summary: [C++/Python] Add ability to set codec options for lz4 codec Key: ARROW-2290 URL: https://issues.apache.org/jira/browse/ARROW-2290 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Reporter: Wes McKinney The LZ4 library has many parameters; currently we do not expose these in C++ or Python -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2282) [Python] Create StringArray from buffers
[ https://issues.apache.org/jira/browse/ARROW-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392031#comment-16392031 ] ASF GitHub Bot commented on ARROW-2282: --- wesm commented on issue #1720: ARROW-2282: [Python] Create StringArray from buffers URL: https://github.com/apache/arrow/pull/1720#issuecomment-371648673 rebased This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Create StringArray from buffers > > > Key: ARROW-2282 > URL: https://issues.apache.org/jira/browse/ARROW-2282 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > While we will add more general-purpose functionality in > https://issues.apache.org/jira/browse/ARROW-2281, the interface is more > complicated than the constructor that explicitly states all arguments: > {{StringArray(int64_t length, const std::shared_ptr<Buffer>& value_offsets, > …}} > Thus I will also expose this explicit constructor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
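The explicit constructor takes a value-offsets buffer alongside the character data. The int32 offsets layout it expects can be sketched with the standard library; the buffer names below are illustrative, not part of the pyarrow API:

```python
import struct

values = ["foo", "ba", ""]

# UTF-8 character data, concatenated back to back.
data = "".join(values).encode("utf-8")

# int32 offsets: bytes offsets[i]..offsets[i+1] of `data` hold value i,
# so there is one more offset than there are values.
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v.encode("utf-8")))
offsets_buf = struct.pack("<%di" % len(offsets), *offsets)

assert offsets == [0, 3, 5, 5]          # the empty string is zero-width
assert len(offsets_buf) == 4 * len(offsets)
```

These two buffers (plus an optional null bitmap) are exactly what the explicit `StringArray` constructor consumes.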
[jira] [Commented] (ARROW-2290) [C++/Python] Add ability to set codec options for lz4 codec
[ https://issues.apache.org/jira/browse/ARROW-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392101#comment-16392101 ] Lawrence Chan commented on ARROW-2290: -- For what it's worth, this isn't lz4-specific; I just happen to be working with that at the moment. > [C++/Python] Add ability to set codec options for lz4 codec > --- > > Key: ARROW-2290 > URL: https://issues.apache.org/jira/browse/ARROW-2290 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Wes McKinney >Priority: Major > > The LZ4 library has many parameters; currently we do not expose these in C++ > or Python -- This message was sent by Atlassian JIRA (v7.6.3#76005)
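Arrow's C++ and Python APIs do not yet expose codec parameters, so as a stand-in, the standard library's zlib shows the kind of per-codec knob (here, a compression level) the ticket asks for; LZ4's parameters would be analogous:

```python
import zlib

payload = b"abc" * 10000

# The second argument is the codec option: trade speed for ratio.
fast = zlib.compress(payload, 1)   # favor speed
small = zlib.compress(payload, 9)  # favor compression ratio

# Both settings round-trip losslessly.
assert zlib.decompress(fast) == payload
assert zlib.decompress(small) == payload
assert len(small) < len(payload)
```

Exposing an equivalent option on Arrow's codec interface would let callers make the same trade-off per workload.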
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/8/18 11:46 PM: -- What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that latter approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. 
> [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124 ] Lawrence Chan commented on ARROW-300: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that latter approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
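The objection above to widening 8-bit values into INT32 and "hand waving it away with compression" can be made concrete with a stdlib sketch: the widened buffer is 4x larger before compression, and compression costs CPU on both the write and read paths even when it recovers most of the space:

```python
import struct
import zlib

vals = list(range(256)) * 100  # values that all fit in 8 bits

as_int8 = struct.pack("%dB" % len(vals), *vals)
as_int32 = struct.pack("<%di" % len(vals), *vals)

# The widened representation is exactly 4x larger uncompressed.
assert len(as_int32) == 4 * len(as_int8)

# Compression round-trips, but is not free to apply or undo.
assert zlib.decompress(zlib.compress(as_int32)) == as_int32
```

This is the motivation for a buffer-compression option in the IPC format itself rather than relying on a widened Parquet physical type.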
[jira] [Assigned] (ARROW-2181) [Python] Add concat_tables to API reference, add documentation on use
[ https://issues.apache.org/jira/browse/ARROW-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned ARROW-2181: --- Assignee: Bryan Cutler > [Python] Add concat_tables to API reference, add documentation on use > - > > Key: ARROW-2181 > URL: https://issues.apache.org/jira/browse/ARROW-2181 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Bryan Cutler >Priority: Major > Fix For: 0.9.0 > > > This omission of documentation was mentioned on the mailing list on February > 13. The documentation should illustrate the contrast between > {{Table.from_batches}} and {{concat_tables}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:00 AM: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. My current workaround uses a fixed length byte array but it's pretty clunky to do this efficiently, at least in the parquet-cpp implementation. There are maybe also some alignment concerns with that approach that I'm just ignoring right now. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. 
Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392262#comment-16392262 ] Wes McKinney commented on ARROW-300: We haven't done any work on this yet. I think the first step would be to propose additional metadata (in the Flatbuffers files) for record batches to indicate the style of compression. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-300) [Format] Add buffer compression option to IPC file format
[ https://issues.apache.org/jira/browse/ARROW-300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392124#comment-16392124 ] Lawrence Chan edited comment on ARROW-300 at 3/9/18 2:09 AM: - What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. I tried to hack it up with FixedLenByteArray but there are a slew of complications with that, not to mention alignment concerns etc. Anyways I'm happy to help on this, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. was (Author: llchan): What did we decide with this? Imho there's still a use case for compressed arrow files due to the limited storage types in parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand waving it away with compression. Happy to help, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders I can probably plug in more easily. > [Format] Add buffer compression option to IPC file format > - > > Key: ARROW-300 > URL: https://issues.apache.org/jira/browse/ARROW-300 > Project: Apache Arrow > Issue Type: New Feature > Components: Format >Reporter: Wes McKinney >Priority: Major > Fix For: 0.10.0 > > > It may be useful if data is to be sent over the wire to compress the data > buffers themselves as their being written in the file layout. > I would propose that we keep this extremely simple with a global buffer > compression setting in the file Footer. Probably only two compressors worth > supporting out of the box would be zlib (higher compression ratios) and lz4 > (better performance). > What does everyone think? 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2150) [Python] array equality defaults to identity
[ https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2150: --- Assignee: Wes McKinney > [Python] array equality defaults to identity > > > Key: ARROW-2150 > URL: https://issues.apache.org/jira/browse/ARROW-2150 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Fix For: 0.9.0 > > > I'm not sure this is deliberate, but it doesn't look very desirable to me: > {code} > >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) > False > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
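The reported behavior matches Python's default object semantics: a class that does not define `__eq__` falls back to identity comparison, which a minimal stand-in class (not pyarrow itself) reproduces:

```python
class FakeArray:
    """Stand-in for a pyarrow 0.8 Array: no custom __eq__ defined."""

    def __init__(self, values):
        self.values = values

a = FakeArray([1, 2, 3])
b = FakeArray([1, 2, 3])

assert (a == b) is False  # identity comparison, not value comparison
assert (a == a) is True
```

Defining `__eq__` (or raising `NotImplementedError`, as the linked PR does) avoids silently returning this misleading result.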
[jira] [Commented] (ARROW-2150) [Python] array equality defaults to identity
[ https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392323#comment-16392323 ] ASF GitHub Bot commented on ARROW-2150: --- wesm opened a new pull request #1729: ARROW-2150: [Python] Raise NotImplementedError when comparing with pyarrow.Array for now URL: https://github.com/apache/arrow/pull/1729 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] array equality defaults to identity > > > Key: ARROW-2150 > URL: https://issues.apache.org/jira/browse/ARROW-2150 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure this is deliberate, but it doesn't look very desirable to me: > {code} > >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) > False > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2150) [Python] array equality defaults to identity
[ https://issues.apache.org/jira/browse/ARROW-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2150: -- Labels: pull-request-available (was: ) > [Python] array equality defaults to identity > > > Key: ARROW-2150 > URL: https://issues.apache.org/jira/browse/ARROW-2150 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Antoine Pitrou >Assignee: Wes McKinney >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > > I'm not sure this is deliberate, but it doesn't look very desirable to me: > {code} > >>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32()) > False > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2263: Fix Version/s: (was: 0.10.0) 0.9.0 > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0 > rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg > collected 1 item > > pyarrow/tests/test_cython.py F > [100%] > == FAILURES > === > ___ test_cython_api > ___ > tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0') > @pytest.mark.skipif( > 'ARROW_HOME' not in os.environ, > reason='ARROW_HOME environment variable not defined') > def test_cython_api(tmpdir): > """ > Basic test for the Cython API. 
> """ > pytest.importorskip('Cython') > > ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib') > > test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default) > > with tmpdir.as_cwd(): > # Set up temporary workspace > pyx_file = 'pyarrow_cython_example.pyx' > shutil.copyfile(os.path.join(here, pyx_file), > os.path.join(str(tmpdir), pyx_file)) > # Create setup.py file > if os.name == 'posix': > compiler_opts = ['-std=c++11'] > else: > compiler_opts = [] > setup_code = setup_template.format(pyx_file=pyx_file, >compiler_opts=compiler_opts, >test_ld_path=test_ld_path) > with open('setup.py', 'w') as f: > f.write(setup_code) > > # Compile extension module > subprocess.check_call([sys.executable, 'setup.py', > > 'build_ext', '--inplace']) > pyarrow/tests/test_cython.py:90: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ > popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'],) > kwargs = {}, retcode = 1 > cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'] > def check_call(*popenargs, **kwargs): > """Run command with arguments. Wait for command to complete. If > the exit code was zero then return, otherwise raise > CalledProcessError. The CalledProcessError object will have the > return code in the returncode attribute. > > The arguments are the same as for the call function. Example: > > check_call(["ls", "-l"]) > """ > retcode = call(*popenargs, **kwargs) > if retcode: > cmd = kwargs.get("args") > if cmd is None: > cmd = popenargs[0] > > raise CalledProcessError(retcode, cmd) > E subprocess.CalledProcessError: Command > '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', > '--inplace']' returned non-zero exit status 1. 
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: > CalledProcessError > Captured stderr call > - > Traceback (most recent call last): > File "setup.py", line 7, in > import pyarrow as pa > ModuleNotFoundError: No module named 'pyarrow' > == 1 failed in 0.23 seconds > === > {code} > I encountered this bit of brittleness in a fresh install where I had not run > {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area -- This message was sent by Atlassian JIRA (v7.6.3#76005)
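The `ModuleNotFoundError` arises because the child `setup.py` process does not inherit an import path containing the inplace-built pyarrow. One way to address it is to extend `PYTHONPATH` for the child process; the source directory below is a placeholder, not the actual fix merged for this issue:

```python
import os

# Hypothetical location of an inplace pyarrow build.
pyarrow_src = "/path/to/arrow/python"

env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(
    p for p in [pyarrow_src, env.get("PYTHONPATH", "")] if p)

# The compile step would then be launched with this environment, e.g.:
# subprocess.check_call([sys.executable, "setup.py", "build_ext",
#                        "--inplace"], env=env)
assert env["PYTHONPATH"].startswith(pyarrow_src)
```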
[jira] [Assigned] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2263: --- Assignee: Wes McKinney > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2263: -- Labels: pull-request-available (was: ) > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392332#comment-16392332 ] ASF GitHub Bot commented on ARROW-2263: --- wesm commented on issue #1730: ARROW-2263: [Python] Prepend local pyarrow/ path to PYTHONPATH in test_cython.py URL: https://github.com/apache/arrow/pull/1730#issuecomment-371700652 This was bugging me -- turned out to be easy to fix. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0 > rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg > collected 1 item > > pyarrow/tests/test_cython.py F > [100%] > == FAILURES > === > ___ test_cython_api > ___ > tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0') > @pytest.mark.skipif( > 'ARROW_HOME' not in os.environ, > reason='ARROW_HOME environment variable not defined') > def test_cython_api(tmpdir): > """ > Basic test for the Cython API. 
> """ > pytest.importorskip('Cython') > > ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib') > > test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default) > > with tmpdir.as_cwd(): > # Set up temporary workspace > pyx_file = 'pyarrow_cython_example.pyx' > shutil.copyfile(os.path.join(here, pyx_file), > os.path.join(str(tmpdir), pyx_file)) > # Create setup.py file > if os.name == 'posix': > compiler_opts = ['-std=c++11'] > else: > compiler_opts = [] > setup_code = setup_template.format(pyx_file=pyx_file, >compiler_opts=compiler_opts, >test_ld_path=test_ld_path) > with open('setup.py', 'w') as f: > f.write(setup_code) > > # Compile extension module > subprocess.check_call([sys.executable, 'setup.py', > > 'build_ext', '--inplace']) > pyarrow/tests/test_cython.py:90: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ > popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'],) > kwargs = {}, retcode = 1 > cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'] > def check_call(*popenargs, **kwargs): > """Run command with arguments. Wait for command to complete. If > the exit code was zero then return, otherwise raise > CalledProcessError. The CalledProcessError object will have the > return code in the returncode attribute. > > The arguments are the same as for the call function. Example: > > check_call(["ls", "-l"]) > """ > retcode = call(*popenargs, **kwargs) > if retcode: > cmd = kwargs.get("args") > if cmd is None: > cmd = popenargs[0] > > raise CalledProcessError(retcode, cmd) > E subprocess.CalledProcessError: Command > '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', > '--inplace']' returned non-zero exit status 1. 
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: > CalledProcessError > Captured stderr call > - > Traceback (most recent call last): > File "setup.py", line
[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392330#comment-16392330 ] ASF GitHub Bot commented on ARROW-2263: --- wesm opened a new pull request #1730: ARROW-2263: [Python] Prepend local pyarrow/ path to PYTHONPATH in test_cython.py URL: https://github.com/apache/arrow/pull/1730 This was bugging me -- turned out to be easy to fix. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0 > rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg > collected 1 item > > pyarrow/tests/test_cython.py F > [100%] > == FAILURES > === > ___ test_cython_api > ___ > tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0') > @pytest.mark.skipif( > 'ARROW_HOME' not in os.environ, > reason='ARROW_HOME environment variable not defined') > def test_cython_api(tmpdir): > """ > Basic test for the Cython API. 
> """ > pytest.importorskip('Cython') > > ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib') > > test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default) > > with tmpdir.as_cwd(): > # Set up temporary workspace > pyx_file = 'pyarrow_cython_example.pyx' > shutil.copyfile(os.path.join(here, pyx_file), > os.path.join(str(tmpdir), pyx_file)) > # Create setup.py file > if os.name == 'posix': > compiler_opts = ['-std=c++11'] > else: > compiler_opts = [] > setup_code = setup_template.format(pyx_file=pyx_file, >compiler_opts=compiler_opts, >test_ld_path=test_ld_path) > with open('setup.py', 'w') as f: > f.write(setup_code) > > # Compile extension module > subprocess.check_call([sys.executable, 'setup.py', > > 'build_ext', '--inplace']) > pyarrow/tests/test_cython.py:90: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ > popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'],) > kwargs = {}, retcode = 1 > cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'] > def check_call(*popenargs, **kwargs): > """Run command with arguments. Wait for command to complete. If > the exit code was zero then return, otherwise raise > CalledProcessError. The CalledProcessError object will have the > return code in the returncode attribute. > > The arguments are the same as for the call function. Example: > > check_call(["ls", "-l"]) > """ > retcode = call(*popenargs, **kwargs) > if retcode: > cmd = kwargs.get("args") > if cmd is None: > cmd = popenargs[0] > > raise CalledProcessError(retcode, cmd) > E subprocess.CalledProcessError: Command > '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', > '--inplace']' returned non-zero exit status 1. 
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: > CalledProcessError > Captured stderr call > - > Traceback (most recent call last): > File "setup.py", line 7, in > im
[jira] [Assigned] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2268: --- Assignee: Wes McKinney > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-2268: -- Labels: pull-request-available (was: ) > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production
[ https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2167: --- Assignee: Wes McKinney > [C++] Building Orc extensions fails with the default > BUILD_WARNING_LEVEL=Production > --- > > Key: ARROW-2167 > URL: https://issues.apache.org/jira/browse/ARROW-2167 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Building orc_ep fails because there are a bunch of upstream warnings like not > providing {{override}} on virtual destructor subclasses, and using {{0}} as > the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is > {{Production}} which includes {{-Wall}} (all warnings as errors). > I see that there are different possible options for {{BUILD_WARNING_LEVEL}} > so it's possible for developers to deal with this issue. > It seems easier to let EPs build with whatever the default warning level is > for the project rather than force our defaults on those projects. > Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2268) Remove MD5 checksums from release process
[ https://issues.apache.org/jira/browse/ARROW-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392338#comment-16392338 ] ASF GitHub Bot commented on ARROW-2268: --- wesm opened a new pull request #1731: ARROW-2268: Drop usage of md5 checksums for source releases, verification scripts URL: https://github.com/apache/arrow/pull/1731 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Remove MD5 checksums from release process > - > > Key: ARROW-2268 > URL: https://issues.apache.org/jira/browse/ARROW-2268 > Project: Apache Arrow > Issue Type: Bug >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > The ASF has changed its release policy for signatures and checksums to > contraindicate the use of MD5 checksums: > http://www.apache.org/dev/release-distribution#sigs-and-sums. We should > remove this from our various release scripts prior to the 0.9.0 release -- This message was sent by Atlassian JIRA (v7.6.3#76005)
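The policy linked above deprecates MD5 in favor of stronger digests such as SHA-256/SHA-512. As an illustrative sketch only (the project's actual release scripts are not shown here and may invoke standard checksum utilities directly), producing a replacement digest for a release artifact in Python could look like:

```python
import hashlib

def checksum_file(path, algorithm="sha512", chunk_size=1 << 16):
    """Return the hex digest of a file, read in chunks so large
    release tarballs do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A release script would then write "<digest>  <filename>" to a
# .sha512 sidecar file instead of a .md5 one.
```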
[jira] [Commented] (ARROW-1535) [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system
[ https://issues.apache.org/jira/browse/ARROW-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392341#comment-16392341 ] Wes McKinney commented on ARROW-1535: - [~kou] in theory this should work now, but we should double check that things are still working on master > [Python] Enable sdist source tarballs to build assuming that Arrow C++ > libraries are available on the host system > - > > Key: ARROW-1535 > URL: https://issues.apache.org/jira/browse/ARROW-1535 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: Build, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production
[ https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2167. - Resolution: Won't Fix This seems to be fixed in https://github.com/apache/arrow/pull/1597. Both CHECKIN and PRODUCTION warning levels build fine now that we are using the same CMAKE_CXX_FLAGS for EPs -- there are some additional suppressions for ORC. I suggest we deal with this on a case-by-case basis going forward > [C++] Building Orc extensions fails with the default > BUILD_WARNING_LEVEL=Production > --- > > Key: ARROW-2167 > URL: https://issues.apache.org/jira/browse/ARROW-2167 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Building orc_ep fails because there are a bunch of upstream warnings like not > providing {{override}} on virtual destructor subclasses, and using {{0}} as > the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is > {{Production}} which includes {{-Wall}} (all warnings as errors). > I see that there are different possible options for {{BUILD_WARNING_LEVEL}} > so it's possible for developers to deal with this issue. > It seems easier to let EPs build with whatever the default warning level is > for the project rather than force our defaults on those projects. > Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Reopened] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production
[ https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-2167: - > [C++] Building Orc extensions fails with the default > BUILD_WARNING_LEVEL=Production > --- > > Key: ARROW-2167 > URL: https://issues.apache.org/jira/browse/ARROW-2167 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Building orc_ep fails because there are a bunch of upstream warnings like not > providing {{override}} on virtual destructor subclasses, and using {{0}} as > the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is > {{Production}} which includes {{-Wall}} (all warnings as errors). > I see that there are different possible options for {{BUILD_WARNING_LEVEL}} > so it's possible for developers to deal with this issue. > It seems easier to let EPs build with whatever the default warning level is > for the project rather than force our defaults on those projects. > Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2167) [C++] Building Orc extensions fails with the default BUILD_WARNING_LEVEL=Production
[ https://issues.apache.org/jira/browse/ARROW-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2167. - Resolution: Fixed > [C++] Building Orc extensions fails with the default > BUILD_WARNING_LEVEL=Production > --- > > Key: ARROW-2167 > URL: https://issues.apache.org/jira/browse/ARROW-2167 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.8.0 >Reporter: Phillip Cloud >Assignee: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Building orc_ep fails because there are a bunch of upstream warnings like not > providing {{override}} on virtual destructor subclasses, and using {{0}} as > the {{nullptr}} constant and the default {{BUILD_WARNING_LEVEL}} is > {{Production}} which includes {{-Wall}} (all warnings as errors). > I see that there are different possible options for {{BUILD_WARNING_LEVEL}} > so it's possible for developers to deal with this issue. > It seems easier to let EPs build with whatever the default warning level is > for the project rather than force our defaults on those projects. > Generally speaking, are we using our own CXX_FLAGS for EPs other than Orc? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2291) cpp README missing instructions for libboost-regex-dev
Andy Grove created ARROW-2291: - Summary: cpp README missing instructions for libboost-regex-dev Key: ARROW-2291 URL: https://issues.apache.org/jira/browse/ARROW-2291 Project: Apache Arrow Issue Type: Improvement Components: C++ Environment: Ubuntu 16.04 Reporter: Andy Grove After following the instructions in the README, I could not generate a makefile using CMake because of a missing dependency. The README needs to be updated to include installing libboost-regex-dev. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2291) cpp README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392366#comment-16392366 ] Andy Grove commented on ARROW-2291: --- Here is a PR to update the docs: https://github.com/apache/arrow/pull/1732 > cpp README missing instructions for libboost-regex-dev > -- > > Key: ARROW-2291 > URL: https://issues.apache.org/jira/browse/ARROW-2291 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: Ubuntu 16.04 >Reporter: Andy Grove >Priority: Trivial > > After following the instructions in the README, I could not generate a > makefile using CMake because of a missing dependency. > The README needs to be updated to include installing libboost-regex-dev. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2291) [C++] README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2291: Summary: [C++] README missing instructions for libboost-regex-dev (was: cpp README missing instructions for libboost-regex-dev) > [C++] README missing instructions for libboost-regex-dev > > > Key: ARROW-2291 > URL: https://issues.apache.org/jira/browse/ARROW-2291 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: Ubuntu 16.04 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > > After following the instructions in the README, I could not generate a > makefile using CMake because of a missing dependency. > The README needs to be updated to include installing libboost-regex-dev. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-2291) cpp README missing instructions for libboost-regex-dev
[ https://issues.apache.org/jira/browse/ARROW-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-2291: --- Assignee: Andy Grove > cpp README missing instructions for libboost-regex-dev > -- > > Key: ARROW-2291 > URL: https://issues.apache.org/jira/browse/ARROW-2291 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ > Environment: Ubuntu 16.04 >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Trivial > > After following the instructions in the README, I could not generate a > makefile using CMake because of a missing dependency. > The README needs to be updated to include installing libboost-regex-dev. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392372#comment-16392372 ] ASF GitHub Bot commented on ARROW-2263: --- wesm closed pull request #1730: ARROW-2263: [Python] Prepend local pyarrow/ path to PYTHONPATH in test_cython.py URL: https://github.com/apache/arrow/pull/1730 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/python/pyarrow/tests/test_cython.py b/python/pyarrow/tests/test_cython.py index df5e70ee7..57dbeb554 100644 --- a/python/pyarrow/tests/test_cython.py +++ b/python/pyarrow/tests/test_cython.py @@ -24,6 +24,7 @@ import pyarrow as pa +import pyarrow.tests.util as test_util here = os.path.dirname(os.path.abspath(__file__)) @@ -85,9 +86,14 @@ def test_cython_api(tmpdir): with open('setup.py', 'w') as f: f.write(setup_code) +# ARROW-2263: Make environment with this pyarrow/ package first on the +# PYTHONPATH, for local dev environments +subprocess_env = test_util.get_modified_env_with_pythonpath() + # Compile extension module subprocess.check_call([sys.executable, 'setup.py', - 'build_ext', '--inplace']) + 'build_ext', '--inplace'], + env=subprocess_env) # Check basic functionality orig_path = sys.path[:] diff --git a/python/pyarrow/tests/test_serialization.py b/python/pyarrow/tests/test_serialization.py index c17408457..64aab0671 100644 --- a/python/pyarrow/tests/test_serialization.py +++ b/python/pyarrow/tests/test_serialization.py @@ -28,6 +28,8 @@ import pyarrow as pa import numpy as np +import pyarrow.tests.util as test_util + try: import torch except ImportError: @@ -624,18 +626,6 @@ def deserialize_regex(serialized, q): p.join() -def _get_modified_env_with_pythonpath(): -# Prepend pyarrow root directory to PYTHONPATH -env = 
os.environ.copy() -existing_pythonpath = env.get('PYTHONPATH', '') - -module_path = os.path.abspath( -os.path.dirname(os.path.dirname(pa.__file__))) - -env['PYTHONPATH'] = os.pathsep.join((module_path, existing_pythonpath)) -return env - - def test_deserialize_buffer_in_different_process(): import tempfile import subprocess @@ -645,7 +635,7 @@ def test_deserialize_buffer_in_different_process(): f.write(b.to_pybytes()) f.close() -subprocess_env = _get_modified_env_with_pythonpath() +subprocess_env = test_util.get_modified_env_with_pythonpath() dir_path = os.path.dirname(os.path.realpath(__file__)) python_file = os.path.join(dir_path, 'deserialize_buffer.py') diff --git a/python/pyarrow/tests/util.py b/python/pyarrow/tests/util.py index a3ba9000c..8c8d23b0c 100644 --- a/python/pyarrow/tests/util.py +++ b/python/pyarrow/tests/util.py @@ -19,9 +19,12 @@ Utility functions for testing """ +import contextlib import decimal +import os import random -import contextlib + +import pyarrow as pa def randsign(): @@ -91,3 +94,15 @@ def randdecimal(precision, scale): return decimal.Decimal( '{}.{}'.format(whole, str(fractional).rjust(scale, '0')) ) + + +def get_modified_env_with_pythonpath(): +# Prepend pyarrow root directory to PYTHONPATH +env = os.environ.copy() +existing_pythonpath = env.get('PYTHONPATH', '') + +module_path = os.path.abspath( +os.path.dirname(os.path.dirname(pa.__file__))) + +env['PYTHONPATH'] = os.pathsep.join((module_path, existing_pythonpath)) +return env This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] test_cython.py fails if pyarrow is not in import path (e.g. 
with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-
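The merged patch quoted above centralizes a `get_modified_env_with_pythonpath` helper in `pyarrow/tests/util.py` so the `setup.py build_ext` subprocess can import an inplace pyarrow build. A standalone sketch of the same idea, parameterized by a module's file path instead of importing pyarrow (that parameter is an adaptation for illustration, not the upstream signature):

```python
import os

def get_modified_env_with_pythonpath(module_file):
    """Return a copy of os.environ with the directory containing the
    package of ``module_file`` (e.g. pa.__file__ for an inplace pyarrow
    build) prepended to PYTHONPATH, for use as a subprocess env."""
    env = os.environ.copy()
    existing_pythonpath = env.get('PYTHONPATH', '')
    # package dir is dirname(module_file); its parent is what goes on the path
    module_path = os.path.abspath(
        os.path.dirname(os.path.dirname(module_file)))
    env['PYTHONPATH'] = os.pathsep.join((module_path, existing_pythonpath))
    return env
```

The subprocess then resolves `import pyarrow` against the local source tree first, regardless of what is installed in the interpreter's site-packages.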
[jira] [Resolved] (ARROW-2263) [Python] test_cython.py fails if pyarrow is not in import path (e.g. with inplace builds)
[ https://issues.apache.org/jira/browse/ARROW-2263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2263. - Resolution: Fixed Issue resolved by pull request 1730 [https://github.com/apache/arrow/pull/1730] > [Python] test_cython.py fails if pyarrow is not in import path (e.g. with > inplace builds) > - > > Key: ARROW-2263 > URL: https://issues.apache.org/jira/browse/ARROW-2263 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > see > {code} > $ py.test pyarrow/tests/test_cython.py > = test session starts > = > platform linux -- Python 3.6.4, pytest-3.4.1, py-1.5.2, pluggy-0.6.0 > rootdir: /home/wesm/code/arrow/python, inifile: setup.cfg > collected 1 item > > pyarrow/tests/test_cython.py F > [100%] > == FAILURES > === > ___ test_cython_api > ___ > tmpdir = local('/tmp/pytest-of-wesm/pytest-3/test_cython_api0') > @pytest.mark.skipif( > 'ARROW_HOME' not in os.environ, > reason='ARROW_HOME environment variable not defined') > def test_cython_api(tmpdir): > """ > Basic test for the Cython API. 
> """ > pytest.importorskip('Cython') > > ld_path_default = os.path.join(os.environ['ARROW_HOME'], 'lib') > > test_ld_path = os.environ.get('PYARROW_TEST_LD_PATH', ld_path_default) > > with tmpdir.as_cwd(): > # Set up temporary workspace > pyx_file = 'pyarrow_cython_example.pyx' > shutil.copyfile(os.path.join(here, pyx_file), > os.path.join(str(tmpdir), pyx_file)) > # Create setup.py file > if os.name == 'posix': > compiler_opts = ['-std=c++11'] > else: > compiler_opts = [] > setup_code = setup_template.format(pyx_file=pyx_file, >compiler_opts=compiler_opts, >test_ld_path=test_ld_path) > with open('setup.py', 'w') as f: > f.write(setup_code) > > # Compile extension module > subprocess.check_call([sys.executable, 'setup.py', > > 'build_ext', '--inplace']) > pyarrow/tests/test_cython.py:90: > _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ > _ _ _ _ _ _ _ _ _ > popenargs = (['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'],) > kwargs = {}, retcode = 1 > cmd = ['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', > 'build_ext', '--inplace'] > def check_call(*popenargs, **kwargs): > """Run command with arguments. Wait for command to complete. If > the exit code was zero then return, otherwise raise > CalledProcessError. The CalledProcessError object will have the > return code in the returncode attribute. > > The arguments are the same as for the call function. Example: > > check_call(["ls", "-l"]) > """ > retcode = call(*popenargs, **kwargs) > if retcode: > cmd = kwargs.get("args") > if cmd is None: > cmd = popenargs[0] > > raise CalledProcessError(retcode, cmd) > E subprocess.CalledProcessError: Command > '['/home/wesm/miniconda/envs/arrow-dev/bin/python', 'setup.py', 'build_ext', > '--inplace']' returned non-zero exit status 1. 
> ../../../miniconda/envs/arrow-dev/lib/python3.6/subprocess.py:291: > CalledProcessError > Captured stderr call > - > Traceback (most recent call last): > File "setup.py", line 7, in > import pyarrow as pa > ModuleNotFoundError: No module named 'pyarrow' > == 1 failed in 0.23 seconds > === > {code} > I encountered this bit of brittleness in a fresh install where I had not run > {{setup.py develop}} nor {{setup.py install}} on my local pyarrow dev area -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-1940. - Resolution: Fixed Issue resolved by pull request 1728 [https://github.com/apache/arrow/pull/1728] > [Python] Extra metadata gets added after multiple conversions between > pd.DataFrame and pa.Table > --- > > Key: ARROW-1940 > URL: https://issues.apache.org/jira/browse/ARROW-1940 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Dima Ryazanov >Assignee: Phillip Cloud >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > Attachments: fail.py > > > We have a unit test that verifies that loading a dataframe from a .parq file > and saving it back with no changes produces the same result as the original > file. It started failing with pyarrow 0.8.0. > After digging into it, I discovered that after the first conversion from > pd.DataFrame to pa.Table, the table contains the following metadata (among > other things): > {code} > "column_indexes": [{"metadata": null, "field_name": null, "name": null, > "numpy_type": "object", "pandas_type": "bytes"}] > {code} > However, after converting it to pd.DataFrame and back into a pa.Table for the > second time, the metadata gets an encoding field: > {code} > "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, > "name": null, "numpy_type": "object", "pandas_type": "unicode"}] > {code} > See the attached file for a test case. > So specifically, it appears that dataframe->table->dataframe->table > conversion produces a different result from just dataframe->table - which I > think is unexpected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392375#comment-16392375 ] ASF GitHub Bot commented on ARROW-1940: --- wesm commented on a change in pull request #1728: ARROW-1940: [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table URL: https://github.com/apache/arrow/pull/1728#discussion_r173361703 ## File path: cpp/src/arrow/python/helpers.cc ## @@ -116,7 +116,8 @@ static Status InferDecimalPrecisionAndScale(PyObject* python_decimal, int32_t* p DCHECK_NE(scale, NULLPTR); // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a DCHECK - OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "")); + OwnedRef as_tuple(PyObject_CallMethod(python_decimal, const_cast<char*>("as_tuple"), +const_cast<char*>(""))); Review comment: see also the `cpp_PyObject_CallMethod` wrapper for this issue in io.cc This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] Extra metadata gets added after multiple conversions between > pd.DataFrame and pa.Table > --- > > Key: ARROW-1940 > URL: https://issues.apache.org/jira/browse/ARROW-1940 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Dima Ryazanov >Assignee: Phillip Cloud >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > Attachments: fail.py > > > We have a unit test that verifies that loading a dataframe from a .parq file > and saving it back with no changes produces the same result as the original > file. It started failing with pyarrow 0.8.0. 
> After digging into it, I discovered that after the first conversion from > pd.DataFrame to pa.Table, the table contains the following metadata (among > other things): > {code} > "column_indexes": [{"metadata": null, "field_name": null, "name": null, > "numpy_type": "object", "pandas_type": "bytes"}] > {code} > However, after converting it to pd.DataFrame and back into a pa.Table for the > second time, the metadata gets an encoding field: > {code} > "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, > "name": null, "numpy_type": "object", "pandas_type": "unicode"}] > {code} > See the attached file for a test case. > So specifically, it appears that dataframe->table->dataframe->table > conversion produces a different result from just dataframe->table - which I > think is unexpected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
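The bug quoted above comes down to a conversion that is not idempotent: converting dataframe->table once and converting dataframe->table->dataframe->table produce different column-index metadata. A minimal pure-Python sketch of the property the fix restores (the `normalize_column_index` helper here is hypothetical, not pyarrow's implementation):

```python
# Hypothetical sketch: metadata normalization should be idempotent.
# ARROW-1940 is a case where the second pass added an "encoding" field
# that the first pass did not, violating the assertion below.
def normalize_column_index(meta):
    """Normalize column-index metadata; applying it twice must equal once."""
    out = dict(meta)
    if out.get("pandas_type") == "unicode":
        # Record the encoding deterministically, keyed off the pandas type,
        # so a second pass produces the identical result.
        out["metadata"] = {"encoding": "UTF-8"}
    return out

meta = {"metadata": None, "field_name": None, "name": None,
        "numpy_type": "object", "pandas_type": "bytes"}
once = normalize_column_index(meta)
twice = normalize_column_index(once)
assert once == twice  # df->table->df->table must equal df->table
```

The attached fail.py exercises the real round trip; this sketch only states the invariant being tested.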
[jira] [Commented] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392377#comment-16392377 ] ASF GitHub Bot commented on ARROW-1940: --- wesm closed pull request #1728: ARROW-1940: [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table URL: https://github.com/apache/arrow/pull/1728 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index 429068dd1..13dcc4661 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -116,7 +116,8 @@ static Status InferDecimalPrecisionAndScale(PyObject* python_decimal, int32_t* p DCHECK_NE(scale, NULLPTR); // TODO(phillipc): Make sure we perform PyDecimal_Check(python_decimal) as a DCHECK - OwnedRef as_tuple(PyObject_CallMethod(python_decimal, "as_tuple", "")); + OwnedRef as_tuple(PyObject_CallMethod(python_decimal, const_cast("as_tuple"), +const_cast(""))); RETURN_IF_PYERROR(); DCHECK(PyTuple_Check(as_tuple.obj())); @@ -241,7 +242,8 @@ bool PyDecimal_Check(PyObject* obj) { bool PyDecimal_ISNAN(PyObject* obj) { DCHECK(PyDecimal_Check(obj)) << "obj is not an instance of decimal.Decimal"; - OwnedRef is_nan(PyObject_CallMethod(obj, "is_nan", "")); + OwnedRef is_nan( + PyObject_CallMethod(obj, const_cast("is_nan"), const_cast(""))); return PyObject_IsTrue(is_nan.obj()) == 1; } diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py index 0bc47fc0d..97ea51d7e 100644 --- a/python/pyarrow/pandas_compat.py +++ b/python/pyarrow/pandas_compat.py @@ -18,6 +18,7 @@ import ast import collections import json +import operator import re import pandas.core.internals as _int @@ -99,8 +100,8 @@ def 
get_logical_type(arrow_type): np.float32: 'float32', np.float64: 'float64', 'datetime64[D]': 'date', -np.str_: 'unicode', -np.bytes_: 'bytes', +np.unicode_: 'string' if not PY2 else 'unicode', +np.bytes_: 'bytes' if not PY2 else 'string', } @@ -615,6 +616,22 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1, def _backwards_compatible_index_name(raw_name, logical_name): +"""Compute the name of an index column that is compatible with older +versions of :mod:`pyarrow`. + +Parameters +-- +raw_name : str +logical_name : str + +Returns +--- +result : str + +Notes +- +* Part of :func:`~pyarrow.pandas_compat.table_to_blockmanager` +""" # Part of table_to_blockmanager pattern = r'^__index_level_\d+__$' if raw_name == logical_name and re.match(pattern, raw_name) is not None: @@ -623,8 +640,57 @@ def _backwards_compatible_index_name(raw_name, logical_name): return logical_name +_pandas_logical_type_map = { +'date': 'datetime64[D]', +'unicode': np.unicode_, +'bytes': np.bytes_, +'string': np.str_, +'empty': np.object_, +'mixed': np.object_, +} + + +def _pandas_type_to_numpy_type(pandas_type): +"""Get the numpy dtype that corresponds to a pandas type. + +Parameters +-- +pandas_type : str +The result of a call to pandas.lib.infer_dtype. + +Returns +--- +dtype : np.dtype +The dtype that corresponds to `pandas_type`. +""" +try: +return _pandas_logical_type_map[pandas_type] +except KeyError: +return np.dtype(pandas_type) + + def _reconstruct_columns_from_metadata(columns, column_indexes): -# Part of table_to_blockmanager +"""Construct a pandas MultiIndex from `columns` and column index metadata +in `column_indexes`. + +Parameters +-- +columns : List[pd.Index] +The columns coming from a pyarrow.Table +column_indexes : List[Dict[str, str]] +The column index metadata deserialized from the JSON schema metadata +in a :class:`~pyarrow.Table`. + +Returns +--- +result : MultiIndex +The index reconstructed using `column_indexes` metadata with levels of +the correct type. 
+ +Notes +- +* Part of :func:`~pyarrow.pandas_compat.table_to_blockmanager` +""" # Get levels and labels, and provide sane defaults if the index has a # single level to avoid if/else spaghetti. @@ -635,21 +701,28 @@ def _reconstruct_columns_from_metadata(columns, column_indexes): # Convert each level to the dtype provided in the metadata levels_dtypes = [ -(level, col_index.get('numpy_type', level.dtype)) +(level, col_index.get('pandas_type', str(level.
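The `_pandas_type_to_numpy_type` helper added in the diff above follows a common pattern: a fixed map of special-cased logical type names with a fallback for everything else. A simplified stdlib-only sketch (plain strings stand in for the numpy dtypes the real code returns):

```python
# Simplified analogue of pyarrow's _pandas_logical_type_map lookup:
# special-case a few logical names, fall back to treating the name
# as an already-valid dtype string. (Values here are illustrative.)
_logical_type_map = {
    'date': 'datetime64[D]',
    'unicode': 'str',
    'bytes': 'bytes',
    'empty': 'object',
    'mixed': 'object',
}

def pandas_type_to_storage_type(pandas_type):
    """Return the storage type name for a pandas logical type name."""
    try:
        return _logical_type_map[pandas_type]
    except KeyError:
        # Not special-cased: pass the name through unchanged.
        return pandas_type

assert pandas_type_to_storage_type('date') == 'datetime64[D]'
assert pandas_type_to_storage_type('int64') == 'int64'
```

The try/except form keeps the common dict hit cheap while still handling arbitrary dtype strings from `pandas.lib.infer_dtype`.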
[jira] [Updated] (ARROW-1940) [Python] Extra metadata gets added after multiple conversions between pd.DataFrame and pa.Table
[ https://issues.apache.org/jira/browse/ARROW-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1940: -- Labels: pull-request-available (was: ) > [Python] Extra metadata gets added after multiple conversions between > pd.DataFrame and pa.Table > --- > > Key: ARROW-1940 > URL: https://issues.apache.org/jira/browse/ARROW-1940 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 >Reporter: Dima Ryazanov >Assignee: Phillip Cloud >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > Attachments: fail.py > > > We have a unit test that verifies that loading a dataframe from a .parq file > and saving it back with no changes produces the same result as the original > file. It started failing with pyarrow 0.8.0. > After digging into it, I discovered that after the first conversion from > pd.DataFrame to pa.Table, the table contains the following metadata (among > other things): > {code} > "column_indexes": [{"metadata": null, "field_name": null, "name": null, > "numpy_type": "object", "pandas_type": "bytes"}] > {code} > However, after converting it to pd.DataFrame and back into a pa.Table for the > second time, the metadata gets an encoding field: > {code} > "column_indexes": [{"metadata": {"encoding": "UTF-8"}, "field_name": null, > "name": null, "numpy_type": "object", "pandas_type": "unicode"}] > {code} > See the attached file for a test case. > So specifically, it appears that dataframe->table->dataframe->table > conversion produces a different result from just dataframe->table - which I > think is unexpected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2289. - Resolution: Fixed Issue resolved by pull request 1726 [https://github.com/apache/arrow/pull/1726] > [GLib] Add Numeric, Integer and FloatingPoint data types > - > > Key: ARROW-2289 > URL: https://issues.apache.org/jira/browse/ARROW-2289 > Project: Apache Arrow > Issue Type: Improvement > Components: GLib >Affects Versions: 0.8.0 >Reporter: Kouhei Sutou >Assignee: Kouhei Sutou >Priority: Minor > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2289) [GLib] Add Numeric, Integer and FloatingPoint data types
[ https://issues.apache.org/jira/browse/ARROW-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392386#comment-16392386 ] ASF GitHub Bot commented on ARROW-2289: --- wesm closed pull request #1726: ARROW-2289: [GLib] Add Numeric, Integer, FloatingPoint data types URL: https://github.com/apache/arrow/pull/1726 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/c_glib/arrow-glib/basic-data-type.cpp b/c_glib/arrow-glib/basic-data-type.cpp index a5f7aed1b..82abfa35c 100644 --- a/c_glib/arrow-glib/basic-data-type.cpp +++ b/c_glib/arrow-glib/basic-data-type.cpp @@ -315,9 +315,39 @@ garrow_boolean_data_type_new(void) } +G_DEFINE_ABSTRACT_TYPE(GArrowNumericDataType,\ + garrow_numeric_data_type, \ + GARROW_TYPE_FIXED_WIDTH_DATA_TYPE) + +static void +garrow_numeric_data_type_init(GArrowNumericDataType *object) +{ +} + +static void +garrow_numeric_data_type_class_init(GArrowNumericDataTypeClass *klass) +{ +} + + +G_DEFINE_ABSTRACT_TYPE(GArrowIntegerDataType,\ + garrow_integer_data_type, \ + GARROW_TYPE_NUMERIC_DATA_TYPE) + +static void +garrow_integer_data_type_init(GArrowIntegerDataType *object) +{ +} + +static void +garrow_integer_data_type_class_init(GArrowIntegerDataTypeClass *klass) +{ +} + + G_DEFINE_TYPE(GArrowInt8DataType,\ garrow_int8_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_int8_data_type_init(GArrowInt8DataType *object) @@ -349,7 +379,7 @@ garrow_int8_data_type_new(void) G_DEFINE_TYPE(GArrowUInt8DataType,\ garrow_uint8_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_uint8_data_type_init(GArrowUInt8DataType *object) @@ -381,7 +411,7 @@ garrow_uint8_data_type_new(void) G_DEFINE_TYPE(GArrowInt16DataType,\ 
garrow_int16_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_int16_data_type_init(GArrowInt16DataType *object) @@ -413,7 +443,7 @@ garrow_int16_data_type_new(void) G_DEFINE_TYPE(GArrowUInt16DataType,\ garrow_uint16_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_uint16_data_type_init(GArrowUInt16DataType *object) @@ -445,7 +475,7 @@ garrow_uint16_data_type_new(void) G_DEFINE_TYPE(GArrowInt32DataType,\ garrow_int32_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_int32_data_type_init(GArrowInt32DataType *object) @@ -477,7 +507,7 @@ garrow_int32_data_type_new(void) G_DEFINE_TYPE(GArrowUInt32DataType,\ garrow_uint32_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_uint32_data_type_init(GArrowUInt32DataType *object) @@ -509,7 +539,7 @@ garrow_uint32_data_type_new(void) G_DEFINE_TYPE(GArrowInt64DataType,\ garrow_int64_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_int64_data_type_init(GArrowInt64DataType *object) @@ -541,7 +571,7 @@ garrow_int64_data_type_new(void) G_DEFINE_TYPE(GArrowUInt64DataType,\ garrow_uint64_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_INTEGER_DATA_TYPE) static void garrow_uint64_data_type_init(GArrowUInt64DataType *object) @@ -571,9 +601,24 @@ garrow_uint64_data_type_new(void) } +G_DEFINE_ABSTRACT_TYPE(GArrowFloatingPointDataType,\ + garrow_floating_point_data_type,\ + GARROW_TYPE_NUMERIC_DATA_TYPE) + +static void +garrow_floating_point_data_type_init(GArrowFloatingPointDataType *object) +{ +} + +static void +garrow_floating_point_data_type_class_init(GArrowFloatingPointDataTypeClass *klass) +{ +} + + G_DEFINE_TYPE(GArrowFloatDataType,\ garrow_float_data_type, \ - GARROW_TYPE_DATA_TYPE) + GARROW_TYPE_FLOATING_POINT_DATA_TYPE) static void garrow_float_data_type_init(GArrowFloatDataType *object) @@ -605,7 +650,7 @@ 
garrow_float_data_type_new(void) G
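The GLib change above reparents the concrete types: `GArrowInt8DataType` and friends now derive from abstract `GArrowIntegerDataType` -> `GArrowNumericDataType` instead of directly from `GArrowDataType`, and the float types gain a `GArrowFloatingPointDataType` parent. A rough Python analogue of the resulting hierarchy (class names mirror the GArrow* types loosely; this is not generated from the C code):

```python
from abc import ABC

# Abstract intermediate classes let callers test "is numeric" /
# "is integer" without enumerating every concrete width.
class DataType(ABC): pass
class FixedWidthDataType(DataType): pass
class NumericDataType(FixedWidthDataType): pass
class IntegerDataType(NumericDataType): pass
class FloatingPointDataType(NumericDataType): pass

class Int8DataType(IntegerDataType): pass
class FloatDataType(FloatingPointDataType): pass

assert issubclass(Int8DataType, NumericDataType)
assert issubclass(FloatDataType, NumericDataType)
assert not issubclass(Int8DataType, FloatingPointDataType)
```

In the C code the same effect is achieved with `G_DEFINE_ABSTRACT_TYPE` for the intermediates and by changing each concrete type's parent in `G_DEFINE_TYPE`.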
[jira] [Created] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer
Wes McKinney created ARROW-2292: --- Summary: [Python] More consistent / intuitive name for pyarrow.frombuffer Key: ARROW-2292 URL: https://issues.apache.org/jira/browse/ARROW-2292 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.9.0 Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could call {{from_buffer}} something like {{py_buffer}} instead? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer
[ https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2292: Description: Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could call {{frombuffer}} something like {{py_buffer}} instead? (was: Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could call {{from_buffer}} something like {{py_buffer}} instead?) > [Python] More consistent / intuitive name for pyarrow.frombuffer > > > Key: ARROW-2292 > URL: https://issues.apache.org/jira/browse/ARROW-2292 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could > call {{frombuffer}} something like {{py_buffer}} instead? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2292) [Python] More consistent / intuitive name for pyarrow.frombuffer
[ https://issues.apache.org/jira/browse/ARROW-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392389#comment-16392389 ] Wes McKinney commented on ARROW-2292: - cc [~pitrou] > [Python] More consistent / intuitive name for pyarrow.frombuffer > > > Key: ARROW-2292 > URL: https://issues.apache.org/jira/browse/ARROW-2292 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Fix For: 0.9.0 > > > Now that we have {{pyarrow.foreign_buffer}}, things are a bit odd. We could > call {{frombuffer}} something like {{py_buffer}} instead? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
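The naming tension discussed above is that wrapping an existing Python buffer-protocol object (`frombuffer`, proposed `py_buffer`) is a different operation from wrapping a foreign raw pointer (`foreign_buffer`). A stdlib analogue of the first operation, using `memoryview` rather than the pyarrow API:

```python
# "py_buffer"-style wrapping: a zero-copy view over any object that
# exposes the Python buffer protocol. No pyarrow involved here.
data = bytearray(b"arrow")
view = memoryview(data)

# Mutating the underlying object is visible through the view,
# proving no copy was made at wrap time.
data[0:1] = b"A"
assert bytes(view) == b"Arrow"
```

A `foreign_buffer`-style constructor, by contrast, takes a bare address and size plus a base object to keep alive, which is why the two deserve clearly distinct names.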
[jira] [Commented] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime
[ https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392395#comment-16392395 ] ASF GitHub Bot commented on ARROW-2270: --- wesm closed pull request #1714: ARROW-2270: [Python] Fix lifetime of ForeignBuffer base object URL: https://github.com/apache/arrow/pull/1714 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc index 801a32574..36c193dbf 100644 --- a/cpp/src/arrow/python/io.cc +++ b/cpp/src/arrow/python/io.cc @@ -216,5 +216,19 @@ Status PyOutputStream::Write(const void* data, int64_t nbytes) { return file_->Write(data, nbytes); } +// -- +// Foreign buffer + +Status PyForeignBuffer::Make(const uint8_t* data, int64_t size, PyObject* base, + std::shared_ptr<Buffer>* out) { + PyForeignBuffer* buf = new PyForeignBuffer(data, size, base); + if (buf == NULL) { +return Status::OutOfMemory("could not allocate foreign buffer object"); + } else { +*out = std::shared_ptr<Buffer>(buf); +return Status::OK(); + } +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/io.h b/cpp/src/arrow/python/io.h index 696055610..5c76fe9fe 100644 --- a/cpp/src/arrow/python/io.h +++ b/cpp/src/arrow/python/io.h @@ -81,6 +81,27 @@ class ARROW_EXPORT PyOutputStream : public io::OutputStream { // TODO(wesm): seekable output files +// A Buffer subclass that keeps a PyObject reference throughout its +// lifetime, such that the Python object is kept alive as long as the +// C++ buffer is still needed. +// Keeping the reference in a Python wrapper would be incorrect as +// the Python wrapper can get destroyed even though the wrapped C++ +// buffer is still alive (ARROW-2270). 
+class ARROW_EXPORT PyForeignBuffer : public Buffer { + public: + static Status Make(const uint8_t* data, int64_t size, PyObject* base, + std::shared_ptr<Buffer>* out); + + private: + PyForeignBuffer(const uint8_t* data, int64_t size, PyObject* base) + : Buffer(data, size) { +Py_INCREF(base); +base_.reset(base); + } + + OwnedRefNoGIL base_; +}; + } // namespace py } // namespace arrow diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index a71e92b0b..3db1a04b6 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -186,6 +186,7 @@ Tables and Record Batches column chunked_array + concat_tables ChunkedArray Column RecordBatch @@ -213,6 +214,7 @@ Input / Output and Shared Memory compress decompress frombuffer + foreign_buffer Buffer ResizableBuffer BufferReader diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 28ac98ea0..225dfd0b2 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -86,7 +86,7 @@ def parse_version(root): from pyarrow.lib import TimestampType # Buffers, allocation -from pyarrow.lib import (Buffer, ForeignBuffer, ResizableBuffer, compress, +from pyarrow.lib import (Buffer, ResizableBuffer, foreign_buffer, compress, decompress, allocate_buffer, frombuffer) from pyarrow.lib import (MemoryPool, total_allocated_bytes, diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 456fcca36..22c39a865 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -904,6 +904,11 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: @staticmethod CStatus FromPyObject(object obj, shared_ptr[CBuffer]* out) +cdef cppclass PyForeignBuffer(CBuffer): +@staticmethod +CStatus Make(const uint8_t* data, int64_t size, object base, + shared_ptr[CBuffer]* out) + cdef cppclass PyReadableFile(RandomAccessFile): PyReadableFile(object fo) diff --git a/python/pyarrow/io.pxi b/python/pyarrow/io.pxi index 
611c8a86d..15ecd0164 100644 --- a/python/pyarrow/io.pxi +++ b/python/pyarrow/io.pxi @@ -726,18 +726,6 @@ cdef class Buffer: return self.size -cdef class ForeignBuffer(Buffer): - -def __init__(self, addr, size, base): -cdef: -intptr_t c_addr = addr -int64_t c_size = size -self.base = base -cdef shared_ptr[CBuffer] buffer = make_shared[CBuffer]( -c_addr, c_size) -self.init( buffer) - - cdef class ResizableBuffer(Buffer): cdef void init_rz(self, const shared_ptr[CResizableBuffer]& buffer): @@ -861,6 +849,21 @@ def frombuffer(object obj): return pyarrow_wrap_buffer(buf) +def for
[jira] [Resolved] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime
[ https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-2270. - Resolution: Fixed Fix Version/s: 0.9.0 Issue resolved by pull request 1714 [https://github.com/apache/arrow/pull/1714] > [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer > lifetime > > > Key: ARROW-2270 > URL: https://issues.apache.org/jira/browse/ARROW-2270 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{ForeignBuffer}} keeps the reference to the Python base object in the Python > wrapper class, not in the C++ buffer instance, meaning if the C++ buffer gets > passed around but the Python wrapper gets destroyed, the reference to the > original Python base object will be released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2270) [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer lifetime
[ https://issues.apache.org/jira/browse/ARROW-2270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392393#comment-16392393 ] ASF GitHub Bot commented on ARROW-2270: --- wesm commented on issue #1714: ARROW-2270: [Python] Fix lifetime of ForeignBuffer base object URL: https://github.com/apache/arrow/pull/1714#issuecomment-371710696 I added this new function to the API documentation. Merging This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > [Python] ForeignBuffer doesn't tie Python object lifetime to C++ buffer > lifetime > > > Key: ARROW-2270 > URL: https://issues.apache.org/jira/browse/ARROW-2270 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > {{ForeignBuffer}} keeps the reference to the Python base object in the Python > wrapper class, not in the C++ buffer instance, meaning if the C++ buffer gets > passed around but the Python wrapper gets destroyed, the reference to the > original Python base object will be released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
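The idea behind the ARROW-2270 fix above can be sketched in pure Python (an assumed analogue, not the C++ code): the buffer object itself holds a strong reference to its base, so the memory owner stays alive for the buffer's whole lifetime, even after the wrapper that created it is gone.

```python
# Sketch of the lifetime-tying pattern: keep the owner referenced
# from the view object, not from some outer wrapper that may die first.
class ForeignBuffer:
    def __init__(self, base):
        self._base = base             # keeps the owner alive with us
        self._view = memoryview(base)

    def to_bytes(self):
        return bytes(self._view)

def make_view():
    owner = bytearray(b"payload")     # would be collected if unreferenced
    return ForeignBuffer(owner)       # the buffer pins `owner` via _base

buf = make_view()                     # `owner` is out of scope, but alive
assert buf.to_bytes() == b"payload"
```

In the merged C++ patch the same role is played by the `OwnedRefNoGIL base_` member holding a `Py_INCREF`'d reference inside the `Buffer` subclass.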
[jira] [Commented] (ARROW-1535) [Python] Enable sdist source tarballs to build assuming that Arrow C++ libraries are available on the host system
[ https://issues.apache.org/jira/browse/ARROW-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392403#comment-16392403 ] Kouhei Sutou commented on ARROW-1535: - I've confirmed that this works well on master:
{code}
% python3 setup.py sdist
% pip3 install dist/pyarrow-*.tar.gz
% python3 -c 'import pyarrow'
{code}
> [Python] Enable sdist source tarballs to build assuming that Arrow C++ > libraries are available on the host system > - > > Key: ARROW-1535 > URL: https://issues.apache.org/jira/browse/ARROW-1535 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Major > Labels: Build, pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)