[jira] [Updated] (ARROW-7102) Make filesystem wrappers compatible with fsspec
[ https://issues.apache.org/jira/browse/ARROW-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Augspurger updated ARROW-7102:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Make filesystem wrappers compatible with fsspec
> -----------------------------------------------
>
>                 Key: ARROW-7102
>                 URL: https://issues.apache.org/jira/browse/ARROW-7102
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Tom Augspurger
>            Priority: Major
>              Labels: FileSystem
>
> [fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a common API for a
> variety of filesystem implementations. I'm proposing an FSSpecWrapper, similar to
> S3FSWrapper, that works with any fsspec implementation.
>
> Right now, pyarrow has a pyarrow.filesystem.S3FSWrapper, which is specific to s3fs:
> [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320].
> This implementation could be removed entirely once an FSSpecWrapper is done,
> or kept as an alias if it's part of the public API.
>
> This is related to ARROW-3717, which requested a GCSFSWrapper for working
> with Google Cloud Storage.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7102) Make filesystem wrappers compatible with fsspec
Tom Augspurger created ARROW-7102:
-------------------------------------
             Summary: Make filesystem wrappers compatible with fsspec
                 Key: ARROW-7102
                 URL: https://issues.apache.org/jira/browse/ARROW-7102
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Tom Augspurger

[fsspec|https://filesystem-spec.readthedocs.io/en/latest/] defines a common API for a variety of filesystem implementations. I'm proposing an FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec implementation.

Right now, pyarrow has a pyarrow.filesystem.S3FSWrapper, which is specific to s3fs: [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320]. This implementation could be removed entirely once an FSSpecWrapper is done, or kept as an alias if it's part of the public API.

This is related to ARROW-3717, which requested a GCSFSWrapper for working with Google Cloud Storage.
[jira] [Updated] (ARROW-7102) Make filesystem wrappers compatible with fsspec
[ https://issues.apache.org/jira/browse/ARROW-7102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Augspurger updated ARROW-7102:
----------------------------------
    Description:
[fsspec|https://filesystem-spec.readthedocs.io/en/latest] defines a common API for a variety of filesystem implementations. I'm proposing an FSSpecWrapper, similar to S3FSWrapper, that works with any fsspec implementation.

Right now, pyarrow has a pyarrow.filesystem.S3FSWrapper, which is specific to s3fs: [https://github.com/apache/arrow/blob/21ad7ac1162eab188a1e15923fb1de5b795337ec/python/pyarrow/filesystem.py#L320]. This implementation could be removed entirely once an FSSpecWrapper is done, or kept as an alias if it's part of the public API.

This is related to ARROW-3717, which requested a GCSFSWrapper for working with Google Cloud Storage.
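The wrapper idea above can be sketched in a few lines. This is a hypothetical illustration, not the eventual pyarrow API: the point is that any object exposing the fsspec method names can be wrapped, so nothing is specific to s3fs. The class names and the tiny in-memory stand-in filesystem are assumptions for the example.

```python
import io


class FSSpecWrapper:
    """Hypothetical generic wrapper: delegates to any fsspec-style filesystem
    (anything with isdir/isfile/ls/open), instead of hard-coding s3fs."""

    def __init__(self, fs):
        self.fs = fs  # any fsspec-compatible filesystem object

    def isdir(self, path):
        return self.fs.isdir(path)

    def isfile(self, path):
        return self.fs.isfile(path)

    def ls(self, path):
        return self.fs.ls(path)

    def open(self, path, mode="rb"):
        return self.fs.open(path, mode)


class InMemoryFS:
    """Tiny stand-in for an fsspec filesystem, for illustration only."""

    def __init__(self, files):
        self.files = files  # mapping of path -> bytes

    def isfile(self, path):
        return path in self.files

    def isdir(self, path):
        prefix = path.rstrip("/") + "/"
        return any(p.startswith(prefix) for p in self.files)

    def ls(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self.files if p.startswith(prefix))

    def open(self, path, mode="rb"):
        return io.BytesIO(self.files[path])


fs = FSSpecWrapper(InMemoryFS({"bucket/data.parquet": b"\x00"}))
print(fs.isdir("bucket"))  # True
print(fs.ls("bucket"))     # ['bucket/data.parquet']
```

Because the wrapper only touches the fsspec method surface, the same class would serve s3fs, gcsfs, or any other implementation, which is what makes the ARROW-3717 GCSFSWrapper request a special case of this one.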
[jira] [Created] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals
Tom Augspurger created ARROW-1897:
-------------------------------------
             Summary: Incorrect numpy_type for pandas metadata of Categoricals
                 Key: ARROW-1897
                 URL: https://issues.apache.org/jira/browse/ARROW-1897
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.8.0
            Reporter: Tom Augspurger
             Fix For: 0.9.0

If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a `Categorical` should be the storage type used for the *codes*. It looks like pyarrow is always using 'object'.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

bq. The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
[jira] [Updated] (ARROW-1897) Incorrect numpy_type for pandas metadata of Categoricals
[ https://issues.apache.org/jira/browse/ARROW-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Augspurger updated ARROW-1897:
----------------------------------
    Description:
If I'm reading http://pandas-docs.github.io/pandas-docs-travis/developer.html#storing-pandas-dataframe-objects-in-apache-parquet-format correctly, the "numpy_type" field of a `Categorical` should be the storage type used for the *codes*. It looks like pyarrow is always using 'object'.

{code}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: import pyarrow.parquet as pq

In [4]: import io

In [5]: import json

In [6]: df = pd.DataFrame({"A": [1, 2]},
   ...:                   index=pd.CategoricalIndex(['one', 'two'], name='idx'))

In [8]: sink = io.BytesIO()
   ...: pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
   ...: json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['columns'][-1]
Out[8]:
{'field_name': '__index_level_0__',
 'metadata': {'num_categories': 2, 'ordered': False},
 'name': 'idx',
 'numpy_type': 'object',
 'pandas_type': 'categorical'}
{code}

From the spec:

bq. The numpy_type is the physical storage type of the column, which is the result of str(dtype) for the underlying NumPy array that holds the data. So for datetimetz this is datetime64[ns] and for categorical, it may be any of the supported integer categorical types.

So the 'numpy_type' field should be something like `'int8'` instead of `'object'`.
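For context on why `'int8'` is the expected value here: pandas sizes categorical codes to the smallest signed integer type that can hold them, with -1 reserved as the missing-value code. A stdlib-only sketch of that sizing rule (the helper name is hypothetical, for illustration only):

```python
def categorical_codes_numpy_type(num_categories: int) -> str:
    """Smallest signed integer dtype that can hold codes for num_categories.

    Codes range over [-1, num_categories - 1]; -1 marks a missing value,
    so only the upper bound constrains the dtype choice.
    """
    for dtype, max_code in [("int8", 2**7 - 1), ("int16", 2**15 - 1),
                            ("int32", 2**31 - 1), ("int64", 2**63 - 1)]:
        if num_categories - 1 <= max_code:
            return dtype
    raise ValueError("too many categories for a 64-bit code")


print(categorical_codes_numpy_type(2))      # 'int8'  -> the example above
print(categorical_codes_numpy_type(40000))  # 'int32'
```

So for the two-category index in the report, a spec-conforming writer would emit `'numpy_type': 'int8'` rather than `'object'`.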
[jira] [Commented] (ARROW-1580) [Python] Instructions for setting up nightly builds on Linux
[ https://issues.apache.org/jira/browse/ARROW-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16327817#comment-16327817 ]

Tom Augspurger commented on ARROW-1580:
---------------------------------------
The short version is:

# We have a build machine running Ubuntu.
# Phil has a build tool, [https://github.com/cpcloud/scourge], that's been set up on the build machine.
# That tool is triggered through an Airflow scheduler. Airflow (and some PyData ASVs) are bootstrapped through [https://github.com/tomaugspurger/asv-runner].

The extra step for the Arrow nightlies was to add a file in ~/airflow/dags so that the scheduler picked it up.

> [Python] Instructions for setting up nightly builds on Linux
> ------------------------------------------------------------
>
>                 Key: ARROW-1580
>                 URL: https://issues.apache.org/jira/browse/ARROW-1580
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Wes McKinney
>            Assignee: Phillip Cloud
>            Priority: Major
>             Fix For: 0.9.0
>
> cc [~cpcloud]
[jira] [Created] (ARROW-1557) pyarrow.Table.from_arrays doesn't validate names length
Tom Augspurger created ARROW-1557:
-------------------------------------
             Summary: pyarrow.Table.from_arrays doesn't validate names length
                 Key: ARROW-1557
                 URL: https://issues.apache.org/jira/browse/ARROW-1557
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.0
            Reporter: Tom Augspurger
            Priority: Minor

pa.Table.from_arrays doesn't validate that the lengths of {{arrays}} and {{names}} match. I think this should raise a {{ValueError}}:

{code}
In [1]: import pyarrow as pa

In [2]: pa.Table.from_arrays([pa.array([1, 2]), pa.array([3, 4])], names=['a', 'b', 'c'])
Out[2]:
pyarrow.Table
a: int64
b: int64

In [3]: pa.__version__
Out[3]: '0.7.0'
{code}

(This is my first time using JIRA, hopefully I didn't mess up too badly.)
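The missing check is a one-liner. A hypothetical sketch of the intended behavior, with the `dict` return standing in for the real Table construction (the function name is illustrative, not pyarrow's):

```python
def from_arrays_checked(arrays, names):
    """Sketch of from_arrays with the validation the report asks for."""
    if len(arrays) != len(names):
        raise ValueError(
            f"got {len(arrays)} arrays but {len(names)} names; "
            "lengths must match"
        )
    # Stand-in for the real Table construction in pyarrow.
    return dict(zip(names, arrays))


# The report's example would now fail loudly instead of silently
# dropping the extra name:
try:
    from_arrays_checked([[1, 2], [3, 4]], names=["a", "b", "c"])
except ValueError as exc:
    print(exc)  # got 2 arrays but 3 names; lengths must match
```

The key point is that `zip` silently truncates to the shorter input, which is exactly why the unvalidated version produces a two-column table from three names.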
[jira] [Updated] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
[ https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tom Augspurger updated ARROW-1557:
----------------------------------
    Summary: [PYTHON] pyarrow.Table.from_arrays doesn't validate names length  (was: pyarrow.Table.from_arrays doesn't validate names length)
[jira] [Commented] (ARROW-1557) [PYTHON] pyarrow.Table.from_arrays doesn't validate names length
[ https://issues.apache.org/jira/browse/ARROW-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172007#comment-16172007 ]

Tom Augspurger commented on ARROW-1557:
---------------------------------------
I can probably submit a fix on Thursday or Friday.
[jira] [Created] (ARROW-1584) [PYTHON] serialize_pandas on empty dataframe
Tom Augspurger created ARROW-1584:
-------------------------------------
             Summary: [PYTHON] serialize_pandas on empty dataframe
                 Key: ARROW-1584
                 URL: https://issues.apache.org/jira/browse/ARROW-1584
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.0
            Reporter: Tom Augspurger
            Priority: Minor
             Fix For: 0.8.0

This code

{code:python}
import pandas as pd
import pyarrow as pa

pa.serialize_pandas(pd.DataFrame())
{code}

raises

{code}
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
in ()
----> 1 pa.serialize_pandas(pd.DataFrame())

~/Envs/dask-dev/lib/python3.6/site-packages/pyarrow/ipc.py in serialize_pandas(df)
    158     sink = pa.BufferOutputStream()
    159     writer = pa.RecordBatchStreamWriter(sink, batch.schema)
--> 160     writer.write_batch(batch)
    161     writer.close()
    162     return sink.get_result()

pyarrow/ipc.pxi in pyarrow.lib._RecordBatchWriter.write_batch (/Users/travis/build/apache/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-3.6/lib.cxx:59238)()

pyarrow/error.pxi in pyarrow.lib.check_status (/Users/travis/build/apache/arrow-dist/arrow/python/build/temp.macosx-10.6-intel-3.6/lib.cxx:8113)()

ArrowNotImplementedError: Unable to convert type: null
{code}

Presumably {{pa.deserialize_pandas}} will need a fix as well.
[jira] [Created] (ARROW-1585) serialize_pandas round trip fails on integer columns
Tom Augspurger created ARROW-1585:
-------------------------------------
             Summary: serialize_pandas round trip fails on integer columns
                 Key: ARROW-1585
                 URL: https://issues.apache.org/jira/browse/ARROW-1585
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.0
            Reporter: Tom Augspurger
            Priority: Minor
             Fix For: 0.8.0

This roundtrip fails, since the integer column name isn't converted back from a string after deserializing:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: pa.deserialize_pandas(pa.serialize_pandas(pd.DataFrame({"0": [1, 2]}))).columns
Out[3]: Index(['0'], dtype='object')
{code}

That should be an {{Int64Index([0])}} for the columns.
[jira] [Created] (ARROW-1586) [PYTHON] serialize_pandas roundtrip loses columns name
Tom Augspurger created ARROW-1586:
-------------------------------------
             Summary: [PYTHON] serialize_pandas roundtrip loses columns name
                 Key: ARROW-1586
                 URL: https://issues.apache.org/jira/browse/ARROW-1586
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.0
            Reporter: Tom Augspurger
            Priority: Minor
             Fix For: 0.8.0

The serialize / deserialize roundtrip loses {{df.columns.name}}:

{code:python}
In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: df = pd.DataFrame([[1, 2]], columns=pd.Index(['a', 'b'], name='col_name'))

In [4]: df.columns.name
Out[4]: 'col_name'

In [5]: pa.deserialize_pandas(pa.serialize_pandas(df)).columns.name
{code}

Is this in scope for pyarrow? I suspect it would require an update to the pandas section of the Schema metadata.
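One possible shape for the metadata update the report suggests: record the columns' name alongside the per-column entries in the "pandas" schema metadata, so deserialization can restore it. The helper and the exact keys below are illustrative assumptions modeled loosely on the pandas-metadata spec's style, not the spec itself:

```python
def build_pandas_metadata(column_names, columns_name):
    """Sketch: a 'pandas' metadata dict that also records df.columns.name."""
    return {
        "columns": [{"name": n} for n in column_names],
        # Hypothetical entry carrying the name of the columns Index itself;
        # without something like this, the roundtrip has nowhere to put it.
        "column_indexes": [{"name": columns_name}],
    }


meta = build_pandas_metadata(["a", "b"], "col_name")
print(meta["column_indexes"][0]["name"])  # 'col_name'
```

With such an entry present, `deserialize_pandas` could set `df.columns.name` from the metadata instead of dropping it.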
[jira] [Commented] (ARROW-1585) serialize_pandas round trip fails on integer columns
[ https://issues.apache.org/jira/browse/ARROW-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174086#comment-16174086 ]

Tom Augspurger commented on ARROW-1585:
---------------------------------------
Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 2]}))).columns }} (an int, not a string).

Agreed that restricting field names to strings is best. Being able to reconstruct the original from the metadata is sufficient.
[jira] [Comment Edited] (ARROW-1585) serialize_pandas round trip fails on integer columns
[ https://issues.apache.org/jira/browse/ARROW-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174086#comment-16174086 ]

Tom Augspurger edited comment on ARROW-1585 at 9/21/17 1:11 AM:
---------------------------------------------------------------
Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 2]}) }} (an int, not a string).

Agreed that restricting field names to strings is best. Being able to reconstruct the original from the metadata is sufficient.

was (Author: tomaugspurger):
Sorry, yes, I meant for the original data to be {{ pd.DataFrame({0: [1, 2]}))).columns }} (an int, not a string).

Agreed that restricting field names to strings is best. Being able to reconstruct the original from the metadata is sufficient.
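The resolution discussed in the thread (string-only field names on the wire, with enough side metadata to reconstruct the originals) can be sketched as follows. The function names and metadata layout are hypothetical, for illustration only:

```python
def encode_columns(columns):
    """Stringify column names, recording each original type in metadata."""
    meta = [{"name": str(c), "numpy_type": type(c).__name__} for c in columns]
    return [str(c) for c in columns], meta


def decode_columns(names, meta):
    """Reverse the encoding: restore integer names from the metadata."""
    out = []
    for name, m in zip(names, meta):
        out.append(int(name) if m["numpy_type"] == "int" else name)
    return out


# The report's case: an integer column name survives the roundtrip.
names, meta = encode_columns([0])
print(names)                        # ['0']  -- what goes over the wire
print(decode_columns(names, meta))  # [0]   -- what the reader restores
```

This is the behavior the comment calls "sufficient": the serialized form keeps string names, and only the deserializer consults the metadata to rebuild the original index type.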
[jira] [Created] (ARROW-1593) [PYTHON] serialize_pandas should pass through the preserve_index keyword
Tom Augspurger created ARROW-1593:
-------------------------------------
             Summary: [PYTHON] serialize_pandas should pass through the preserve_index keyword
                 Key: ARROW-1593
                 URL: https://issues.apache.org/jira/browse/ARROW-1593
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.7.0
            Reporter: Tom Augspurger
            Assignee: Tom Augspurger
            Priority: Minor
             Fix For: 0.8.0

I'm doing some benchmarking of Arrow serialization for dask.distributed to serialize dataframes. Overall things look good compared to the current implementation (using pickle). The biggest difference was pickle's ability to use pandas' RangeIndex to avoid serializing the entire Index of values when possible.

I suspect that a "range type" isn't in scope for Arrow, but in the meantime applications using Arrow could detect the RangeIndex and pass {{ pyarrow.serialize_pandas(df, preserve_index=False) }}.
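The caller-side workaround described above amounts to a small predicate. A library-free sketch, with `index_values` standing in for `df.index` (the function name is an assumption for illustration):

```python
def should_preserve_index(index_values) -> bool:
    """Return False when the index is a trivial default (0, 1, ..., n-1).

    A default RangeIndex carries no information beyond the row count, so
    serializing it is pure overhead; anything else must be preserved.
    """
    return list(index_values) != list(range(len(index_values)))


print(should_preserve_index([0, 1, 2]))        # False: default index, skip it
print(should_preserve_index(['a', 'b', 'c']))  # True: must be serialized
```

An application would then call `pyarrow.serialize_pandas(df, preserve_index=should_preserve_index(df.index))`, which is exactly the pass-through this issue asks `serialize_pandas` to support.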
[jira] [Commented] (ARROW-2667) [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499142#comment-16499142 ]

Tom Augspurger commented on ARROW-2667:
---------------------------------------
Note that pandas' `take` is a bit complicated by trying to satisfy two APIs simultaneously. There's the NumPy-style take from [https://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html], where negative indices mean slices from the end. And then there's the "pandas-style" `take`, where `-1` is an indicator for missing values, which should be filled with the `na_value` parameter; other negative numbers are not allowed.

I'm not sure which is more appropriate for Arrow, but wanted to share a bit of background.

> [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
> ---------------------------------------------------------------------
>
>                 Key: ARROW-2667
>                 URL: https://issues.apache.org/jira/browse/ARROW-2667
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Uwe L. Korn
>            Priority: Major
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a
> list of indices and returns a reordered array.
> For reference, see Pandas' interface:
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466
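The two semantics the comment contrasts can be illustrated in pure Python. This is written from the comment's description rather than from pandas or Arrow source, so treat it as a sketch of the distinction, not either library's implementation:

```python
def take_numpy_style(values, indices):
    """NumPy-style take: negative indices count from the end."""
    return [values[i] for i in indices]


def take_pandas_style(values, indices, na_value=None):
    """Pandas-style take: -1 marks a missing value, filled with na_value;
    any other negative index is an error."""
    out = []
    for i in indices:
        if i == -1:
            out.append(na_value)
        elif i < -1:
            raise ValueError("negative indices other than -1 are not allowed")
        else:
            out.append(values[i])
    return out


vals = ['a', 'b', 'c']
print(take_numpy_style(vals, [0, -1]))   # ['a', 'c']   -1 means "last"
print(take_pandas_style(vals, [0, -1]))  # ['a', None]  -1 means "missing"
```

The same index list produces different results under the two conventions, which is exactly the ambiguity an Arrow `take` kernel would need to resolve up front.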
[jira] [Created] (ARROW-8462) Crash in lib.concat_tables on Windows
Tom Augspurger created ARROW-8462:
-------------------------------------
             Summary: Crash in lib.concat_tables on Windows
                 Key: ARROW-8462
                 URL: https://issues.apache.org/jira/browse/ARROW-8462
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.16.0
            Reporter: Tom Augspurger

This crashes for me with pyarrow 0.16 on my Windows VM:

{code:python}
import pyarrow as pa
import pandas as pd

t = pa.Table.from_pandas(pd.DataFrame({"A": [1, 2]}))
print("concat")
pa.lib.concat_tables([t])
print('done')
{code}

Installed pyarrow from conda-forge. I'm not really sure how to get more debug info on Windows, unfortunately. With `python -X faulthandler` I see

{code}
concat
Windows fatal exception: access violation

Current thread 0x04f8 (most recent call first):
  File "bug.py", line 6 in <module>
{code}
[jira] [Commented] (ARROW-8462) Crash in lib.concat_tables on Windows
[ https://issues.apache.org/jira/browse/ARROW-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084111#comment-17084111 ]

Tom Augspurger commented on ARROW-8462:
---------------------------------------
[~kszucs] I've confirmed that it's fixed with pyarrow 0.16.1.dev552. Thanks!
[jira] [Commented] (ARROW-7782) [Python] Losing index information when using write_to_dataset with partition_cols
[ https://issues.apache.org/jira/browse/ARROW-7782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123905#comment-17123905 ]

Tom Augspurger commented on ARROW-7782:
---------------------------------------
Joris, was this fix included in 0.17.1? Or is it just for 1.0?

> [Python] Losing index information when using write_to_dataset with partition_cols
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-7782
>                 URL: https://issues.apache.org/jira/browse/ARROW-7782
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: pyarrow==0.15.1
>            Reporter: Ludwik Bielczynski
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
> One cannot save the index when using {{pyarrow.parquet.write_to_dataset()}} with given partition_cols arguments. Here I have created a minimal example which shows the issue:
> {code:python}
> from pathlib import Path
> import pandas as pd
> from pyarrow import Table
> from pyarrow.parquet import write_to_dataset, read_table
>
> path = Path('/home/user/trials')
> file_name = 'local_database.parquet'
> df = pd.DataFrame({"A": [1, 2, 3], "B": ['a', 'a', 'b']},
>                   index=pd.Index(['a', 'b', 'c'], name='idx'))
> table = Table.from_pandas(df)
> write_to_dataset(table,
>                  str(path / file_name),
>                  partition_cols=['B'])
>
> df_read = read_table(str(path / file_name))
> df_read.to_pandas()
> {code}
> The issue is rather important for pandas and dask users.
[jira] [Created] (ARROW-13410) Implement min_max kernel for array[string]
Tom Augspurger created ARROW-13410:
--------------------------------------
             Summary: Implement min_max kernel for array[string]
                 Key: ARROW-13410
                 URL: https://issues.apache.org/jira/browse/ARROW-13410
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python
    Affects Versions: 4.0.1
            Reporter: Tom Augspurger

As noted in https://github.com/pandas-dev/pandas/issues/42597, `pyarrow.compute.min_max` on a string-dtype array currently raises. Here's an example from Python:

{code}
In [1]: import pyarrow, pyarrow.compute

In [2]: a = pyarrow.array(['c', 'a', 'b'])

In [4]: pyarrow.compute.min_max(a)
---------------------------------------------------------------------------
ArrowNotImplementedError                  Traceback (most recent call last)
in
----> 1 pyarrow.compute.min_max(a)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/compute.py in min_max(array, options, memory_pool, **kwargs)

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowNotImplementedError: Function min_max has no kernel matching input types (array[string])
{code}
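For reference, here is what a string `min_max` kernel would compute, sketched in pure Python with `None` standing in for Arrow nulls. This is purely illustrative, not the Arrow implementation, and it assumes null-skipping behavior; the real kernel's null handling would follow whatever options the compute API exposes:

```python
def min_max_strings(values):
    """Lexicographic min/max over a sequence of strings, skipping nulls."""
    present = [v for v in values if v is not None]
    if not present:
        # All-null (or empty) input: no defined min or max.
        return {"min": None, "max": None}
    return {"min": min(present), "max": max(present)}


print(min_max_strings(['c', 'a', None, 'b']))  # {'min': 'a', 'max': 'c'}
```

For the `['c', 'a', 'b']` array in the report, the kernel would return `{'min': 'a', 'max': 'c'}` instead of raising ArrowNotImplementedError.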