[ https://issues.apache.org/jira/browse/ARROW-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925656#comment-16925656 ]
Joris Van den Bossche commented on ARROW-6492:
----------------------------------------------

This is related to a difference in the pandas metadata written by the two libraries:

{code}
In [58]: import pyarrow.parquet as pq

In [59]: pq.read_schema("test.parquet").pandas_metadata
Out[59]:
{'column_indexes': [{'field_name': None,
   'metadata': None,
   'name': None,
   'numpy_type': 'object',
   'pandas_type': 'mixed-integer'}],
 'columns': [{'metadata': None,
   'name': 'A',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'step': 1,
   'stop': 3}],
 'pandas_version': '0.25.0'}

In [60]: df.to_parquet("test_pa.parquet", engine="pyarrow")

In [61]: pq.read_schema("test_pa.parquet").pandas_metadata
Out[61]:
{'index_columns': [{'kind': 'range',
   'name': None,
   'start': 0,
   'stop': 3,
   'step': 1}],
 'column_indexes': [{'name': None,
   'field_name': None,
   'pandas_type': 'unicode',
   'numpy_type': 'object',
   'metadata': {'encoding': 'UTF-8'}}],
 'columns': [{'name': 'A',
   'field_name': 'A',
   'pandas_type': 'int64',
   'numpy_type': 'int64',
   'metadata': None}],
 'creator': {'library': 'pyarrow', 'version': '0.14.1'},
 'pandas_version': '0.25.0'}
{code}

The difference that is causing the bug is in the {{columns}} field, where the "field_name" key is not written by the fastparquet engine (it does write a field_name in "column_indexes", but not in "columns").

I will open an issue on the fastparquet side to ensure both libraries write consistent metadata, but in the short term let's also fix this in pyarrow (this seems to be a bug in the code that deals with older files, which also had no "field_name").
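To make the missing key concrete, here is a minimal sketch of a tolerant lookup over the two "columns" entries shown in the comment. The helper name {{resolve_field_name}} is hypothetical and this is not pyarrow's actual code or the eventual fix; it only assumes that falling back to "name" is acceptable when "field_name" is absent, and it reuses the {{raw_name = 'None'}} convention visible in the traceback.

```python
# "columns" entries copied from the two metadata dumps in the comment.
fastparquet_columns = [
    {"metadata": None, "name": "A", "numpy_type": "int64", "pandas_type": "int64"},
]
pyarrow_columns = [
    {"name": "A", "field_name": "A", "pandas_type": "int64",
     "numpy_type": "int64", "metadata": None},
]


def resolve_field_name(col_meta):
    """Hypothetical helper: pick the field name for one "columns" entry.

    Assumption: when "field_name" is missing (as in the fastparquet output),
    the "name" key identifies the same field.
    """
    raw_name = col_meta.get("field_name")
    if raw_name is None:
        # Key absent or None: fall back to the pandas column name.
        raw_name = col_meta["name"]
    if raw_name is None:
        # Same convention as the `raw_name = 'None'` line in the traceback.
        raw_name = "None"
    return raw_name


names_fastparquet = [resolve_field_name(c) for c in fastparquet_columns]
names_pyarrow = [resolve_field_name(c) for c in pyarrow_columns]
print(names_fastparquet, names_pyarrow)
```

With this fallback both metadata variants resolve to a plain string, so a later lookup by field name never receives a dict.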
> [Python] file written with latest fastparquet cannot be read with latest pyarrow
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-6492
>                 URL: https://issues.apache.org/jira/browse/ARROW-6492
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: parquet
>
> From report on the pandas issue tracker: https://github.com/pandas-dev/pandas/issues/28252
> With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1), writing a file with pandas using the fastparquet engine cannot be read with the pyarrow engine:
> {code}
> df = pd.DataFrame({'A': [1, 2, 3]})
> df.to_parquet("test.parquet", engine="fastparquet", compression=None)
>
> pd.read_parquet("test.parquet", engine="pyarrow")
> {code}
> gives the following error when reading:
> {code}
> ----> 1 pd.read_parquet("test.parquet", engine="pyarrow")
>
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
>     292
>     293     impl = get_engine(engine)
> --> 294     return impl.read(path, columns=columns, **kwargs)
>
> ~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
>     123         kwargs["use_pandas_metadata"] = True
>     124         result = self.api.parquet.read_table(
> --> 125             path, columns=columns, **kwargs
>     126         ).to_pandas()
>     127         if should_close:
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
>     642         column_indexes = pandas_metadata.get('column_indexes', [])
>     643         index_descriptors = pandas_metadata['index_columns']
> --> 644         table = _add_any_metadata(table, pandas_metadata)
>     645         table, index = _reconstruct_index(table, index_descriptors,
>     646                                           all_columns)
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
>     965                 raw_name = 'None'
>     966
> --> 967             idx = schema.get_field_index(raw_name)
>     968             if idx != -1:
>     969                 if col_meta['pandas_type'] == 'datetimetz':
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()
>
> ~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()
>
> TypeError: expected bytes, dict found
> {code}

--
This message was sent by Atlassian Jira
(v8.3.2#803003)