Joris Van den Bossche created ARROW-6492:
--------------------------------------------
Summary: [Python] file written with latest fastparquet cannot be read with latest pyarrow
Key: ARROW-6492
URL: https://issues.apache.org/jira/browse/ARROW-6492
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
From a report on the pandas issue tracker:
https://github.com/pandas-dev/pandas/issues/28252
With the latest released versions of fastparquet (0.3.2) and pyarrow (0.14.1),
a file written with pandas using the fastparquet engine cannot be read back with
the pyarrow engine:
{code}
import pandas as pd

# Write with the fastparquet engine, then read back with the pyarrow engine
df = pd.DataFrame({'A': [1, 2, 3]})
df.to_parquet("test.parquet", engine="fastparquet", compression=None)
pd.read_parquet("test.parquet", engine="pyarrow")
{code}
gives the following error when reading:
{code}
----> 1 pd.read_parquet("test.parquet", engine="pyarrow")

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
292
293 impl = get_engine(engine)
--> 294 return impl.read(path, columns=columns, **kwargs)

~/miniconda3/lib/python3.7/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
123 kwargs["use_pandas_metadata"] = True
124 result = self.api.parquet.read_table(
--> 125 path, columns=columns, **kwargs
126 ).to_pandas()
127 if should_close:

~/miniconda3/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._PandasConvertible.to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table._to_pandas()

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, categories, ignore_metadata)
642 column_indexes = pandas_metadata.get('column_indexes', [])
643 index_descriptors = pandas_metadata['index_columns']
--> 644 table = _add_any_metadata(table, pandas_metadata)
645 table, index = _reconstruct_index(table, index_descriptors,
646 all_columns)

~/miniconda3/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _add_any_metadata(table, pandas_metadata)
965 raw_name = 'None'
966
--> 967 idx = schema.get_field_index(raw_name)
968 if idx != -1:
969 if col_meta['pandas_type'] == 'datetimetz':

~/miniconda3/lib/python3.7/site-packages/pyarrow/types.pxi in pyarrow.lib.Schema.get_field_index()

~/miniconda3/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, dict found
{code}
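The traceback shows that raw_name reaches schema.get_field_index() as a dict rather than a string. As a rough diagnostic sketch (not the pyarrow implementation), the helper below scans a decoded "pandas" metadata dict and lists every name entry that is not a plain string, since any such value would fail in the same way if passed through as a field name. The helper name find_non_string_names and the sample metadata are hypothetical and only illustrate the shape of the check. The actual metadata written by fastparquet 0.3.2 is not shown in this report and may differ.

```python
import json


def find_non_string_names(pandas_metadata):
    """List metadata entries that are neither a str nor None.

    Returns tuples of (section, position, key, value).
    """
    found = []
    for i, col in enumerate(pandas_metadata.get("columns", [])):
        for key in ("name", "field_name"):
            value = col.get(key)
            if value is not None and not isinstance(value, str):
                found.append(("columns", i, key, value))
    for i, descr in enumerate(pandas_metadata.get("index_columns", [])):
        if not isinstance(descr, str):
            found.append(("index_columns", i, None, descr))
    return found


# Hypothetical sample, NOT taken from the failing file. For a real file the
# JSON could be loaded with something like:
#   import pyarrow.parquet as pq
#   json.loads(pq.read_schema("test.parquet").metadata[b"pandas"])
sample = json.loads("""
{
  "index_columns": [
    {"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}
  ],
  "columns": [
    {"name": "A", "field_name": "A", "pandas_type": "int64",
     "numpy_type": "int64", "metadata": null}
  ]
}
""")

for section, position, key, value in find_non_string_names(sample):
    print(section, position, key, type(value).__name__)
```

Running this against the metadata of the failing file should show which entry carries the dict that ends up in raw_name.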