Joris Van den Bossche created ARROW-9009:
--------------------------------------------
Summary: [C++][Dataset] ARROW:schema should be removed from
schema's metadata when reading Parquet files
Key: ARROW-9009
URL: https://issues.apache.org/jira/browse/ARROW-9009
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Joris Van den Bossche
When a Parquet file that was written by Arrow is read with the datasets API,
the "ARROW:schema" key is preserved in the schema metadata:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")
dataset = ds.dataset("test.parquet", format="parquet")
{code}
{code}
In [7]: dataset.schema
Out[7]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
In [8]: dataset.to_table().schema
Out[8]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
{code}
In contrast, reading the same file with the `parquet` module reader does not
preserve this metadata:
{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
-- field metadata --
PARQUET:field_id: '1'
{code}
The "ARROW:schema" information is used to properly reconstruct the Arrow
schema from the Parquet schema. Once the Arrow schema has been reconstructed,
that information is no longer needed, so the datasets API should probably drop
it from the Arrow schema's metadata, as the `parquet` module reader already does.
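Until this is fixed, the key could probably be dropped on the Python side. The following is only a minimal sketch of such a workaround, not the proposed C++ fix. The helper names `strip_arrow_schema_key` and `strip_arrow_schema` are hypothetical, and the sketch assumes that pyarrow exposes schema metadata as a dict with bytes keys (or None when there is no metadata):

```python
ARROW_SCHEMA_KEY = b"ARROW:schema"


def strip_arrow_schema_key(metadata):
    """Return a copy of a schema-metadata mapping without ARROW:schema.

    Hypothetical helper. Assumes the mapping uses bytes keys, as
    pyarrow.Schema.metadata does, and that None means "no metadata".
    Returns None when nothing is left after removing the key.
    """
    if not metadata:
        return None
    cleaned = {k: v for k, v in metadata.items() if k != ARROW_SCHEMA_KEY}
    return cleaned or None


def strip_arrow_schema(schema):
    """Return a pyarrow.Schema without the ARROW:schema metadata entry.

    Hypothetical helper. Only schema-level metadata is rewritten here.
    """
    cleaned = strip_arrow_schema_key(schema.metadata)
    if cleaned is None:
        # No other schema-level keys remain, so drop the metadata entirely.
        return schema.remove_metadata()
    return schema.with_metadata(cleaned)
```

With the `dataset` from the example above, this might be applied to a materialized table as `table = dataset.to_table()` followed by `table.replace_schema_metadata(strip_arrow_schema_key(table.schema.metadata))`, which should leave the schema-level metadata looking like the `pq.read_table` result.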
--
This message was sent by Atlassian Jira
(v8.3.4#803005)