Hi,

It looks like pyarrow adds an entry under the 'ARROW:schema' key that
duplicates the rest of the key-value metadata in the resulting Parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import base64 as b64

df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet")
metadata = pq.read_metadata("example.parquet").metadata
print("==== All metadata ====")
print(metadata)
print("")
print("==== ARROW:schema ====")
print(metadata[b'ARROW:schema'])
print("")
print("==== b64 decoded ====")
print(b64.b64decode(metadata[b'ARROW:schema']))

Running the above shows the duplication: the base64-decoded
'ARROW:schema' value repeats what is already present under "All metadata".

What is the reason for this? Is there a good use for ARROW:schema?

I have used other libraries to write Parquet files without issue, and
none of them adds the 'ARROW:schema' metadata; I had no problems reading
their output files with pyarrow or similar tools. As an example, here is
the result of writing the same dataframe to Parquet using fastparquet:

from fastparquet import write
write("example-fq.parquet", df)
print(pq.read_metadata("example-fq.parquet").metadata)

Also, given that this duplication can significantly increase the size of
the file when there is a large amount of metadata stored, would it be
possible to optionally disable writing 'ARROW:schema' if the output files
are still functional?

Vasilis Themelis
