Hi Vasilis,

The Arrow schema is used to restore metadata (like timestamp time zones) and to reconstruct Arrow types that would otherwise be lost in the roundtrip (for example, returning data as dictionary-encoded if it was originally written that way). This can be disabled via the store_schema option in ArrowWriterProperties.
You are right that schema metadata is being duplicated in both the ARROW:schema entry and the Parquet file-level key-value metadata. I believe this is a bug, and we should fix it either by not storing the Arrow metadata in the Parquet metadata (storing it only in ARROW:schema), or by dropping the metadata from ARROW:schema and using that entry only for restoring data types and type-level metadata.

https://issues.apache.org/jira/browse/ARROW-14303

Thanks,
Wes

On Tue, Oct 12, 2021 at 4:40 AM Vasilis Themelis <[email protected]> wrote:
>
> Hi,
>
> It looks like pyarrow adds some metadata under 'ARROW:schema' that duplicates
> the rest of the key-value metadata in the resulting parquet file:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import base64 as b64
>
> df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "example.parquet")
> metadata = pq.read_metadata("example.parquet").metadata
> print("==== All metadata ====")
> print(metadata)
> print("")
> print("==== ARROW:schema ====")
> print(metadata[b'ARROW:schema'])
> print("")
> print("==== b64 decoded ====")
> print(b64.b64decode(metadata[b'ARROW:schema']))
>
> The above should show the duplication between "All metadata" and the b64-decoded
> ARROW:schema.
>
> What is the reason for this? Is there a good use for ARROW:schema?
>
> I have used other libraries to write parquet files without an issue, and none
> of them adds the 'ARROW:schema' metadata. I had no issues reading their output
> files with pyarrow or similar.
>
> As an example, here is the result of writing the same dataframe to parquet
> using fastparquet:
>
> from fastparquet import write
> write("example-fq.parquet", df)
> print(pq.read_metadata("example-fq.parquet").metadata)
>
> Also, given that this duplication can significantly increase the size of the
> file when there is a large amount of metadata stored, would it be possible to
> optionally disable writing 'ARROW:schema' if the output files are still
> functional?
>
> Vasilis Themelis
