hi Vasilis,

The Arrow schema is used to restore metadata (like timestamp time
zones) and reconstruct Arrow types that might otherwise be lost in the
roundtrip (for example, returning data as dictionary-encoded if it was
originally written that way). This can be disabled by turning off the
store_schema option in ArrowWriterProperties.

You are right that schema metadata is being duplicated both in the
ARROW:schema entry and in the Parquet schema-level metadata — I
believe this is a bug, and we should fix it either by not storing the
Arrow metadata in the Parquet metadata (storing it only in
ARROW:schema) or by dropping the metadata from ARROW:schema and using
that entry only for restoring data types and type metadata.

https://issues.apache.org/jira/browse/ARROW-14303

Thanks,
Wes

On Tue, Oct 12, 2021 at 4:40 AM Vasilis Themelis <[email protected]> wrote:
>
> Hi,
>
> It looks like pyarrow adds some metadata under 'ARROW:schema' that duplicates 
> the rest of the key-value metadata in the resulting parquet file:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import base64 as b64
>
> df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "example.parquet")
> metadata = pq.read_metadata("example.parquet").metadata
> print("==== All metadata ====")
> print(metadata)
> print("")
> print("==== ARROW:schema ====")
> print(metadata[b'ARROW:schema'])
> print("")
> print("==== b64 decoded ====")
> print(b64.b64decode(metadata[b'ARROW:schema']))
>
> The above should show the duplication between "All metadata" and "b64 
> decoded" ARROW:schema.
>
> What is the reason for this? Is there a good use for ARROW:schema?
>
> I have used other libraries to write parquet files without an issue and none 
> of them adds the 'ARROW:schema' metadata. I had no issues with reading their 
> output files with pyarrow or similar. As an example, here is the result of 
> writing the same dataframe into parquet using fastparquet:
>
> from fastparquet import write
> write("example-fq.parquet", df)
> print(pq.read_metadata("example-fq.parquet").metadata)
>
> Also, given that this duplication can significantly increase the size of the 
> file when there is a large amount of metadata stored, would it be possible to 
> optionally disable writing 'ARROW:schema' if the output files are still 
> functional?
>
> Vasilis Themelis
