hi Vasilis,

The Arrow schema is used to restore metadata (like timestamp time
zones) and reconstruct Arrow types that might otherwise be lost in the
roundtrip (for example, returning data as dictionary-encoded if it was
originally written that way). This can be disabled by turning off the
store_schema option in ArrowWriterProperties.

You are right that schema metadata is being duplicated both in the
ARROW:schema entry and in the Parquet schema-level metadata — I
believe this is a bug, and we should fix it either by not storing the
Arrow metadata in the Parquet metadata (storing it only in
ARROW:schema) or by dropping the metadata from ARROW:schema and using
that entry only for restoring data types and type metadata.

https://issues.apache.org/jira/browse/ARROW-14303

Thanks,
Wes

On Tue, Oct 12, 2021 at 4:40 AM Vasilis Themelis <[email protected]> wrote:
>
> Hi,
>
> It looks like pyarrow adds some metadata under 'ARROW:schema' that duplicates 
> the rest of the key-value metadata in the resulting parquet file:
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import base64 as b64
>
> df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
> table = pa.Table.from_pandas(df)
> pq.write_table(table, "example.parquet")
> metadata = pq.read_metadata("example.parquet").metadata
> print("==== All metadata ====")
> print(metadata)
> print("")
> print("==== ARROW:schema ====")
> print(metadata[b'ARROW:schema'])
> print("")
> print("==== b64 decoded ====")
> print(b64.b64decode(metadata[b'ARROW:schema']))
>
> The above should show the duplication between "All metadata" and "b64 
> decoded" ARROW:schema.
>
> What is the reason for this? Is there a good use for ARROW:schema?
>
> I have used other libraries to write parquet files without an issue and none 
> of them adds the 'ARROW:schema' metadata. I had no issues with reading their 
> output files with pyarrow or similar. As an example, here is the result of 
> writing the same dataframe into parquet using fastparquet:
>
> from fastparquet import write
> write("example-fq.parquet", df)
> print(pq.read_metadata("example-fq.parquet").metadata)
>
> Also, given that this duplication can significantly increase the size of the 
> file when there is a large amount of metadata stored, would it be possible to 
> optionally disable writing 'ARROW:schema' if the output files are still 
> functional?
>
> Vasilis Themelis
