Re: [C++][python] Arrow Parquet metadata issues with round trip read/write table

Weston Pace Thu, 29 Apr 2021 08:50:09 -0700

You could copy the Parquet field IDs when you originally read in the data
and write them out to a custom metadata field.  That custom field gets saved
(unmodified) into the Parquet file.  Then, after reading the Parquet file
back, you could copy your custom metadata into the field_id field, replacing
the made-up field IDs.


This won't help if your workflow is (external tool -> arrow -> parquet file
-> external tool), because the restore step needs Arrow to read the file
back, but it may help if your workflow is (external tool -> arrow ->
parquet file -> arrow -> external tool).

On Thu, Apr 29, 2021 at 1:56 AM Ted Gooch <[email protected]> wrote:

> Hi,
>
> I'm having an issue where I read in some parquet data and write it back,
> and the field_ids that get written don't match the schema that I
> provided to pyarrow.parquet.write_table. I browsed through the PR that
> added support for field_id metadata, and it looks like this is a known
> behavior and has this currently open issue:
> https://issues.apache.org/jira/browse/PARQUET-1798
>
> Is there any way in the current API to get write_table to use the
> metadata from the provided schema? Or is the DFS assignment of field_ids
> the only behavior pending the issue referenced above?
>
> *Basic Example here:*
>
> import pyarrow.parquet as pq
> print("------------ORIGINAL------------")
> print(arrow_tbl.schema)
> pq.write_table(arrow_tbl, 'example.parquet')
> read_back = pq.ParquetFile('example.parquet')
> print("------------READ BACK------------")
> print(read_back.schema_arrow)
>
> *Output*
> ------------ORIGINAL------------
> tester_flags: list<element: string>
>   child 0, element: string
>     -- field metadata --
>     PARQUET:field_id: '36'
>   -- field metadata --
>   PARQUET:field_id: '16'
> signup_country_iso_code: string
>   -- field metadata --
>   PARQUET:field_id: '17'
> -- schema metadata --
> iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
> + 5286
> ------------READ BACK------------
> tester_flags: list<element: string>
>   child 0, element: string
>     -- field metadata --
>     PARQUET:field_id: '3'
>   -- field metadata --
>   PARQUET:field_id: '1'
> signup_country_iso_code: string
>   -- field metadata --
>   PARQUET:field_id: '4'
> -- schema metadata --
> iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
> + 5286
>
>
