You could copy the Parquet field IDs into a custom metadata field when you first read in the data. That custom metadata will be saved (unmodified) in the Parquet file. Then, after reading the Parquet file back, you could copy your custom metadata into the PARQUET:field_id metadata, replacing the made-up field IDs.

This won't help if your workflow is (external tool -> arrow -> parquet file -> external tool) but it may help if your workflow is (external tool -> arrow -> parquet file -> arrow -> external tool).

On Thu, Apr 29, 2021 at 1:56 AM Ted Gooch <[email protected]> wrote:

> Hi,
>
> I'm having an issue where I'm reading in some parquet data, and writing it
> back, and when I write the field_id's don't match the schema that I
> provided to pyarrow.parquet.write_table. I browsed through the PR that
> added support for field_id metadata, and it looks like this is a known
> behavior and has this currently open issue:
> https://issues.apache.org/jira/browse/PARQUET-1798
>
> Is there any way in the current API to get the write_table to use the
> metadata from the provided schema? Or is the DFS assignment of field_id's
> the only behavior pending the issue referenced above?
>
> *Basic Example here:*
>
> import pyarrow.parquet as pq
> print("------------ORIGINAL------------")
> print(arrow_tbl.schema)
> pq.write_table(arrow_tbl, 'example.parquet')
> read_back = pq.ParquetFile('example.parquet')
> print("------------READ BACK------------")
> print(read_back.schema_arrow)
>
> *Output*
> ------------ORIGINAL------------
> tester_flags: list<element: string>
>   child 0, element: string
>     -- field metadata --
>     PARQUET:field_id: '36'
>   -- field metadata --
>   PARQUET:field_id: '16'
> signup_country_iso_code: string
>   -- field metadata --
>   PARQUET:field_id: '17'
> -- schema metadata --
> iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
> + 5286
> ------------READ BACK------------
> tester_flags: list<element: string>
>   child 0, element: string
>     -- field metadata --
>     PARQUET:field_id: '3'
>   -- field metadata --
>   PARQUET:field_id: '1'
> signup_country_iso_code: string
>   -- field metadata --
>   PARQUET:field_id: '4'
> -- schema metadata --
> iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
> + 5286
