Hi,
I'm having an issue where I'm reading in some parquet data and writing it
back, and on write the field_ids don't match the schema that I
provided to pyarrow.parquet.write_table. I browsed through the PR that
added support for field_id metadata, and it looks like this is known
behavior with a currently open issue:
https://issues.apache.org/jira/browse/PARQUET-1798
Is there any way in the current API to get write_table to use the field_id
metadata from the provided schema? Or is the depth-first (DFS) assignment of
field_ids the only behavior pending the issue referenced above?
*Basic example:*
import pyarrow.parquet as pq
# arrow_tbl is a pyarrow.Table whose schema carries PARQUET:field_id
# metadata (loaded earlier; not shown)
print("------------ORIGINAL------------")
print(arrow_tbl.schema)
pq.write_table(arrow_tbl, 'example.parquet')
read_back = pq.ParquetFile('example.parquet')
print("------------READ BACK------------")
print(read_back.schema_arrow)
*Output*
------------ORIGINAL------------
tester_flags: list<element: string>
child 0, element: string
-- field metadata --
PARQUET:field_id: '36'
-- field metadata --
PARQUET:field_id: '16'
signup_country_iso_code: string
-- field metadata --
PARQUET:field_id: '17'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286
------------READ BACK------------
tester_flags: list<element: string>
child 0, element: string
-- field metadata --
PARQUET:field_id: '3'
-- field metadata --
PARQUET:field_id: '1'
signup_country_iso_code: string
-- field metadata --
PARQUET:field_id: '4'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286