[ https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538269#comment-17538269 ]
Antoine Pitrou commented on ARROW-16339:
----------------------------------------

Q1, Q2: agreed with [~emkornfield]

Q3: no idea about this

Q4: it would sound reasonable to merge metadata, with Arrow keys taking precedence over Parquet keys (if Parquet defines {{"key": "foo"}} and Arrow defines {{"key": "bar"}}, keep only {{"key": "bar"}}).

> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16339
>                 URL: https://issues.apache.org/jira/browse/ARROW-16339
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Parquet, Python
>            Reporter: Joris Van den Bossche
>            Priority: Critical
>             Fix For: 9.0.0
>
>
> Context: I ran into this issue when reading Parquet files created by GDAL
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which
> writes files that have custom key_value_metadata, but without storing
> ARROW:schema in those metadata (cc [~paleolimbot])
> ----
> Both in reading and writing files, I expected that we would map Arrow
> {{Schema::metadata}} with Parquet {{FileMetaData::key_value_metadata}}.
> But apparently this doesn't (always) happen out of the box, and only happens
> through the "ARROW:schema" field (which stores the original Arrow schema, and
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored
> directly in the Parquet FileMetaData (code below is using branch from
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet", store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema
> metadata separately. But if {{store_schema}} is False, we also stop writing
> those metadata (not fully sure if this is the intended behaviour, and that's
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/////7AAAAAQAAAAAAAKAA4ABgAFAA...',
>  b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value
> metadata into schema metadata.
> We don't have the pyarrow APIs at the moment
> to create such a file (given the above), but with a small patch I could
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the
> "ARROW:schema" one.
> This was the case that I ran into with GDAL, where I have a file with both
> keys, but where the custom "geo" key is not also included in the serialized
> arrow schema in the "ARROW:schema" key:
> {code:python}
> # includes both keys in the Parquet file
> >>> pq.read_metadata("test_gdal.parquet").metadata
> {b'geo': b'{"version":"0.1.0","...',
>  b'ARROW:schema': b'/////3gBAAAQ...'}
> # the "geo" key is lost in the Arrow schema
> >>> pq.read_table("test_gdal.parquet").schema.metadata is None
> True
> {code}
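
For illustration, the precedence rule proposed for Q4 (merge both sources, Arrow keys win on collision) could be sketched in plain Python as below. This is only a sketch and not the code that went into Arrow: the function name {{merge_key_value_metadata}} is hypothetical, skipping the "ARROW:schema" entry itself and returning None for an empty result are assumptions, and the sample values are placeholders rather than real file contents.

```python
from typing import Dict, Optional

# Key under which the serialized Arrow schema is stored in the Parquet file.
ARROW_SCHEMA_KEY = b"ARROW:schema"


def merge_key_value_metadata(
    parquet_kv: Optional[Dict[bytes, bytes]],
    arrow_kv: Optional[Dict[bytes, bytes]],
) -> Optional[Dict[bytes, bytes]]:
    """Merge Parquet file-level key/value metadata with Arrow schema metadata.

    Hypothetical helper: Arrow keys take precedence over Parquet keys.
    """
    merged: Dict[bytes, bytes] = {}
    for key, value in (parquet_kv or {}).items():
        # Assumption: the serialized schema is not user metadata, so skip it.
        if key == ARROW_SCHEMA_KEY:
            continue
        merged[key] = value
    # Arrow-side entries overwrite Parquet-side entries with the same key.
    merged.update(arrow_kv or {})
    # Assumption: mirror pyarrow's convention of None for "no metadata".
    return merged or None


# Placeholder inputs modelled on the GDAL case: a custom "geo" key next to
# "ARROW:schema", plus a colliding "key" entry as in the Q4 example.
parquet_kv = {
    b"geo": b"<geo json>",
    b"ARROW:schema": b"<serialized schema>",
    b"key": b"foo",
}
arrow_kv = {b"key": b"bar"}

merged = merge_key_value_metadata(parquet_kv, arrow_kv)
print(merged)  # {b'geo': b'<geo json>', b'key': b'bar'}
```

As a user-side workaround with pyarrow, the two inputs could be taken from {{pq.read_metadata(path).metadata}} and {{pq.read_schema(path).metadata}}, and the result attached with {{schema.with_metadata(merged)}}.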