Hey Antoine,

First of all, love the recent uptick in activity on the Parquet side. I'm on holiday, but I'll definitely catch up when I return.

I wanted to respond to this particular mail since we do store various fields in the key-value metadata for Apache Iceberg. For example:

- The JSON-serialized Iceberg schema that was used when writing the file: https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L274
- In the case of delete files, we write the kind of file (positional or equality), and in the case of equality, also the field IDs: https://github.com/apache/iceberg/blob/bd046f844a1cbad6c98919d8ea63176aeae78d33/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L905-L910

The schema can become quite large, since its size is proportional to the number of columns. This metadata is set mostly for debugging purposes and is not part of the official Iceberg spec.

I hope this helps!

Kind regards,
Fokko

On Thu, 16 May 2024 at 21:17, Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> In https://github.com/apache/parquet-format/pull/242 the question came
> of the size and overhead of key-value metadata entries in real world
> Parquet files (basically, user-defined metadata attached either to the
> entire file or to individual columns). Do people have insight to share
> about the typical number of metadata entries in a file or column, and
> their typical byte size?
>
> Regards
>
> Antoine.
>
