etseidl opened a new issue, #563: URL: https://github.com/apache/parquet-format/issues/563
### Describe the enhancement requested

Following up on a discussion on the dev mailing list (https://lists.apache.org/thread/0mp06g0r27s0ynsg3pk54zl5bqc249wg), I'd like to propose making the `path_in_schema` field in `ColumnMetaData` optional. As has been pointed out elsewhere, this field carries information that is easily obtainable from the schema, and it is repeated for every column chunk, so files with many row groups carry many copies of the same information. This leads to a good deal of unnecessary bloat in the Parquet footer. Beyond the bloat, the cost of parsing this field is quite high because it is typed as `list<string>`.

I think it would be worthwhile to embark on the process of deprecating this field. In the short term (following the advice in [CONTRIBUTING.md](https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#compatibility-and-feature-enablement)), we can mark the field `optional` in `parquet.thrift`, with the proviso that writers continue to emit it by default for some period of time. Users will be given configuration options allowing them to turn off this wasteful field if they so choose. Then, once a critical mass of implementations and downstream projects has transitioned, writers will be free to omit the field by default.

It's worth pointing out that @Jiayi-Wang-db has reported on the dev list that 3 out of 5 implementations tested already tolerate the field being missing, with arrow-rs (since 57.0.0) making a fourth. arrow-cpp does not appear to rely on the field beyond needing it for thrift to validate, and parquet-java needs only minor modifications to tolerate its absence.
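To illustrate why the field is redundant, here is a rough sketch of how a reader could rebuild each leaf column's path from the flattened, depth-first schema list in the footer. The `SchemaElement` dataclass and the `leaf_paths` helper are hypothetical stand-ins written for this example, not the generated Thrift classes or any implementation's actual API. The sketch assumes the first schema element is the root (which is not part of any column path) and that column chunks appear in the same order as the schema's leaves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    """Hypothetical, simplified stand-in for the Thrift SchemaElement."""
    name: str
    num_children: Optional[int] = None  # None or 0 means a leaf column


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Rebuild the path of every leaf column from a depth-first schema list."""
    paths: List[List[str]] = []
    pos = 1  # index 0 is the root, which is not part of any path

    def visit(prefix: List[str], count: int) -> None:
        nonlocal pos
        for _ in range(count):
            elem = schema[pos]
            pos += 1
            path = prefix + [elem.name]
            if elem.num_children:
                # Group node: descend into its children.
                visit(path, elem.num_children)
            else:
                # Leaf node: this is what path_in_schema would contain.
                paths.append(path)

    visit([], schema[0].num_children or 0)
    return paths


schema = [
    SchemaElement("schema", num_children=2),   # root
    SchemaElement("id"),                       # leaf
    SchemaElement("address", num_children=2),  # group
    SchemaElement("street"),                   # leaf
    SchemaElement("city"),                     # leaf
]
paths = leaf_paths(schema)
print(paths)  # [['id'], ['address', 'street'], ['address', 'city']]
```

Under these assumptions, the i-th column chunk in a row group corresponds to `paths[i]`, so a reader that computes this list once per file has no need for a per-chunk copy.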
