etseidl opened a new issue, #563:
URL: https://github.com/apache/parquet-format/issues/563

   ### Describe the enhancement requested
   
   Following up on a discussion on the dev mailing list 
(https://lists.apache.org/thread/0mp06g0r27s0ynsg3pk54zl5bqc249wg) I'd like to 
propose making the `path_in_schema` field in `ColumnMetaData` optional. As has 
been pointed out elsewhere, this field carries information that is easily 
obtainable from the schema, and is repeated on a per-column-chunk basis, so 
files with many row groups will have many copies of the same information. This 
bloats the Parquet footer unnecessarily. Beyond the added size, the field is 
also costly to parse because of its `list<string>` type.
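
To illustrate why the field is redundant, here is a minimal Python sketch that rebuilds every leaf column path from the flattened, depth-first schema list stored in the footer. The `SchemaElement` dataclass and the `leaf_paths` helper are hypothetical stand-ins written for this example, not the API of any Parquet implementation; only `name` and `num_children` are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    """Hypothetical stand-in for the Thrift SchemaElement (two fields only)."""
    name: str
    num_children: Optional[int] = None  # unset for leaf (primitive) columns


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Return the path of every leaf column, in schema order.

    `schema` is the flattened, depth-first element list; element 0 is the
    root, whose name is not part of any column path.
    """
    paths: List[List[str]] = []

    def walk(index: int, prefix: List[str]) -> int:
        """Visit the element at `index`; return the index after its subtree."""
        element = schema[index]
        path = prefix + [element.name]
        if not element.num_children:  # unset or zero: treat as a leaf
            paths.append(path)
            return index + 1
        next_index = index + 1
        for _ in range(element.num_children):
            next_index = walk(next_index, path)
        return next_index

    next_index = 1
    for _ in range(schema[0].num_children or 0):
        next_index = walk(next_index, [])
    return paths


# Example: a root with a primitive `id` and a group `address`.
example_schema = [
    SchemaElement("schema", 2),
    SchemaElement("id"),
    SchemaElement("address", 2),
    SchemaElement("street"),
    SchemaElement("city"),
]
print(leaf_paths(example_schema))
```

The walk runs once per file, whereas `path_in_schema` stores the same lists again for every column chunk in every row group.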
   
   I think it would be worthwhile to embark on the process of deprecating this 
field. In the short term (following the advice in 
[CONTRIBUTING.md](https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#compatibility-and-feature-enablement)),
 we can mark the field `optional` in `parquet.thrift`, but with the proviso 
that writers will continue to emit this field by default for some period of 
time. Users will be given configuration options allowing them to turn off this 
wasteful field if they so choose. Then, once a critical mass of implementations 
and downstream projects have been transitioned, writers will be free to omit 
the field by default.
   
   It's worth pointing out that @Jiayi-Wang-db has reported on the dev list 
that 3 out of 5 implementations tested already tolerate the field being 
missing, with arrow-rs (since 57.0.0) making a fourth. arrow-cpp does 
not appear to rely on the field beyond needing it for thrift to validate, and 
parquet-java needs only minor modifications to tolerate its absence. 
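
As a rough sketch of what "tolerating the field missing" could look like in a reader, the Python below prefers `path_in_schema` when a writer emitted it and otherwise falls back to the path derived from the schema. `ColumnMetaData` here is a hypothetical one-field stand-in, and `resolve_column_path` and `schema_paths` are names invented for this example. It assumes the column chunks in a row group appear in the same order as the leaf columns of the schema, as the format requires.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ColumnMetaData:
    """Hypothetical stand-in for the Thrift struct; other fields omitted."""
    path_in_schema: Optional[List[str]] = None  # None when the writer omitted it


def resolve_column_path(
    meta: ColumnMetaData,
    column_index: int,
    schema_paths: Sequence[Sequence[str]],
) -> List[str]:
    """Return the column path for the chunk at `column_index` in a row group.

    `schema_paths` holds the leaf paths derived once from the file schema,
    in schema order.
    """
    if meta.path_in_schema is not None:
        return list(meta.path_in_schema)
    # Field omitted: the chunk's position in the row group identifies the leaf.
    return list(schema_paths[column_index])


schema_paths = [["id"], ["address", "street"], ["address", "city"]]
old_style = ColumnMetaData(path_in_schema=["address", "street"])
new_style = ColumnMetaData()  # writer chose not to emit the field

print(resolve_column_path(old_style, 1, schema_paths))
print(resolve_column_path(new_style, 1, schema_paths))
```

Both calls resolve to the same path, so a reader written this way handles files from old and new writers alike.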


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

