Hi all,
While investigating a parquet-java issue with the file_offset field in
ColumnChunk [1], I discovered that parquet-java apparently does not (and
perhaps never did) write a copy of the ColumnMetaData following the
column chunk data. IMO this violates the specification [2]. Instead,
parquet-java seems to rely exclusively on the "optional" copy in the
footer.
Given that this has AFAICT never caused compatibility problems with
other Parquet readers, I'm wondering if it's safe to assume no one
actually uses the mandated copy trailing the chunk data. In that case,
would it make sense to modify the specification to match the reality on
the ground? I would propose modifying the spec to state that the
ColumnMetaData following the chunk data is also optional. Because the
file_offset field is required and so cannot simply be omitted, I'd also
propose adding language to the effect that a file_offset value of 0
means no such metadata is present in the file.
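As a rough illustration of the proposed rule, a reader could interpret
file_offset along these lines. This is only a sketch: the function name
is hypothetical and not part of any Parquet library, and it assumes a
nonzero file_offset points at the ColumnMetaData copy trailing the
chunk data.

```python
from typing import Optional


def trailing_metadata_offset(file_offset: int) -> Optional[int]:
    """Hypothetical helper illustrating the proposed spec wording.

    Returns the byte offset of the ColumnMetaData copy that follows the
    column chunk data, or None when file_offset is 0, which under the
    proposal means the writer emitted no such copy.
    """
    if file_offset == 0:
        # Proposed sentinel: no trailing ColumnMetaData in the file,
        # so the reader must use the copy stored in the footer.
        return None
    return file_offset
```

A reader getting None would simply fall back to the footer copy, which
is what readers appear to do in practice already.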
Thoughts?
Thanks,
Ed
[1] https://issues.apache.org/jira/browse/PARQUET-2139
[2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format