Hi all,
While investigating a parquet-java issue with the file_offset field in ColumnChunk [1], I discovered that parquet-java apparently does not (and perhaps never did?) write a copy of the ColumnMetaData following the column chunk data. IMO this violates the specification [2]. Instead, parquet-java seems to rely exclusively on the "optional" copy in the footer.

Given that this has AFAICT never caused compatibility issues with other Parquet readers, I'm wondering whether it's safe to assume no one actually uses the mandated copy trailing the chunk data. If so, would it make sense to modify the specification to match the reality on the ground?

I would propose changing the spec to state that the ColumnMetaData following the chunk data is also optional. Since the file_offset field is required, I'd also propose adding language to the effect that if the value of file_offset is 0, then no such metadata is present in the file.
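
To make the proposed file_offset rule concrete, here is a rough sketch of how a reader might apply it. This is illustration only: the ColumnChunk class below is a hypothetical stand-in for the Thrift struct, trailing_metadata_offset is a made-up helper rather than anything from an existing Parquet library, and it assumes file_offset is meant to point at the ColumnMetaData copy trailing the chunk data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for the Thrift ColumnChunk struct.

    Only the field under discussion is modeled here.
    """
    file_offset: int  # required in the Thrift definition, so it always has a value


def trailing_metadata_offset(chunk: ColumnChunk) -> Optional[int]:
    """Return where the trailing ColumnMetaData copy would be, or None.

    Assumption: under the proposed wording, file_offset == 0 signals that
    no copy was written after the chunk data, so the reader must rely on
    the copy in the footer instead.
    """
    if chunk.file_offset == 0:
        return None
    return chunk.file_offset
```

A reader that gets None back would simply use the footer metadata, which is what readers appear to do today anyway.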

Thoughts?

Thanks,
Ed

[1] https://issues.apache.org/jira/browse/PARQUET-2139
[2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format
