ColumnMetaData location

Ed Seidl Mon, 03 Jun 2024 10:59:04 -0700

Hi all,

While investigating a parquet-java issue with the file_offset field in ColumnChunk [1], I discovered that parquet-java does not appear to (and perhaps never did?) write a copy of the ColumnMetaData following the column chunk data. This IMO violates the specification [2]. Instead, parquet-java seems to exclusively use the "optional" copy in the footer. Given that this has AFAICT never resulted in compatibility issues with other parquet readers, I'm wondering if it's safe to assume no one actually uses the mandated copy trailing the chunk data. In that case, would it make sense to modify the specification to match the reality on the ground? I would propose modifying the spec to state that the ColumnMetaData following the chunk data is also optional. Given that the file_offset field is required, I'd also propose adding language to the effect that if the value of file_offset is 0, then no such metadata is present in the file.
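
To make the proposed file_offset rule concrete, here is a minimal sketch of how a reader might apply it. This is not taken from parquet-java or any other implementation: the ColumnChunk class below is a hypothetical stand-in for the Thrift struct, modeling only the two fields discussed, and trailing_metadata_offset is an invented helper name. It also assumes file_offset is meant to point at the trailing ColumnMetaData.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for the Thrift ColumnChunk struct; only the
    # two fields discussed in this thread are modeled.
    file_offset: int                  # required field
    meta_data: Optional[dict] = None  # footer copy of ColumnMetaData


def trailing_metadata_offset(chunk: ColumnChunk) -> Optional[int]:
    """Return the byte offset of the ColumnMetaData written after the
    chunk data, or None when the proposed rule says there is none."""
    # Proposed rule: file_offset == 0 means no trailing copy was written.
    if chunk.file_offset == 0:
        return None
    return chunk.file_offset


def resolve_metadata_source(chunk: ColumnChunk) -> str:
    """Decide which copy of the metadata a reader would use."""
    # Prefer the footer copy, which is what parquet-java appears to rely on.
    if chunk.meta_data is not None:
        return "footer"
    if trailing_metadata_offset(chunk) is not None:
        return "trailing"
    raise ValueError("no ColumnMetaData available for this column chunk")
```

Under this sketch, a file written the way parquet-java appears to write them would carry file_offset = 0 with the footer copy populated, and a reader would never attempt to seek to a trailing copy.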


Thoughts?

Thanks,
Ed

[1] https://issues.apache.org/jira/browse/PARQUET-2139
[2] https://github.com/apache/parquet-format?tab=readme-ov-file#file-format
