TIL parquet-java requires the offset of the first page/column chunk to be 4. That can be added to your list of assumptions not explicitly spelled out in the spec ;-)
I agree that these things should be made clearer. Especially since the README still states that column chunks can reside in separate files [1].

Cheers, Ed

[1] https://github.com/apache/parquet-format#separating-metadata-and-column-data

On 2026/06/17 08:16:34 Jiayi Wang wrote:
> Hi Will,
>
> Thanks for replying.
>
> Using a single address to represent each rowgroup (whether that is the
> true 'midpoint' or not) guarantees that each rowgroup is only in one split.
>
> This is my understanding as well. However, things become a bit more
> confusing when it comes to the optional `RowGroup.file_offset` field:
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
>
> As far as I understand, some versions of parquet-java used to emit
> incorrect file_offset values for row groups, and parquet-java uses this
> function to invalidate those offsets.
>
> The function appears to rely on the two assumptions I mentioned earlier:
>
> 1. Column chunks are physically stored in schema order, so the first
>    column in RowGroup.columns also has the smallest file offset.
> 2. The column chunks of a row group are stored contiguously; that is,
>    row groups do not overlap.
>
> Since parquet-java is one of the main sources of truth for Parquet readers
> and writers, I strongly suspect that there are some misunderstandings
> within the community. The specification should be clearer about what is
> allowed and what is disallowed.
>
> Best,
> Jiayi
>
> On Tue, 16 Jun 2026 at 16:55, Will Edwards via dev <[email protected]> wrote:
>
> > Hi Jiayi,
> >
> > My guess is that the splitting is a parquet-java client feature and not
> > part of the spec?
> >
> > Using a single address to represent each rowgroup (whether that is the true
> > 'midpoint' or not) guarantees that each rowgroup is only in one split, but
> > it doesn't guarantee optimal splitting. However, it performs well when the
> > assumptions you listed hold.
> >
> > I've seen lots of parquet files where the assumptions don't hold. But
> > parquet-java's split will perform correctly - if suboptimally - on them.
> >
> > best, Will
> >
> > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > While looking into how file splits work, we noticed that parquet-java
> > > assigns a row group to a split using the row group’s midpoint
> > > (parquet-java code link
> > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > >
> > > start = offset of the first column chunk
> > >
> > > midpoint = start + total_compressed_size / 2
> > >
> > > This seems to rely on two assumptions:
> > >
> > > 1. Column chunks are physically stored in schema order, so the first
> > >    column in RowGroup.columns also has the smallest file offset.
> > > 2. The column chunks of a row group are stored contiguously (i.e. row
> > >    groups are not overlapping)
> > >
> > > Are these requirements guaranteed by the Parquet format specification, or
> > > are they only conventions followed by common writers?
> > >
> > > Based on our reading, the Parquet specification only requires the column
> > > metadata to follow the same order as the SchemaElement list in
> > > FileMetaData.
> > > It does not appear to require the corresponding file offsets to be in
> > > ascending order, nor does it explicitly prohibit row groups from
> > > overlapping.
> > >
> > > Is that the correct interpretation? Either way, it would be helpful for
> > > the specification to clarify these requirements and remove the ambiguity.
> > >
> > > Best,
> > > Jiayi
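
For anyone following the thread, the midpoint rule and the two assumptions discussed above can be made concrete with a small Python sketch. This is not parquet-java's code. The names ColumnChunk, RowGroup, midpoint, assign_to_split and check_assumptions are hypothetical, the row groups are plain in-memory records rather than real Thrift metadata, and both the integer halving and the half-open split range [start, end) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for a column chunk's metadata.
    offset: int           # file offset where the chunk starts
    compressed_size: int  # compressed size in bytes


@dataclass
class RowGroup:
    # Chunks listed in schema order, which is the only ordering
    # the thread says the specification requires.
    columns: List[ColumnChunk]

    @property
    def total_compressed_size(self) -> int:
        return sum(c.compressed_size for c in self.columns)


def midpoint(rg: RowGroup) -> int:
    # start = offset of the first column chunk (first in schema order)
    start = rg.columns[0].offset
    # midpoint = start + total_compressed_size / 2 (integer halving assumed)
    return start + rg.total_compressed_size // 2


def assign_to_split(rg: RowGroup, splits: List[Tuple[int, int]]) -> Optional[int]:
    # A single address per row group lands in at most one of the
    # non-overlapping half-open ranges, so the row group is never duplicated.
    m = midpoint(rg)
    for index, (start, end) in enumerate(splits):
        if start <= m < end:
            return index
    return None


def check_assumptions(rg: RowGroup) -> Tuple[bool, bool]:
    # Assumption 1 (as phrased in the thread): the first column in
    # RowGroup.columns has the smallest file offset.
    offsets = [c.offset for c in rg.columns]
    first_is_smallest = offsets[0] == min(offsets)
    # Assumption 2: the chunks occupy one contiguous byte range, checked
    # by walking them in offset order.
    ordered = sorted(rg.columns, key=lambda c: c.offset)
    contiguous = all(
        a.offset + a.compressed_size == b.offset
        for a, b in zip(ordered, ordered[1:])
    )
    return first_is_smallest, contiguous


# A row group laid out in schema order, and one whose first column is stored last.
in_order = RowGroup([ColumnChunk(4, 100), ColumnChunk(104, 60)])
out_of_order = RowGroup([ColumnChunk(300, 50), ColumnChunk(164, 136)])
splits = [(0, 200), (200, 400)]
```

In this made-up layout, out_of_order occupies bytes 164 to 350, yet the formula yields an address past the end of that range because "start" is taken from the first column in schema order. The row group is still assigned to exactly one split, which matches Will's point that the result stays correct even when it is suboptimal.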
