Hi Will,

Thanks for replying.

> Using a single address to represent each rowgroup (whether that is the true
> 'midpoint' or not) guarantees that each rowgroup is only in one split.

This is my understanding as well. However, things become a bit more confusing
when it comes to the optional `RowGroup.file_offset` field:

https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577

As far as I understand, some versions of parquet-java used to emit incorrect
file_offset values for row groups, and parquet-java uses this function to
invalidate those offsets. The function appears to rely on the two assumptions
I mentioned earlier:

1. Column chunks are physically stored in schema order, so the first column
   in RowGroup.columns also has the smallest file offset.
2. The column chunks of a row group are stored contiguously; that is, row
   groups do not overlap.

Since parquet-java is one of the main sources of truth for Parquet readers and
writers, I strongly suspect that there are some misunderstandings within the
community. The specification should be clearer about what is allowed and what
is disallowed.

Best,
Jiayi

On Tue, Jun 16, 2026 at 16:55, Will Edwards via dev <[email protected]> wrote:

> Hi Jiayi,
>
> My guess is that the splitting is a parquet-java client feature and not
> part of the spec?
>
> Using a single address to represent each rowgroup (whether that is the true
> 'midpoint' or not) guarantees that each rowgroup is only in one split, but
> it doesn't guarantee optimal splitting. However, it performs well when the
> assumptions you listed hold.
>
> I've seen lots of parquet files where the assumptions don't hold. But
> parquet-java's split will perform correctly - if suboptimally - on them.
>
> best, Will
>
> On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
>
> > Hi all,
> >
> > While looking into how file splits work, we noticed that parquet-java
> > assigns a row group to a split using the row group’s midpoint
> > (parquet-java code link
> > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> >
> >     start = offset of the first column chunk
> >     midpoint = start + total_compressed_size / 2
> >
> > This seems to rely on two assumptions:
> >
> > 1. Column chunks are physically stored in schema order, so the first
> >    column in RowGroup.columns also has the smallest file offset.
> > 2. The column chunks of a row group are stored contiguously (i.e. row
> >    groups are not overlapping).
> >
> > Are these requirements guaranteed by the Parquet format specification, or
> > are they only conventions followed by common writers?
> >
> > Based on our reading, the Parquet specification only requires the column
> > metadata to follow the same order as the SchemaElement list in
> > FileMetaData. It does not appear to require the corresponding file
> > offsets to be in ascending order, nor does it explicitly prohibit row
> > groups from overlapping.
> >
> > Is that the correct interpretation? Either way, it would be helpful for
> > the specification to clarify these requirements and remove the ambiguity.
> >
> > Best,
> > Jiayi
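
For anyone following the thread, the midpoint rule quoted above can be sketched roughly as below. This is only an illustration of the two formulas in the original message, not parquet-java's actual code: the `ColumnChunk` and `RowGroup` classes, the `start_offset` field, and the half-open `[start, end)` split ranges are all simplifying assumptions made for this example. The second row group deliberately stores its chunks out of schema order, to show Will's point that the rule still places every row group in exactly one split even when the computed address is not the true midpoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ColumnChunk:
    # Hypothetical, simplified stand-in for a column chunk's footer metadata.
    start_offset: int      # file offset where the chunk begins
    compressed_size: int   # total compressed bytes of the chunk


@dataclass
class RowGroup:
    columns: List[ColumnChunk]  # listed in schema order, as in the footer

    @property
    def total_compressed_size(self) -> int:
        return sum(c.compressed_size for c in self.columns)


def midpoint(rg: RowGroup) -> int:
    """Representative address: start of the FIRST listed chunk plus half the size."""
    start = rg.columns[0].start_offset
    return start + rg.total_compressed_size // 2


def true_midpoint(rg: RowGroup) -> int:
    """Midpoint of the bytes the row group actually spans, for comparison."""
    lo = min(c.start_offset for c in rg.columns)
    hi = max(c.start_offset + c.compressed_size for c in rg.columns)
    return lo + (hi - lo) // 2


def assign_to_splits(
    row_groups: List[RowGroup], splits: List[Tuple[int, int]]
) -> Dict[Tuple[int, int], List[int]]:
    """Assign each row group index to the split whose [start, end) holds its midpoint."""
    assignment: Dict[Tuple[int, int], List[int]] = {s: [] for s in splits}
    for i, rg in enumerate(row_groups):
        m = midpoint(rg)
        for s in splits:
            if s[0] <= m < s[1]:
                assignment[s].append(i)
    return assignment


# Row group 0: chunks stored in schema order.
rg0 = RowGroup([ColumnChunk(4, 100), ColumnChunk(104, 100)])
# Row group 1: the first schema column is physically stored AFTER the second.
rg1 = RowGroup([ColumnChunk(304, 100), ColumnChunk(204, 100)])

splits = [(0, 256), (256, 512)]
result = assign_to_splits([rg0, rg1], splits)
```

Here `midpoint(rg1)` is 404 while `true_midpoint(rg1)` is 304, so the representative address is skewed, yet `result` still maps each row group to exactly one split because a single address can only fall inside one non-overlapping range.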
