Hi Jiayi,

My guess is that the splitting is a parquet-java client feature and not part of the spec?
Using a single address to represent each rowgroup (whether that is the true 'midpoint' or not) guarantees that each rowgroup is only in one split, but it doesn't guarantee optimal splitting. However, it performs well when the assumptions you listed hold.

I've seen lots of parquet files where the assumptions don't hold. But parquet-java's split will perform correctly - if suboptimally - on them.

Best,
Will

On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:

> Hi all,
>
> While looking into how file splits work, we noticed that parquet-java
> assigns a row group to a split using the row group’s midpoint
> (parquet-java code link
> <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
>
> start = offset of the first column chunk
>
> midpoint = start + total_compressed_size / 2
>
> This seems to rely on two assumptions:
>
> 1. Column chunks are physically stored in schema order, so the first column
>    in RowGroup.columns also has the smallest file offset.
> 2. The column chunks of a row group are stored contiguously (i.e. row
>    groups are not overlapping)
>
> Are these requirements guaranteed by the Parquet format specification, or
> are they only conventions followed by common writers?
>
> Based on our reading, the Parquet specification only requires the column
> metadata to follow the same order as the SchemaElement list in
> FileMetaData.
> It does not appear to require the corresponding file offsets to be in
> ascending order, nor does it explicitly prohibit row groups from
> overlapping.
>
> Is that the correct interpretation? Either way, it would be helpful for the
> specification to clarify these requirements and remove the ambiguity.
>
> Best,
> Jiayi
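
For anyone following the thread, here is a minimal Python sketch of the midpoint rule as described above. It is not parquet-java's code: the `RowGroup` class, the `midpoint` and `assign_to_splits` helpers, the byte offsets, and the choice of half-open split ranges `[start, end)` are all assumptions made for illustration. It only shows why a single representative address puts each row group in at most one split, even when the layout assumptions fail.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RowGroup:
    # Hypothetical stand-in for row group metadata.
    first_chunk_offset: int     # file offset of the first column chunk listed
    total_compressed_size: int  # sum of compressed sizes over all column chunks


def midpoint(rg: RowGroup) -> int:
    # start + total_compressed_size / 2, as quoted in the thread.
    return rg.first_chunk_offset + rg.total_compressed_size // 2


def assign_to_splits(
    row_groups: List[RowGroup], splits: List[Tuple[int, int]]
) -> List[Optional[int]]:
    """For each row group, return the index of the split whose half-open
    byte range [start, end) contains its midpoint, or None if no split does.
    Assumes the split ranges do not overlap each other."""
    owners: List[Optional[int]] = []
    for rg in row_groups:
        m = midpoint(rg)
        owner = None
        for i, (start, end) in enumerate(splits):
            if start <= m < end:
                owner = i
                break
        owners.append(owner)
    return owners


# Three non-overlapping byte ranges standing in for input splits.
splits = [(0, 128), (128, 256), (256, 384)]

# Contiguous row groups laid out one after another.
contiguous = [RowGroup(4, 100), RowGroup(104, 100), RowGroup(204, 100)]

# A layout that breaks the assumptions: the first listed chunk of the first
# row group sits at a higher offset than that of the second row group.
irregular = [RowGroup(150, 200), RowGroup(50, 200)]

print(assign_to_splits(contiguous, splits))
print(assign_to_splits(irregular, splits))
```

Because the split ranges do not overlap, a single address can fall inside at most one of them, so no row group is read twice. In the irregular case both row groups land in the same split, which is still correct but leaves the other splits with nothing to do.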
