> While looking into how file splits work, we noticed that parquet-java
> assigns a row group to a split using the row group’s midpoint

Could you be clearer what "split" means in parquet-java and if it is a
Java specific feature or something that is defined in the parquet spec?
I don't quite understand if this is something specific to parquet-java
or parquet as a whole.

The link you provided

https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555

goes to a function that implies it is for testing

> // Visible for testing
> static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData, RangeMetadataFilter filter) {

> Since parquet-java is one of the main sources of truth for Parquet
> readers and writers, I strongly suspect that there are some
> misunderstandings within the community.

FWIW I think the C/C++ implementation (via pyarrow/python bindings) is
also very widely used, and parquet-java can't be used as a test for what
data will be found in Parquet files.

On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <[email protected]> wrote:

> Hi Jiayi,
>
> yes I think you're right: the current split code has a few checks that
> would fail on some real parquet files I have that aren't laid out
> sequentially but still, by my reading, follow the letter of the spec.
>
> On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
>
> > Hi Will,
> >
> > Thanks for replying.
> >
> > > Using a single address to represent each rowgroup (whether that is the
> > > true 'midpoint' or not) guarantees that each rowgroup is only in one
> > > split.
> >
> > This is my understanding as well.
> > However, things become a bit more
> > confusing when it comes to the optional `RowGroup.file_offset` field:
> >
> > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> >
> > As far as I understand, some versions of parquet-java used to emit
> > incorrect file_offset values for row groups, and parquet-java uses this
> > function to invalidate those offsets.
> >
> > The function appears to rely on the two assumptions I mentioned earlier:
> >
> > 1. Column chunks are physically stored in schema order, so the first
> >    column in RowGroup.columns also has the smallest file offset.
> > 2. The column chunks of a row group are stored contiguously; that is,
> >    row groups do not overlap.
> >
> > Since parquet-java is one of the main sources of truth for Parquet
> > readers and writers, I strongly suspect that there are some
> > misunderstandings within the community. The specification should be
> > clearer about what is allowed and what is disallowed.
> >
> > Best,
> > Jiayi
> >
> > On Tue, Jun 16, 2026 at 16:55, Will Edwards via dev <[email protected]> wrote:
> >
> > > Hi Jiayi,
> > >
> > > My guess is that the splitting is a parquet-java client feature and not
> > > part of the spec?
> > >
> > > Using a single address to represent each rowgroup (whether that is the
> > > true 'midpoint' or not) guarantees that each rowgroup is only in one
> > > split, but it doesn't guarantee optimal splitting. However, it performs
> > > well when the assumptions you listed hold.
> > >
> > > I've seen lots of parquet files where the assumptions don't hold. But
> > > parquet-java's split will perform correctly - if suboptimally - on them.
> > >
> > > best, Will
> > >
> > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > While looking into how file splits work, we noticed that parquet-java
> > > > assigns a row group to a split using the row group’s midpoint
> > > > (parquet-java code link
> > > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > > >
> > > > start = offset of the first column chunk
> > > >
> > > > midpoint = start + total_compressed_size / 2
> > > >
> > > > This seems to rely on two assumptions:
> > > >
> > > > 1. Column chunks are physically stored in schema order, so the first
> > > >    column in RowGroup.columns also has the smallest file offset.
> > > > 2. The column chunks of a row group are stored contiguously (i.e. row
> > > >    groups are not overlapping)
> > > >
> > > > Are these requirements guaranteed by the Parquet format specification,
> > > > or are they only conventions followed by common writers?
> > > >
> > > > Based on our reading, the Parquet specification only requires the
> > > > column metadata to follow the same order as the SchemaElement list in
> > > > FileMetaData.
> > > > It does not appear to require the corresponding file offsets to be in
> > > > ascending order, nor does it explicitly prohibit row groups from
> > > > overlapping.
> > > >
> > > > Is that the correct interpretation? Either way, it would be helpful for
> > > > the specification to clarify these requirements and remove the
> > > > ambiguity.
> > > >
> > > > Best,
> > > > Jiayi
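
For concreteness, the midpoint rule from the original message (start = offset of the first column chunk, midpoint = start + total_compressed_size / 2) can be sketched as below. This is only an illustration, not parquet-java's actual code: the class and method names are hypothetical, and treating a split as a half-open byte range [start, end) is an assumption. It shows why a single representative address puts each row group in at most one split, and how a split can end up empty.

```java
// Illustrative sketch only; names are hypothetical and this is not the
// parquet-java API. Splits are ASSUMED to be half-open byte ranges
// [boundaries[i], boundaries[i + 1]).
final class MidpointSplitSketch {

    // midpoint = start + total_compressed_size / 2, where start is taken to be
    // the file offset of the first column chunk listed in the row group.
    public static long midpoint(long firstColumnChunkOffset, long totalCompressedSize) {
        return firstColumnChunkOffset + totalCompressedSize / 2;
    }

    // Returns the index of the split whose byte range contains the midpoint,
    // or -1 if no split contains it. Because the ranges are disjoint, a single
    // address can fall into at most one of them.
    public static int owningSplit(long[] boundaries, long midpoint) {
        for (int i = 0; i + 1 < boundaries.length; i++) {
            if (boundaries[i] <= midpoint && midpoint < boundaries[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three splits: [0, 100), [100, 200), [200, 300).
        long[] boundaries = {0L, 100L, 200L, 300L};

        // Row group A: first chunk at offset 4, 150 compressed bytes.
        long midA = midpoint(4L, 150L);   // 4 + 75 = 79
        // Row group B: first chunk at offset 154, 120 compressed bytes.
        long midB = midpoint(154L, 120L); // 154 + 60 = 214

        System.out.println("A -> split " + owningSplit(boundaries, midA)); // A -> split 0
        System.out.println("B -> split " + owningSplit(boundaries, midB)); // B -> split 2
    }
}
```

In this made-up layout the middle split receives no row group even though both row groups overlap its byte range, which is the "correct but possibly suboptimal" behaviour described above. If the first listed column chunk is not the one with the smallest offset, or the chunks are not contiguous, the computed address is no longer the true midpoint, but each row group still maps to at most one split.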
