Hi Andrew,

Good questions.

"Split" here is the Hadoop `FileSplit` concept, not anything in the Parquet spec: engines like Spark divide a file into byte ranges for parallel tasks, and parquet-java assigns each row group to a split based on the row group’s midpoint (start + total_compressed_size/2). The "// Visible for testing" note just widens the method's visibility; the midpoint logic is the real split-assignment code.

My point is that it silently assumes that (1) column 0 has the smallest offset, (2) row groups don't overlap, and (3) as Will pointed out, the first column chunk starts at offset 4. The spec requires none of these.

I also checked the Arrow C++ read path, which makes none of these assumptions. It locates each column chunk independently using its own data_page_offset and dictionary_page_offset, never reads the file_offset fields, and splits work purely by row-group index, without any midpoint calculation.

So we already have two major implementations that disagree on what they assume about physical layout. That is exactly why I think the specification should explicitly distinguish writer conventions from format guarantees.

Happy to file a parquet-format PR with proposed wording if that's the right next step.

Best,
Jiayi

Andrew Lamb <[email protected]> wrote on Wed, Jun 17, 2026 at 16:00:

> > While looking into how file splits work, we noticed that parquet-java
> > assigns a row group to a split using the row group’s midpoint
>
> Could you be clearer what "split" means in parquet-java and if it is a Java
> specific feature or something that is defined in the parquet spec? I don't
> quite understand if this is something specific to parquet-java or parquet
> as a whole.
>
> The link you provided
>
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555
>
> Goes to a function that implies it is for testing
>
> // Visible for testing
>
> static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData,
> RangeMetadataFilter filter) {
>
> > Since parquet-java is one of the main sources of truth for Parquet
> > readers and writers, I strongly suspect that there are some
> > misunderstandings within the community.
>
> FWIW I think the C/C++ implementation (via pyarrow/python bindings) is also
> very widely used, and parquet-java can't be used as a test for what data
> will be found in Parquet files.
>
> On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <[email protected]> wrote:
>
> > Hi Jiayi,
> >
> > yes I think you're right: the current split code has a few checks that
> > would fail on some real parquet files I have that aren't laid out
> > sequentially but still, by my reading, follow the letter of the spec.
> >
> > On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
> >
> > > Hi Will,
> > >
> > > Thanks for replying.
> > >
> > > > Using a single address to represent each rowgroup (whether that is
> > > > the true 'midpoint' or not) guarantees that each rowgroup is only
> > > > in one split.
> > >
> > > This is my understanding as well. However, things become a bit more
> > > confusing when it comes to the optional `RowGroup.file_offset` field:
> > >
> > > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> > >
> > > As far as I understand, some versions of parquet-java used to emit
> > > incorrect file_offset values for row groups, and parquet-java uses this
> > > function to invalidate those offsets.
> > >
> > > The function appears to rely on the two assumptions I mentioned
> > > earlier:
> > >
> > > 1. Column chunks are physically stored in schema order, so the first
> > >    column in RowGroup.columns also has the smallest file offset.
> > > 2. The column chunks of a row group are stored contiguously; that is,
> > >    row groups do not overlap.
> > >
> > > Since parquet-java is one of the main sources of truth for Parquet
> > > readers and writers, I strongly suspect that there are some
> > > misunderstandings within the community. The specification should be
> > > clearer about what is allowed and what is disallowed.
> > >
> > > Best,
> > > Jiayi
> > >
> > > Will Edwards via dev <[email protected]> wrote on Tue, Jun 16, 2026 at 16:55:
> > >
> > > > Hi Jiayi,
> > > >
> > > > My guess is that the splitting is a parquet-java client feature and
> > > > not part of the spec?
> > > >
> > > > Using a single address to represent each rowgroup (whether that is
> > > > the true 'midpoint' or not) guarantees that each rowgroup is only in
> > > > one split, but it doesn't guarantee optimal splitting. However, it
> > > > performs well when the assumptions you listed hold.
> > > >
> > > > I've seen lots of parquet files where the assumptions don't hold. But
> > > > parquet-java's split will perform correctly - if suboptimally - on
> > > > them.
> > > >
> > > > best, Will
> > > >
> > > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > While looking into how file splits work, we noticed that
> > > > > parquet-java assigns a row group to a split using the row group’s
> > > > > midpoint (parquet-java code link
> > > > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > > > >
> > > > > start = offset of the first column chunk
> > > > >
> > > > > midpoint = start + total_compressed_size / 2
> > > > >
> > > > > This seems to rely on two assumptions:
> > > > >
> > > > > 1. Column chunks are physically stored in schema order, so the
> > > > >    first column in RowGroup.columns also has the smallest file
> > > > >    offset.
> > > > > 2. The column chunks of a row group are stored contiguously (i.e.
> > > > >    row groups are not overlapping)
> > > > >
> > > > > Are these requirements guaranteed by the Parquet format
> > > > > specification, or are they only conventions followed by common
> > > > > writers?
> > > > >
> > > > > Based on our reading, the Parquet specification only requires the
> > > > > column metadata to follow the same order as the SchemaElement list
> > > > > in FileMetaData.
> > > > > It does not appear to require the corresponding file offsets to be
> > > > > in ascending order, nor does it explicitly prohibit row groups from
> > > > > overlapping.
> > > > >
> > > > > Is that the correct interpretation? Either way, it would be helpful
> > > > > for the specification to clarify these requirements and remove the
> > > > > ambiguity.
> > > > >
> > > > > Best,
> > > > > Jiayi
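P.S. To make the midpoint rule from this thread concrete, here is a minimal Python sketch. It is a simplified model of the formula quoted above (start = offset of the first column chunk, midpoint = start + total_compressed_size / 2), not the parquet-java code. The `ColumnChunk` and `RowGroup` dataclasses, the `start_offset` field, the half-open split ranges and all offsets are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    """Hypothetical stand-in for a column chunk's metadata (not a real API)."""
    start_offset: int           # first byte of the chunk in the file
    total_compressed_size: int  # compressed bytes occupied by the chunk


@dataclass
class RowGroup:
    columns: List[ColumnChunk]  # metadata order follows the schema


def midpoint(rg: RowGroup) -> int:
    """start + total_compressed_size / 2, with start taken from columns[0]."""
    start = rg.columns[0].start_offset
    total = sum(c.total_compressed_size for c in rg.columns)
    return start + total // 2


def byte_extent(rg: RowGroup) -> Tuple[int, int]:
    """Half-open [lo, hi) byte range actually covered by the chunks."""
    lo = min(c.start_offset for c in rg.columns)
    hi = max(c.start_offset + c.total_compressed_size for c in rg.columns)
    return lo, hi


def owning_splits(rg: RowGroup, splits: List[Tuple[int, int]]) -> List[int]:
    """Indices of splits whose (assumed half-open) range holds the midpoint."""
    m = midpoint(rg)
    return [i for i, (start, end) in enumerate(splits) if start <= m < end]


# Two made-up splits and two layouts of the same two column chunks.
splits = [(0, 128), (128, 256)]
sequential = RowGroup([ColumnChunk(4, 100), ColumnChunk(104, 100)])
reordered = RowGroup([ColumnChunk(104, 100), ColumnChunk(4, 100)])

for name, rg in [("sequential", sequential), ("reordered", reordered)]:
    lo, hi = byte_extent(rg)
    m = midpoint(rg)
    print(name, "midpoint:", m, "inside extent:", lo <= m < hi,
          "splits:", owning_splits(rg, splits))
```

In this toy model both layouts still land in exactly one split, which matches Will's observation. With the reordered layout, though, the computed midpoint (204) falls outside the bytes the row group actually occupies (4 to 203), so the assignment no longer reflects where the data lives.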
