Hi Andrew,

Good questions.

"Split" here is the Hadoop `FileSplit` concept, not anything in the Parquet
spec: engines like Spark divide a file into byte ranges for parallel tasks,
and parquet-java assigns each row group to a split based on the row group’s
midpoint (start + total_compressed_size/2).
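
To make the rule concrete, here is a rough Python sketch of midpoint-based assignment. It is not the parquet-java code: `RowGroupInfo`, `midpoint`, and `row_groups_for_split` are names I made up for illustration, and I am assuming the split is treated as a half-open byte range.

```python
from dataclasses import dataclass


@dataclass
class RowGroupInfo:
    # Hypothetical stand-in for the row-group metadata used by the rule.
    start: int                  # offset of the first column chunk
    total_compressed_size: int  # compressed bytes of all column chunks


def midpoint(rg):
    # start + total_compressed_size / 2, using integer division
    return rg.start + rg.total_compressed_size // 2


def row_groups_for_split(row_groups, split_start, split_length):
    # Assumption: a split covers the half-open range [split_start, split_end).
    split_end = split_start + split_length
    return [rg for rg in row_groups if split_start <= midpoint(rg) < split_end]
```

Because each row group is represented by a single address, it lands in exactly one split as long as the splits tile the file without gaps or overlap.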

The "// Visible for testing" comment only explains why the method's
visibility was widened; the midpoint logic is the real split-assignment
code. My point is that it silently assumes (1) column 0 has the smallest
offset, (2) row groups don't overlap, and (3) as Ed pointed out, the first
column chunk starts at offset 4. The spec requires none of these.
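
A small worked example of what happens when assumption (1) fails. The layout and the helper names below are hypothetical, chosen only to show the arithmetic; this is a Python sketch, not code from either implementation.

```python
def start_from_first_column(chunk_offsets):
    # Assumption (1): the first column in schema order is also first in the file.
    return chunk_offsets[0]


def start_from_min_offset(chunk_offsets):
    # No ordering assumption: take the smallest offset of any column chunk.
    return min(chunk_offsets)


# Hypothetical row group with two 1000-byte column chunks written in reverse
# schema order: column 1 occupies [4, 1004), column 0 occupies [1004, 2004).
chunk_offsets = [1004, 4]       # listed in schema order
total_compressed_size = 2000

mid_assumed = start_from_first_column(chunk_offsets) + total_compressed_size // 2
mid_actual = start_from_min_offset(chunk_offsets) + total_compressed_size // 2
```

Here `mid_assumed` is 2004, the very end of the row group's bytes, while `mid_actual` is 1004. The row group is still assigned to exactly one split either way, but not necessarily to the split that holds most of its data.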

I also checked the Arrow C++ read path, which makes none of these
assumptions. It locates each column chunk independently using its own
data_page_offset and dictionary_page_offset, never reads the file_offset
fields, and splits work purely by row-group index, without any midpoint
calculation.
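
For contrast, a rough Python sketch of that style of reader. This is my paraphrase, not Arrow's code: `chunk_start` and `assign_by_index` are hypothetical names, and the round-robin policy is only an illustrative choice for distributing row-group indices.

```python
def chunk_start(data_page_offset, dictionary_page_offset=None):
    # Use the dictionary page offset only when one is recorded and it
    # precedes the data pages; otherwise the chunk starts at the data pages.
    if dictionary_page_offset is not None and 0 < dictionary_page_offset < data_page_offset:
        return dictionary_page_offset
    return data_page_offset


def assign_by_index(num_row_groups, num_tasks):
    # Distribute row-group indices round-robin; no byte offsets are involved,
    # so physical layout cannot affect which task reads which row group.
    return [list(range(t, num_row_groups, num_tasks)) for t in range(num_tasks)]
```

With this approach each column chunk is found from its own metadata, so neither the order of chunks in the file nor overlap between row groups matters for correctness.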

So we already have two major implementations that disagree on what they
assume about physical layout. That is exactly why I think the specification
should explicitly distinguish writer conventions from format guarantees. Happy
to file a parquet-format PR with proposed wording if that's the right next
step.

Best,
Jiayi

On Wed, Jun 17, 2026 at 16:00, Andrew Lamb <[email protected]> wrote:

> > While looking into how file splits work, we noticed that parquet-java
> > assigns a row group to a split using the row group’s midpoint
>
> Could you be clearer what "split" means in parquet-java and if it is a Java
> specific feature or something that is defined in the parquet spec? I don't
> quite understand if this is something specific to parquet-java or parquet
> as a whole.
>
> The link you provided
>
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555
>
> Goes to a function that implies it is for testing
>
> > // Visible for testing
> >   static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData, RangeMetadataFilter filter) {
>
> > Since parquet-java is one of the main sources of truth for Parquet
> > readers and writers, I strongly suspect that there are some
> > misunderstandings within the community.
>
> FWIW I think the C/C++ implementation (via pyarrow/python bindings) is also
> very widely used, and parquet-java can't be used as a test for what data
> will be found in Parquet files.
>
> On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <
> [email protected]>
> wrote:
>
> > Hi Jiayi,
> >
> > yes I think you're right: the current split code has a few checks that
> > would fail on some real parquet files I have that aren't laid out
> > sequentially but still, by my reading, follow the letter of the spec.
> >
> > On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
> >
> > > Hi Will,
> > >
> > > Thanks for replying.
> > > > Using a single address to represent each rowgroup (whether that is the
> > > > true 'midpoint' or not) guarantees that each rowgroup is only in one
> > > > split.
> > >
> > > This is my understanding as well. However, things become a bit more
> > > confusing when it comes to the optional `RowGroup.file_offset` field:
> > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> > >
> > > As far as I understand, some versions of parquet-java used to emit
> > > incorrect file_offset values for row groups, and parquet-java uses this
> > > function to invalidate those offsets.
> > >
> > > The function appears to rely on the two assumptions I mentioned earlier:
> > >
> > >    1. Column chunks are physically stored in schema order, so the first
> > >    column in RowGroup.columns also has the smallest file offset.
> > >    2. The column chunks of a row group are stored contiguously; that is,
> > >    row groups do not overlap.
> > >
> > > Since parquet-java is one of the main sources of truth for Parquet
> > > readers and writers, I strongly suspect that there are some misunderstandings
> > > within the community. The specification should be clearer about what is
> > > allowed and what is disallowed.
> > > Best,
> > > Jiayi
> > >
> > > On Tue, Jun 16, 2026 at 16:55, Will Edwards via dev <[email protected]> wrote:
> > >
> > > > Hi Jiayi,
> > > >
> > > > My guess is that the splitting is a parquet-java client feature and
> not
> > > > part of the spec?
> > > >
> > > > Using a single address to represent each rowgroup (whether that is
> the
> > > true
> > > > 'midpoint' or not) guarantees that each rowgroup is only in one
> split,
> > > but
> > > > it doesn't guarantee optimal splitting.  However, it performs well
> when
> > > the
> > > > assumptions you listed hold.
> > > >
> > > > I've seen lots of parquet files where the assumptions don't hold.
> But
> > > > parquet-java's split will perform correctly - if suboptimally - on
> > them.
> > > >
> > > > best, Will
> > > >
> > > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > While looking into how file splits work, we noticed that
> parquet-java
> > > > > assigns a row group to a split using the row group’s midpoint
> > > > > (parquet-java code
> > > > > link
> > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555
> > > > > >
> > > > > ):
> > > > >
> > > > > start = offset of the first column chunk
> > > > >
> > > > > midpoint = start + total_compressed_size / 2
> > > > >
> > > > > This seems to rely on two assumptions:
> > > > >
> > > > >    1.
> > > > >
> > > > >    Column chunks are physically stored in schema order, so the
> first
> > > > column
> > > > >    in RowGroup.columns also has the smallest file offset.
> > > > >    2.
> > > > >
> > > > >    The column chunks of a row group are stored contiguously (i.e.
> row
> > > > >    groups are not overlapping)
> > > > >
> > > > > Are these requirements guaranteed by the Parquet format
> > specification,
> > > or
> > > > > are they only conventions followed by common writers?
> > > > >
> > > > > Based on our reading, the Parquet specification only requires the
> > > column
> > > > > metadata to follow the same order as the SchemaElement list in
> > > > > FileMetaData.
> > > > > It does not appear to require the corresponding file offsets to be
> in
> > > > > ascending order, nor does it explicitly prohibit row groups from
> > > > > overlapping.
> > > > >
> > > > > Is that the correct interpretation? Either way, it would be helpful
> > for
> > > > the
> > > > > specification to clarify these requirements and remove the
> ambiguity.
> > > > Best,
> > > > > Jiayi
> > > > >
> > > >
> > >
> >
>
