> While looking into how file splits work, we noticed that parquet-java
> assigns a row group to a split using the row group’s midpoint

Could you clarify what "split" means in parquet-java, and whether it is a
Java-specific feature or something defined in the Parquet spec? I don't
quite understand whether this is specific to parquet-java or applies to
Parquet as a whole.

The link you provided
https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555

goes to a function whose comment implies it is for testing:

> // Visible for testing
> static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData,
>     RangeMetadataFilter filter) {

> Since parquet-java is one of the main sources of truth for Parquet readers
> and writers, I strongly suspect that there are some misunderstandings
> within the community.

FWIW, I think the C/C++ implementation (via the pyarrow Python bindings) is
also very widely used, so parquet-java alone can't be used as a test for what
data will be found in Parquet files.
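
To make sure we are talking about the same thing, here is a minimal sketch of
the midpoint rule as I read it from the thread below (start = offset of the
first column chunk, midpoint = start + total_compressed_size / 2). This is not
the parquet-java code: MidpointSplitSketch, RowGroupInfo, and assign are
hypothetical names, and I am assuming each split is a half-open byte range
[splitStart, splitEnd) and that the splits tile the file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not parquet-java code: every name here is made up.
final class MidpointSplitSketch {

    /** Stand-in for the two row group values the midpoint rule uses. */
    static final class RowGroupInfo {
        final long firstColumnChunkOffset;
        final long totalCompressedSize;

        RowGroupInfo(long firstColumnChunkOffset, long totalCompressedSize) {
            this.firstColumnChunkOffset = firstColumnChunkOffset;
            this.totalCompressedSize = totalCompressedSize;
        }

        // midpoint = start + total_compressed_size / 2, as quoted in the thread
        long midpoint() {
            return firstColumnChunkOffset + totalCompressedSize / 2;
        }
    }

    /** Returns the row groups whose midpoint lies in [splitStart, splitEnd). */
    static List<RowGroupInfo> assign(List<RowGroupInfo> rowGroups,
                                     long splitStart, long splitEnd) {
        List<RowGroupInfo> selected = new ArrayList<>();
        for (RowGroupInfo rg : rowGroups) {
            long mid = rg.midpoint();
            if (mid >= splitStart && mid < splitEnd) {
                selected.add(rg);
            }
        }
        return selected;
    }
}
```

For example, two row groups starting at offsets 4 and 104, each with 100
compressed bytes, have midpoints 54 and 154, so splits [0, 100) and [100, 204)
each pick up exactly one of them. Under the tiling assumption, a row group has
a single midpoint and so lands in exactly one split regardless of how its
column chunks are laid out; the layout assumptions only affect how well that
midpoint reflects where the bytes actually are.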

On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <[email protected]>
wrote:

> Hi Jiayi,
>
> Yes, I think you're right: the current split code has a few checks that
> would fail on some real parquet files I have that aren't laid out
> sequentially but that, by my reading, still follow the letter of the spec.
>
> On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
>
> > Hi Will,
> >
> > Thanks for replying.
> >
> > > Using a single address to represent each rowgroup (whether that is the
> > > true 'midpoint' or not) guarantees that each rowgroup is only in one
> > > split.
> >
> > This is my understanding as well. However, things become a bit more
> > confusing when it comes to the optional `RowGroup.file_offset` field:
> >
> >
> > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> >
> > As far as I understand, some versions of parquet-java used to emit
> > incorrect file_offset values for row groups, and parquet-java uses this
> > function to invalidate those offsets.
> >
> > The function appears to rely on the two assumptions I mentioned earlier:
> >
> >    1. Column chunks are physically stored in schema order, so the first
> >    column in RowGroup.columns also has the smallest file offset.
> >    2. The column chunks of a row group are stored contiguously; that is,
> >    row groups do not overlap.
> >
> > Since parquet-java is one of the main sources of truth for Parquet
> > readers and writers, I strongly suspect that there are some
> > misunderstandings within the community. The specification should be
> > clearer about what is allowed and what is disallowed.
> >
> > Best,
> > Jiayi
> >
> > On Tue, Jun 16, 2026 at 16:55, Will Edwards via dev <[email protected]> wrote:
> >
> > > Hi Jiayi,
> > >
> > > My guess is that the splitting is a parquet-java client feature and not
> > > part of the spec?
> > >
> > > Using a single address to represent each rowgroup (whether that is the
> > > true 'midpoint' or not) guarantees that each rowgroup is only in one
> > > split, but it doesn't guarantee optimal splitting.  However, it performs
> > > well when the assumptions you listed hold.
> > >
> > > I've seen lots of parquet files where the assumptions don't hold.  But
> > > parquet-java's split will perform correctly - if suboptimally - on them.
> > >
> > > best, Will
> > >
> > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > While looking into how file splits work, we noticed that parquet-java
> > > > assigns a row group to a split using the row group’s midpoint
> > > > (parquet-java code link
> > > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > > >
> > > > start = offset of the first column chunk
> > > >
> > > > midpoint = start + total_compressed_size / 2
> > > >
> > > > This seems to rely on two assumptions:
> > > >
> > > >    1. Column chunks are physically stored in schema order, so the first
> > > >    column in RowGroup.columns also has the smallest file offset.
> > > >    2. The column chunks of a row group are stored contiguously (i.e. row
> > > >    groups are not overlapping).
> > > >
> > > > Are these requirements guaranteed by the Parquet format specification,
> > > > or are they only conventions followed by common writers?
> > > >
> > > > Based on our reading, the Parquet specification only requires the column
> > > > metadata to follow the same order as the SchemaElement list in
> > > > FileMetaData. It does not appear to require the corresponding file
> > > > offsets to be in ascending order, nor does it explicitly prohibit row
> > > > groups from overlapping.
> > > >
> > > > Is that the correct interpretation? Either way, it would be helpful for
> > > > the specification to clarify these requirements and remove the ambiguity.
> > > >
> > > > Best,
> > > > Jiayi
