Hi Jiayi,

Yes, I think you're right: the current split code has a few checks that
would fail on some real Parquet files I have. Those files aren't laid out
sequentially but, by my reading, still follow the letter of the spec.
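
To make the midpoint rule quoted downthread concrete, here is a minimal
Python sketch. It is not parquet-java's actual code: the ColumnChunk and
RowGroup classes, the function names, and the half-open split interval are
assumptions made for illustration. Only the start and midpoint formulas come
from the thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunk:
    offset: int           # hypothetical: file offset where the chunk starts
    compressed_size: int  # hypothetical: the chunk's total_compressed_size


@dataclass
class RowGroup:
    columns: List[ColumnChunk]  # listed in schema order, as in the metadata


def midpoint(rg: RowGroup) -> int:
    # start = offset of the first column chunk (first in schema order)
    start = rg.columns[0].offset
    total = sum(c.compressed_size for c in rg.columns)
    # midpoint = start + total_compressed_size / 2
    return start + total // 2


def in_split(rg: RowGroup, split_start: int, split_end: int) -> bool:
    # Assumption: a split is the half-open byte range [split_start, split_end).
    return split_start <= midpoint(rg) < split_end


def is_sequential(rg: RowGroup) -> bool:
    # Strict form of both assumptions: chunks appear in schema order and
    # each one starts exactly where the previous one ends (no padding).
    expected = rg.columns[0].offset
    for chunk in rg.columns:
        if chunk.offset != expected:
            return False
        expected = chunk.offset + chunk.compressed_size
    return True


# A row group whose second column is stored ahead of its first.
out_of_order = RowGroup([ColumnChunk(500, 100), ColumnChunk(4, 496)])
splits = [(0, 400), (400, 800), (800, 1200)]
owners = [s for s in splits if in_split(out_of_order, *s)]
```

With the second column chunk stored ahead of the first, the computed
midpoint (798) falls outside the bytes the row group occupies (4 up to 600),
yet exactly one of the three splits still claims it.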

On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:

> Hi Will,
>
> Thanks for replying.
> > Using a single address to represent each rowgroup (whether that is the
> > true 'midpoint' or not) guarantees that each rowgroup is only in one split.
>
> This is my understanding as well. However, things become a bit more
> confusing when it comes to the optional `RowGroup.file_offset` field:
>
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
>
> As far as I understand, some versions of parquet-java used to emit
> incorrect file_offset values for row groups, and parquet-java uses this
> function to invalidate those offsets.
>
> The function appears to rely on the two assumptions I mentioned earlier:
>
>    1. Column chunks are physically stored in schema order, so the first
>    column in RowGroup.columns also has the smallest file offset.
>    2. The column chunks of a row group are stored contiguously; that is,
>    row groups do not overlap.
>
> Since parquet-java is one of the main sources of truth for Parquet readers
> and writers, I strongly suspect that there are some misunderstandings
> within the community. The specification should be clearer about what is
> allowed and what is disallowed.
>
> Best,
> Jiayi
>
> Will Edwards via dev <[email protected]> wrote on Tue, 16 Jun 2026 at 16:55:
>
> > Hi Jiayi,
> >
> > My guess is that the splitting is a parquet-java client feature and not
> > part of the spec?
> >
> > Using a single address to represent each rowgroup (whether that is the
> > true 'midpoint' or not) guarantees that each rowgroup is only in one split,
> > but it doesn't guarantee optimal splitting.  However, it performs well when
> > the assumptions you listed hold.
> >
> > I've seen lots of parquet files where the assumptions don't hold.  But
> > parquet-java's split will perform correctly - if suboptimally - on them.
> >
> > best, Will
> >
> > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > While looking into how file splits work, we noticed that parquet-java
> > > assigns a row group to a split using the row group’s midpoint
> > > (parquet-java code link:
> > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > >
> > > start = offset of the first column chunk
> > >
> > > midpoint = start + total_compressed_size / 2
> > >
> > > This seems to rely on two assumptions:
> > >
> > >    1. Column chunks are physically stored in schema order, so the first
> > >    column in RowGroup.columns also has the smallest file offset.
> > >    2. The column chunks of a row group are stored contiguously (i.e. row
> > >    groups are not overlapping).
> > >
> > > Are these requirements guaranteed by the Parquet format specification, or
> > > are they only conventions followed by common writers?
> > >
> > > Based on our reading, the Parquet specification only requires the column
> > > metadata to follow the same order as the SchemaElement list in
> > > FileMetaData. It does not appear to require the corresponding file
> > > offsets to be in ascending order, nor does it explicitly prohibit row
> > > groups from overlapping.
> > >
> > > Is that the correct interpretation? Either way, it would be helpful for
> > > the specification to clarify these requirements and remove the ambiguity.
> > >
> > > Best,
> > > Jiayi
> > >
> >
>
