Re: Clarification on row-group and column-chunk layout

Ed Seidl Wed, 17 Jun 2026 08:33:32 -0700

TIL parquet-java requires that the offset of the first page/column chunk must 
be 4. That can be added to your list of assumptions not explicitly spelled out 
in the spec ;-)


I agree that these things should be made clearer. Especially since the README 
still states that column chunks can reside in separate files [1].

Cheers,
Ed

[1] https://github.com/apache/parquet-format#separating-metadata-and-column-data

On 2026/06/17 08:16:34 Jiayi Wang wrote:
> Hi Will,
> 
> Thanks for replying.
> 
> > Using a single address to represent each rowgroup (whether that is the
> > true 'midpoint' or not) guarantees that each rowgroup is only in one split.
> 
> This is my understanding as well. However, things become a bit more
> confusing when it comes to the optional `RowGroup.file_offset` field:
> https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> 
> As far as I understand, some versions of parquet-java used to emit
> incorrect file_offset values for row groups, and parquet-java uses this
> function to invalidate those offsets.
> 
> The function appears to rely on the two assumptions I mentioned earlier:
> 
>    1. Column chunks are physically stored in schema order, so the first
>    column in RowGroup.columns also has the smallest file offset.
>    2. The column chunks of a row group are stored contiguously; that is,
>    row groups do not overlap.
> 
> Since parquet-java is one of the main sources of truth for Parquet readers
> and writers, I strongly suspect that there are some misunderstandings
> within the community. The specification should be clearer about what is
> allowed and what is disallowed.
> 
> Best,
> Jiayi
> 
> Will Edwards via dev <[email protected]> wrote on Tue, 16 Jun 2026 at 16:55:
> 
> > Hi Jiayi,
> >
> > My guess is that the splitting is a parquet-java client feature and not
> > part of the spec?
> >
> > Using a single address to represent each rowgroup (whether that is the true
> > 'midpoint' or not) guarantees that each rowgroup is only in one split, but
> > it doesn't guarantee optimal splitting.  However, it performs well when the
> > assumptions you listed hold.
> >
> > I've seen lots of parquet files where the assumptions don't hold.  But
> > parquet-java's split will perform correctly - if suboptimally - on them.
> >
> > best, Will
> >
> > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > While looking into how file splits work, we noticed that parquet-java
> > > assigns a row group to a split using the row group’s midpoint
> > > (parquet-java code link
> > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > >
> > > start = offset of the first column chunk
> > >
> > > midpoint = start + total_compressed_size / 2
> > >
> > > This seems to rely on two assumptions:
> > >
> > >    1. Column chunks are physically stored in schema order, so the first
> > >    column in RowGroup.columns also has the smallest file offset.
> > >    2. The column chunks of a row group are stored contiguously (i.e. row
> > >    groups are not overlapping)
> > >
> > > Are these requirements guaranteed by the Parquet format specification, or
> > > are they only conventions followed by common writers?
> > >
> > > Based on our reading, the Parquet specification only requires the column
> > > metadata to follow the same order as the SchemaElement list in
> > > FileMetaData. It does not appear to require the corresponding file
> > > offsets to be in ascending order, nor does it explicitly prohibit row
> > > groups from overlapping.
> > >
> > > Is that the correct interpretation? Either way, it would be helpful for
> > > the specification to clarify these requirements and remove the ambiguity.
> > >
> > > Best,
> > > Jiayi
> > >
> >
>
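
The midpoint rule and the two layout assumptions discussed in this thread can
be sketched in a few lines of Python. This is a minimal illustration, not
parquet-java's code: the ColumnChunk and RowGroup classes, their field names,
and every function name below are hypothetical stand-ins, and treating a split
as a half-open byte range [start, end) is an assumption. The contiguity check
also assumes nothing else (padding, indexes) sits between column chunks.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for column chunk metadata: the file offset where
    # the chunk's bytes begin and how many compressed bytes it occupies.
    start_offset: int
    total_compressed_size: int


@dataclass
class RowGroup:
    # Column chunks listed in metadata (schema) order.
    columns: List[ColumnChunk]


def midpoint(rg: RowGroup) -> int:
    """Midpoint as described in the thread: start + total_compressed_size / 2,
    where start is the offset of the FIRST listed column chunk."""
    start = rg.columns[0].start_offset
    total = sum(c.total_compressed_size for c in rg.columns)
    return start + total // 2


def split_index(rg: RowGroup, splits: List[Tuple[int, int]]) -> int:
    """Index of the split whose [start, end) range holds the midpoint, or -1.
    A single address per row group means it lands in at most one split."""
    mid = midpoint(rg)
    for i, (lo, hi) in enumerate(splits):
        if lo <= mid < hi:
            return i
    return -1


def first_column_has_smallest_offset(rg: RowGroup) -> bool:
    """Assumption 1: the first listed chunk is also physically first."""
    return rg.columns[0].start_offset == min(c.start_offset for c in rg.columns)


def is_contiguous(rg: RowGroup) -> bool:
    """Assumption 2: the chunks sit back to back with no gaps or overlap."""
    chunks = sorted(rg.columns, key=lambda c: c.start_offset)
    return all(
        a.start_offset + a.total_compressed_size == b.start_offset
        for a, b in zip(chunks, chunks[1:])
    )


# Illustrative layout: the first chunk starts at offset 4, right after the
# 4-byte "PAR1" magic.
rg_a = RowGroup([ColumnChunk(4, 100), ColumnChunk(104, 50)])
rg_b = RowGroup([ColumnChunk(154, 200), ColumnChunk(354, 100)])
# A row group that violates both assumptions (first listed chunk is last on
# disk, and there is a gap between the chunks).
rg_odd = RowGroup([ColumnChunk(600, 10), ColumnChunk(450, 50)])

splits = [(0, 128), (128, 512)]
print(split_index(rg_a, splits), split_index(rg_b, splits))
print(first_column_has_smallest_offset(rg_odd), is_contiguous(rg_odd))
```

For rg_odd the computed midpoint is still a single address, so the row group
would still be assigned to at most one split; it just no longer sits in the
middle of the row group's bytes, which matches Will's point that splitting
stays correct but may be suboptimal on such files.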
