Re: Clarification on row-group and column-chunk layout

Will Edwards via dev Tue, 16 Jun 2026 07:54:15 -0700

Hi Jiayi,

My guess is that the splitting is a parquet-java client feature and not
part of the spec.

Using a single address to represent each rowgroup (whether that is the true
'midpoint' or not) guarantees that each rowgroup is only in one split, but
it doesn't guarantee optimal splitting.  However, it performs well when the
assumptions you listed hold.
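
To illustrate the "one address per rowgroup" point, here is a minimal sketch
in Python (not parquet-java's actual code). The names RowGroup and
row_groups_for_split, the half-open split range, and all offsets are
assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    start: int                  # offset of the first column chunk (hypothetical value)
    total_compressed_size: int  # compressed bytes of the whole row group

    @property
    def midpoint(self) -> int:
        # Same formula as in the quoted message: start + total_compressed_size / 2
        return self.start + self.total_compressed_size // 2


def row_groups_for_split(row_groups, split_start, split_end):
    """Keep the row groups whose midpoint lies in [split_start, split_end)."""
    return [rg for rg in row_groups if split_start <= rg.midpoint < split_end]


# Three row groups and two splits that tile the byte range [0, 256).
row_groups = [RowGroup(4, 100), RowGroup(104, 100), RowGroup(204, 50)]
splits = [(0, 128), (128, 256)]
assignment = [row_groups_for_split(row_groups, s, e) for s, e in splits]
```

Because the splits tile the file without overlap and each rowgroup is
reduced to one offset, every rowgroup lands in exactly one split, even
though the second rowgroup's bytes straddle the boundary at 128.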

I've seen lots of parquet files where the assumptions don't hold.  But
parquet-java's split will perform correctly - if suboptimally - on them.
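
For anyone who wants to check a file against the two assumptions, a rough
Python sketch follows. ColumnChunk and both helper functions are
hypothetical names, the offsets are invented, and the contiguity check
treats any gap or padding between chunks as a violation, which may be
stricter than you want.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    offset: int           # file offset where the chunk starts
    compressed_size: int  # compressed bytes of the chunk


def first_column_is_lowest(chunks):
    """Assumption 1: the first chunk in metadata order has the smallest offset."""
    return chunks[0].offset == min(c.offset for c in chunks)


def is_contiguous(chunks):
    """Assumption 2: the chunks occupy one unbroken byte range."""
    ordered = sorted(chunks, key=lambda c: c.offset)
    return all(a.offset + a.compressed_size == b.offset
               for a, b in zip(ordered, ordered[1:]))


# Chunks listed in metadata (schema) order for two made-up row groups.
conventional = [ColumnChunk(4, 60), ColumnChunk(64, 40)]
reordered = [ColumnChunk(44, 60), ColumnChunk(4, 40)]
```

With the reordered layout, a "start" taken from the first column in
metadata order would be 44 rather than 4, so the computed midpoint shifts,
but it is still a single address and the rowgroup still lands in one split.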

best, Will

On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:

> Hi all,
>
> While looking into how file splits work, we noticed that parquet-java
> assigns a row group to a split using the row group’s midpoint
> (parquet-java code link:
> <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
>
> start = offset of the first column chunk
>
> midpoint = start + total_compressed_size / 2
>
> This seems to rely on two assumptions:
>
>    1. Column chunks are physically stored in schema order, so the first column
>       in RowGroup.columns also has the smallest file offset.
>    2. The column chunks of a row group are stored contiguously (i.e. row
>       groups are not overlapping).
>
> Are these requirements guaranteed by the Parquet format specification, or
> are they only conventions followed by common writers?
>
> Based on our reading, the Parquet specification only requires the column
> metadata to follow the same order as the SchemaElement list in
> FileMetaData.
> It does not appear to require the corresponding file offsets to be in
> ascending order, nor does it explicitly prohibit row groups from
> overlapping.
>
> Is that the correct interpretation? Either way, it would be helpful for the
> specification to clarify these requirements and remove the ambiguity.
>
> Best,
> Jiayi
>
