Hi,

Regarding:
> Column chunks are physically stored in schema order, so the first column
> in RowGroup.columns also has the smallest file offset.

I believe this is not required by the parquet specification. I encountered
this same question while writing a parquet driver (
https://github.com/Earnix/parquetforge):

The parquet-format spec requires the ColumnChunk order in the footer to
match the schema order. That's documented here:
https://github.com/apache/parquet-format/blob/1dbc814b97c9307687a2e4bee55545ab6a2ef106/src/main/thrift/parquet.thrift#L1002
From the JIRA ticket linked in the commit message (now migrated to GH) -
#1734 - it seems that at least Impala accepts columns written in arbitrary
order, and that ordering the Dict/DataPages of a column chunk is not required.
That said, be aware that writing them out of order can adversely affect
performance. See:
https://dl.acm.org/doi/10.1145/3035918.3035930

See also this thread:
https://github.com/apache/parquet-java/pull/1273/changes#r1536748360
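
To illustrate the difference between footer order and physical order, here
is a minimal Python sketch. ChunkMeta and first_column_has_smallest_offset
are hypothetical names, not part of any Parquet library; only
data_page_offset and dictionary_page_offset correspond to ColumnMetaData
fields. It checks whether the first chunk in footer order is also the first
one physically, which is the assumption under discussion:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkMeta:
    """Hypothetical stand-in for the offsets found in ColumnMetaData."""
    data_page_offset: int
    dictionary_page_offset: Optional[int] = None

    def start(self) -> int:
        # Take the smaller of the two offsets, so the sketch does not
        # assume the dictionary page precedes the data pages.
        if self.dictionary_page_offset is None:
            return self.data_page_offset
        return min(self.dictionary_page_offset, self.data_page_offset)


def first_column_has_smallest_offset(chunks: List[ChunkMeta]) -> bool:
    """chunks is in footer (schema) order; True if chunks[0] starts first."""
    starts = [c.start() for c in chunks]
    return starts[0] == min(starts)
```

A reader that wants to be safe could run a check like this per row group and
fall back to the minimum start offset when it returns False.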

> The column chunks of a row group are stored contiguously; that is, row
> groups do not overlap.

I don't have a reference for this, but it seems likely that many Parquet
drivers would fail to read a file whose row groups are not contiguous,
especially in the use case Jiayi explained, where files are split up and
shared. This is why the recommended row group target size is 0.5-1 GB
here - https://parquet.apache.org/docs/file-format/configurations/
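
For concreteness, here is a small Python sketch of the two layout checks and
of the midpoint rule as described in this thread. All names (Chunk,
is_contiguous, row_groups_overlap, midpoint_split) are hypothetical, and the
fixed split_size is a simplifying assumption; this is not the parquet-java
code:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """Hypothetical column chunk: first byte and compressed length."""
    start: int
    compressed_size: int


def extent(chunks: List[Chunk]) -> Tuple[int, int]:
    """Smallest byte range [lo, hi) covering every chunk of a row group."""
    lo = min(c.start for c in chunks)
    hi = max(c.start + c.compressed_size for c in chunks)
    return lo, hi


def is_contiguous(chunks: List[Chunk]) -> bool:
    """True if, in physical order, each chunk begins where the previous ends."""
    ordered = sorted(chunks, key=lambda c: c.start)
    return all(a.start + a.compressed_size == b.start
               for a, b in zip(ordered, ordered[1:]))


def row_groups_overlap(row_groups: List[List[Chunk]]) -> bool:
    """True if the byte extents of any two row groups intersect."""
    extents = sorted(extent(rg) for rg in row_groups)
    return any(prev_hi > next_lo
               for (_, prev_hi), (next_lo, _) in zip(extents, extents[1:]))


def midpoint_split(chunks: List[Chunk], split_size: int) -> int:
    """Index of the fixed-size split containing the row group's midpoint.

    Follows the rule quoted in the thread: start is taken from the first
    chunk in footer order, which is only the true start if that chunk
    also has the smallest offset.
    """
    start = chunks[0].start
    total = sum(c.compressed_size for c in chunks)
    return (start + total // 2) // split_size
```

With chunks at offsets 4 and 104 (100 bytes each) and a split size of 128,
the row group lands in split 0; listing the same two chunks in the opposite
footer order moves it to split 1, even though the bytes on disk are the same.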

Andrew Pikler


On Wed, Jun 17, 2026 at 7:23 PM Jiayi Wang <[email protected]> wrote:

> Hi Andrew,
>
> Good questions.
>
> "Split" here is the Hadoop `FileSplit` concept, not anything in the Parquet
> spec: engines like Spark divide a file into byte ranges for parallel tasks,
> and parquet-java assigns each row group to a split based on the row group’s
> midpoint (start + total_compressed_size/2).
>
> The "// Visible for testing" note just widens the method's visibility; the
> midpoint logic is the real split-assignment code. My point is that it
> silently assumes (1) column 0 has the smallest offset, (2) row groups
> don't overlap, and (3) as Ed pointed out, the first column chunk starts at
> offset 4, none of which the spec requires.
>
> I also checked the Arrow C++ read path, which makes none of these
> assumptions. It locates each column chunk independently using its own
> data_page_offset and dictionary_page_offset, never reads the file_offset
> fields, and splits work purely by row-group index, without any midpoint
> calculation.
>
> So we already have two major implementations that disagree on what they
> assume about physical layout. That is exactly why I think the specification
> should explicitly distinguish writer conventions from format guarantees.
> Happy to file a parquet-format PR with proposed wording if that's the right
> next step.
>
> Best,
> Jiayi
>
> Andrew Lamb <[email protected]> wrote on Wed, Jun 17, 2026 at 16:00:
>
> > > While looking into how file splits work, we noticed that parquet-java
> > > assigns a row group to a split using the row group’s midpoint
> >
> > Could you be clearer what "split" means in parquet-java and if it is a
> > Java specific feature or something that is defined in the parquet spec? I
> > don't quite understand if this is something specific to parquet-java or
> > parquet as a whole.
> >
> > The link you provided
> >
> > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555
> >
> > goes to a function that implies it is for testing
> >
> > > // Visible for testing
> > >   static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData,
> > >       RangeMetadataFilter filter) {
> >
> > > Since parquet-java is one of the main sources of truth for Parquet
> > > readers and writers, I strongly suspect that there are some
> > > misunderstandings within the community.
> >
> > FWIW I think the C/C++ implementation (via pyarrow/python bindings) is
> > also very widely used, and parquet-java can't be used as a test for what
> > data will be found in Parquet files.
> >
> > On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <[email protected]> wrote:
> >
> > > Hi Jiayi,
> > >
> > > yes I think you're right: the current split code has a few checks that
> > > would fail on some real parquet files I have that aren't laid out
> > > sequentially but still, by my reading, follow the letter of the spec.
> > >
> > > On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
> > >
> > > > Hi Will,
> > > >
> > > > Thanks for replying.
> > > > > Using a single address to represent each rowgroup (whether that is
> > > > > the true 'midpoint' or not) guarantees that each rowgroup is only in
> > > > > one split.
> > > >
> > > > This is my understanding as well. However, things become a bit more
> > > > confusing when it comes to the optional `RowGroup.file_offset` field:
> > > >
> > > > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> > > >
> > > > As far as I understand, some versions of parquet-java used to emit
> > > > incorrect file_offset values for row groups, and parquet-java uses
> > > > this function to invalidate those offsets.
> > > >
> > > > The function appears to rely on the two assumptions I mentioned
> > > > earlier:
> > > >
> > > >    1. Column chunks are physically stored in schema order, so the
> > > >    first column in RowGroup.columns also has the smallest file offset.
> > > >    2. The column chunks of a row group are stored contiguously; that
> > > >    is, row groups do not overlap.
> > > >
> > > > Since parquet-java is one of the main sources of truth for Parquet
> > > > readers and writers, I strongly suspect that there are some
> > > > misunderstandings within the community. The specification should be
> > > > clearer about what is allowed and what is disallowed.
> > > >
> > > > Best,
> > > > Jiayi
> > > >
> > > > Will Edwards via dev <[email protected]> wrote on Tue, Jun 16, 2026 at 16:55:
> > > >
> > > > > Hi Jiayi,
> > > > >
> > > > > My guess is that the splitting is a parquet-java client feature and
> > > > > not part of the spec?
> > > > >
> > > > > Using a single address to represent each rowgroup (whether that is
> > > > > the true 'midpoint' or not) guarantees that each rowgroup is only in
> > > > > one split, but it doesn't guarantee optimal splitting.  However, it
> > > > > performs well when the assumptions you listed hold.
> > > > >
> > > > > I've seen lots of parquet files where the assumptions don't hold.
> > > > > But parquet-java's split will perform correctly - if suboptimally - on
> > > > > them.
> > > > >
> > > > > best, Will
> > > > >
> > > > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > While looking into how file splits work, we noticed that
> > > > > > parquet-java assigns a row group to a split using the row group’s
> > > > > > midpoint (parquet-java code link
> > > > > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > > > > >
> > > > > > start = offset of the first column chunk
> > > > > >
> > > > > > midpoint = start + total_compressed_size / 2
> > > > > >
> > > > > > This seems to rely on two assumptions:
> > > > > >
> > > > > >    1. Column chunks are physically stored in schema order, so the
> > > > > >    first column in RowGroup.columns also has the smallest file
> > > > > >    offset.
> > > > > >    2. The column chunks of a row group are stored contiguously (i.e.
> > > > > >    row groups are not overlapping)
> > > > > >
> > > > > > Are these requirements guaranteed by the Parquet format
> > > > > > specification, or are they only conventions followed by common
> > > > > > writers?
> > > > > >
> > > > > > Based on our reading, the Parquet specification only requires the
> > > > > > column metadata to follow the same order as the SchemaElement list in
> > > > > > FileMetaData. It does not appear to require the corresponding file
> > > > > > offsets to be in ascending order, nor does it explicitly prohibit row
> > > > > > groups from overlapping.
> > > > > >
> > > > > > Is that the correct interpretation? Either way, it would be helpful
> > > > > > for the specification to clarify these requirements and remove the
> > > > > > ambiguity.
> > > > > >
> > > > > > Best,
> > > > > > Jiayi
> > > > > >
> > > > >
> > > >
> > >
> >
>
