> That is exactly why I think the specification
> should explicitly distinguish writer conventions from format guarantees.

I agree.

> Happy to file a parquet-format PR with proposed wording if that's the
> right next step.

Sounds like a good idea to me. Thank you!

Andrew

On Wed, Jun 17, 2026 at 12:23 PM Jiayi Wang <[email protected]> wrote:

> Hi Andrew,
>
> Good questions.
>
> "Split" here is the Hadoop `FileSplit` concept, not anything in the Parquet
> spec: engines like Spark divide a file into byte ranges for parallel tasks,
> and parquet-java assigns each row group to a split based on the row group’s
> midpoint (start + total_compressed_size/2).
>
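
To make the midpoint rule above concrete, here is a minimal sketch in Python. It is not parquet-java's code: `RowGroupInfo`, `midpoint`, and `row_groups_for_split` are hypothetical names, and treating a split as the half-open byte range [start, end) is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RowGroupInfo:
    """Hypothetical stand-in for the row-group fields discussed above."""
    start: int                  # offset of the first column chunk
    total_compressed_size: int  # compressed bytes in the row group


def midpoint(rg: RowGroupInfo) -> int:
    # midpoint = start + total_compressed_size / 2 (integer division here)
    return rg.start + rg.total_compressed_size // 2


def row_groups_for_split(row_groups: List[RowGroupInfo],
                         split_start: int, split_end: int) -> List[int]:
    """Indices of row groups whose midpoint lies in [split_start, split_end)."""
    return [i for i, rg in enumerate(row_groups)
            if split_start <= midpoint(rg) < split_end]


# Three back-to-back row groups and two byte-range splits (made-up numbers).
row_groups = [RowGroupInfo(4, 100), RowGroupInfo(104, 300), RowGroupInfo(404, 50)]
splits = [(0, 200), (200, 454)]
assignment = [row_groups_for_split(row_groups, s, e) for s, e in splits]
```

Because each row group is reduced to a single address, it lands in exactly one split as long as the splits tile the file without gaps or overlaps. Whether that address is a meaningful midpoint depends on `start` really being the lowest offset in the row group.
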
> The "// Visible for testing" note just widens the method's visibility; the
> midpoint logic is the real split-assignment code. My point is that it
> silently assumes (1) column 0 has the smallest offset, (2) row groups
> don't overlap, and (3) as Ed pointed out, the first column chunk starts at
> offset 4, none of which the spec requires.
>
> I also checked the Arrow C++ read path, which makes none of these
> assumptions. It locates each column chunk independently using its own
> data_page_offset and dictionary_page_offset, never reads the file_offset
> fields, and splits work purely by row-group index, without any midpoint
> calculation.
>
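
As a rough sketch of what locating each column chunk from its own offsets can look like, and of how the two layout assumptions could be checked for a given row group, consider the Python below. It is not Arrow's code: `ChunkInfo` and the helper functions are hypothetical, and it assumes that total_compressed_size covers the dictionary page when one is present.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ChunkInfo:
    """Hypothetical stand-in for the per-column-chunk metadata fields."""
    data_page_offset: int
    total_compressed_size: int
    dictionary_page_offset: Optional[int] = None  # None: no dictionary page


def chunk_range(chunk: ChunkInfo) -> Tuple[int, int]:
    """Byte range [start, end) of one chunk, using only its own offsets."""
    start = chunk.data_page_offset
    if chunk.dictionary_page_offset is not None:
        start = min(start, chunk.dictionary_page_offset)
    return start, start + chunk.total_compressed_size


def first_column_is_lowest(chunks: List[ChunkInfo]) -> bool:
    """Assumption 1: the first column in metadata order has the smallest offset."""
    starts = [chunk_range(c)[0] for c in chunks]
    return starts[0] == min(starts)


def chunks_are_contiguous(chunks: List[ChunkInfo]) -> bool:
    """Assumption 2: the chunks form one gap-free, non-overlapping byte run."""
    ranges = sorted(chunk_range(c) for c in chunks)
    return all(prev_end == next_start
               for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]))


# Made-up layouts: schema order, physically reordered, and with a gap.
schema_order = [ChunkInfo(4, 100), ChunkInfo(104, 50)]
reordered = [ChunkInfo(104, 50), ChunkInfo(4, 100)]
with_gap = [ChunkInfo(4, 100), ChunkInfo(200, 50)]
```

Here `schema_order` satisfies both checks, `reordered` fails only the first, and `with_gap` fails only the second. A reader that works per chunk, as in `chunk_range`, handles all three layouts the same way.
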
> So we already have two major implementations that disagree on what they
> assume about physical layout. That is exactly why I think the specification
> should explicitly distinguish writer conventions from format guarantees.
> Happy to file a parquet-format PR with proposed wording if that's the right
> next step.
>
> Best,
> Jiayi
>
> Andrew Lamb <[email protected]> wrote on Wed, Jun 17, 2026 at 16:00:
>
> > > While looking into how file splits work, we noticed that parquet-java
> > > assigns a row group to a split using the row group’s midpoint
> >
> > Could you be clearer about what "split" means in parquet-java, and whether
> > it is a Java-specific feature or something defined in the Parquet spec? I
> > don't quite understand whether this is specific to parquet-java or applies
> > to Parquet as a whole.
> >
> > The link you provided
> >
> > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555
> >
> > goes to a function that implies it is for testing:
> >
> > > // Visible for testing
> > >   static FileMetaData filterFileMetaDataByMidpoint(FileMetaData metaData, RangeMetadataFilter filter) {
> >
> > > Since parquet-java is one of the main sources of truth for Parquet
> > > readers and writers, I strongly suspect that there are some
> > > misunderstandings within the community.
> >
> > FWIW I think the C/C++ implementation (via pyarrow/python bindings) is also
> > very widely used, and parquet-java can't be used as a test for what data
> > will be found in Parquet files.
> >
> > On Wed, Jun 17, 2026 at 7:02 AM Will Edwards via dev <[email protected]> wrote:
> >
> > > Hi Jiayi,
> > >
> > > Yes, I think you're right: the current split code has a few checks that
> > > would fail on some real parquet files I have that aren't laid out
> > > sequentially but still, by my reading, follow the letter of the spec.
> > >
> > > On Wed, 17 Jun 2026 at 10:17, Jiayi Wang <[email protected]> wrote:
> > >
> > > > Hi Will,
> > > >
> > > > Thanks for replying.
> > > > > Using a single address to represent each rowgroup (whether that is
> > > > > the true 'midpoint' or not) guarantees that each rowgroup is only in
> > > > > one split.
> > > >
> > > > This is my understanding as well. However, things become a bit more
> > > > confusing when it comes to the optional `RowGroup.file_offset` field:
> > > >
> > > >
> > > > https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1559-L1577
> > > >
> > > > As far as I understand, some versions of parquet-java used to emit
> > > > incorrect file_offset values for row groups, and parquet-java uses this
> > > > function to invalidate those offsets.
> > > >
> > > > The function appears to rely on the two assumptions I mentioned
> > > > earlier:
> > > >
> > > >    1. Column chunks are physically stored in schema order, so the first
> > > >    column in RowGroup.columns also has the smallest file offset.
> > > >    2. The column chunks of a row group are stored contiguously; that is,
> > > >    row groups do not overlap.
> > > >
> > > > Since parquet-java is one of the main sources of truth for Parquet
> > > > readers and writers, I strongly suspect that there are some
> > > > misunderstandings within the community. The specification should be
> > > > clearer about what is allowed and what is disallowed.
> > > >
> > > > Best,
> > > > Jiayi
> > > >
> > > > Will Edwards via dev <[email protected]> wrote on Tue, Jun 16, 2026 at 16:55:
> > > >
> > > > > Hi Jiayi,
> > > > >
> > > > > My guess is that the splitting is a parquet-java client feature and
> > > > > not part of the spec?
> > > > >
> > > > > Using a single address to represent each rowgroup (whether that is
> > > > > the true 'midpoint' or not) guarantees that each rowgroup is only in
> > > > > one split, but it doesn't guarantee optimal splitting.  However, it
> > > > > performs well when the assumptions you listed hold.
> > > > >
> > > > > I've seen lots of parquet files where the assumptions don't hold.
> > > > > But parquet-java's split will perform correctly - if suboptimally -
> > > > > on them.
> > > > >
> > > > > best, Will
> > > > >
> > > > > On Tue, 16 Jun 2026 at 16:03, Jiayi Wang <[email protected]> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > While looking into how file splits work, we noticed that
> > > > > > parquet-java assigns a row group to a split using the row group’s
> > > > > > midpoint (parquet-java code link
> > > > > > <https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1504-L1555>):
> > > > > >
> > > > > > start = offset of the first column chunk
> > > > > >
> > > > > > midpoint = start + total_compressed_size / 2
> > > > > >
> > > > > > This seems to rely on two assumptions:
> > > > > >
> > > > > >    1. Column chunks are physically stored in schema order, so the
> > > > > >    first column in RowGroup.columns also has the smallest file offset.
> > > > > >    2. The column chunks of a row group are stored contiguously (i.e.
> > > > > >    row groups are not overlapping)
> > > > > >
> > > > > > Are these requirements guaranteed by the Parquet format
> > > > > > specification, or are they only conventions followed by common
> > > > > > writers?
> > > > > >
> > > > > > Based on our reading, the Parquet specification only requires the
> > > > > > column metadata to follow the same order as the SchemaElement list
> > > > > > in FileMetaData. It does not appear to require the corresponding
> > > > > > file offsets to be in ascending order, nor does it explicitly
> > > > > > prohibit row groups from overlapping.
> > > > > >
> > > > > > Is that the correct interpretation? Either way, it would be helpful
> > > > > > for the specification to clarify these requirements and remove the
> > > > > > ambiguity.
> > > > > >
> > > > > > Best,
> > > > > > Jiayi
