Interesting, thanks for the input so far. Since the spec doesn't say this
exactly, let me spin this one step further:

May *row groups* start with an R-Level > 0? Intuitively, I would say "hell
no", but there is nothing in the Parquet spec that would say that this is
forbidden.
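
To make the question concrete, here is a minimal sketch of the check I have in mind. The helper names are hypothetical and plain Python lists stand in for already-decoded repetition levels; this is not any library's API, just an illustration of the invariant being discussed:

```python
from typing import List, Sequence


def pages_starting_mid_record(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of pages whose first repetition level is > 0.

    Hypothetical helper: each page is a sequence of decoded r-levels.
    """
    return [i for i, levels in enumerate(pages) if levels and levels[0] > 0]


def records_started_per_page(pages: Sequence[Sequence[int]]) -> List[int]:
    """Count how many records begin in each page (entries with r-level 0)."""
    return [sum(1 for r in levels if r == 0) for levels in pages]


# A list column holding [[1, 2, 3], [4], [5, 6]] has r-levels 0 1 1 0 0 1.
aligned = [[0, 1, 1], [0, 0, 1]]   # page break falls on a record boundary
split = [[0, 1], [1, 0, 0, 1]]     # page break falls inside the first record

print(pages_starting_mid_record(aligned))
print(pages_starting_mid_record(split))
print(records_started_per_page(split))
```

In the `split` layout, the second page carries the tail of the first record, so a reader that skips the first page cannot tell where that record ends. The same reasoning would apply if the first page of a row group started with an r-level > 0.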

On Fri, May 10, 2024 at 12:21 PM Micah Kornfield <[email protected]>
wrote:

> >
> >    - I.e., is a parquet file with a page that starts at an r-level > 0
> >    ill-formed? I.e., is this a bug in pyarrow.parquet?
>
> As noted above, my understanding is that it is only ill-formed if a page index
> is present OR data-page V2 is present.  If neither holds, then I think it is
> a valid parquet file.
>
> This was a long-standing bug in Arrow for V2 pages which should have been
> fixed in
>
> https://github.com/apache/arrow/commit/b888f4d6c7dc490ce17b9f64d32af23ffc6f4617
>
> On Fri, May 10, 2024 at 11:48 AM Julien Le Dem
> <[email protected]> wrote:
>
> > Jan, your understanding of the Parquet spec is correct.
> > The semantics of "num_rows" and "first_row_index" do require records to
> > *not* be split across pages.
> > Push downs and page skipping require this to be true.
> > I would consider the behavior of splitting a record across pages as a bug
> > in pyarrow.parquet.
> > I'd support updating the spec to have stronger language if you think it is
> > necessary.
> >
> > On Fri, May 10, 2024 at 11:36 AM Andrew Lamb <[email protected]>
> > wrote:
> >
> > > We encountered a similar question / issue in the Rust parquet
> > > implementation[1].
> > >
> > > Raphael's conclusion was that pages need to start with r-level 0 if using
> > > V2 data pages or if there is a page index. Among other reasons, if this
> > > doesn't hold, it is not possible to do pushdown on nested columns as you
> > > have no idea where the last record actually ends.
> > >
> > > We updated the parquet-rs reader to make this assumption in [2]
> > >
> > > If others on this thread agree I would be happy to draft a spec
> > > clarification on this point
> > >
> > > Andrew
> > >
> > >
> > >
> > >
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/3680
> > > [2] https://github.com/apache/arrow-rs/pull/4943
> > >
> > >
> > >
> > > On Fri, May 10, 2024 at 1:15 PM Jan Finis <[email protected]> wrote:
> > >
> > > > Hey Parquet devs,
> > > >
> > > > So far, I thought that Parquet mandates that records start at page
> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > > > places of our engine. That means there cannot be any data page for a
> > > > REPEATED column that starts at an r-level > 0, as this would mean that a
> > > > record would be split between multiple pages.
> > > >
> > > > I also found the two comments in parquet.thrift:
> > > >
> > > >   /** Number of rows in this data page. which means pages change on
> > > >       record boundaries (r = 0) **/
> > > >   3: required i32 num_rows
> > > >
> > > >   /**
> > > >    * Index within the RowGroup of the first row of the page; this means
> > > >    * pages change on record boundaries (r = 0).
> > > >    */
> > > >   3: required i64 first_row_index
> > > >
> > > >
> > > > These comments seem to imply that my understanding is correct. However,
> > > > they are worded very weakly, not like a mandate but more like a "by the
> > > > way" comment.
> > > >
> > > > I haven't found any other mention of r-levels and page boundaries in the
> > > > parquet-format repo (maybe I missed them?).
> > > >
> > > > I recently noticed that pyarrow.parquet splits repeated fields over
> > > > multiple pages, so it violates this. This triggers assertions in our
> > > > engine, so I want to understand what's the right course of action here.
> > > >
> > > > So, can we please clarify:
> > > > *Does Parquet mandate that pages need to start at r-level 0?*
> > > >
> > > >    - I.e., is a parquet file with a page that starts at an r-level > 0
> > > >    ill-formed? I.e., is this a bug in pyarrow.parquet?
> > > >    - Or can pages start at an r-level > 0? If so, then what is the
> > > >    significance of the comments in parquet.thrift?
> > > >
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > >
> >
>
