Jan, your understanding of the Parquet spec is correct.
The semantics of "num_rows" and "first_row_index" do require records to
*not* be split across pages.
Push downs and page skipping require this to be true.
I would consider the behavior of splitting a record across pages as a bug
in pyarrow.parquet.
I'd support updating the spec to have stronger language if you think it is
necessary.
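
As a rough illustration of why a split record breaks these semantics, here
is a minimal Python sketch. It assumes the repetition levels of each page
have already been decoded into plain lists; the function names are
hypothetical and are not part of pyarrow or any Parquet library.

```python
from typing import List, Sequence


def count_rows(rep_levels: Sequence[int]) -> int:
    """Count the records that *start* in a page (one per r-level 0)."""
    return sum(1 for r in rep_levels if r == 0)


def pages_splitting_records(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return indices of non-empty pages whose first value has r-level > 0."""
    return [i for i, levels in enumerate(pages) if levels and levels[0] != 0]


def first_row_indexes(pages: Sequence[Sequence[int]]) -> List[int]:
    """Derive first_row_index per page; only meaningful if no page splits a record."""
    indexes, next_row = [], 0
    for levels in pages:
        indexes.append(next_row)
        next_row += count_rows(levels)
    return indexes


# Three records [[a, b, c], [d], [e, f]] give r-levels 0 1 1 | 0 | 0 1.
aligned = [[0, 1, 1], [0, 0, 1]]   # every page starts at r-level 0
split = [[0, 1], [1, 0, 0, 1]]     # second page continues the first record

print(pages_splitting_records(aligned))
print(pages_splitting_records(split))
print(first_row_indexes(aligned))
```

In the split layout the second page touches three records but only two of
them start there, so neither "num_rows" nor "first_row_index" can describe
it unambiguously, and a reader skipping the first page would land in the
middle of a record.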

On Fri, May 10, 2024 at 11:36 AM Andrew Lamb wrote:

> We encountered a similar question / issue in the Rust parquet
> implementation[1].
>
> Raphael's conclusion was that pages need to start with r-level 0 if using
> V2 data pages or if there is a page index. Among other reasons, if this
> doesn't hold, it is not possible to do pushdown on nested columns as you
> have no idea where the last record actually ends.
>
> We updated the parquet-rs reader to make this assumption in [2].
>
> If others on this thread agree, I would be happy to draft a spec
> clarification on this point.
>
> Andrew
>
> [1] https://github.com/apache/arrow-rs/issues/3680
> [2] https://github.com/apache/arrow-rs/pull/4943
>
> On Fri, May 10, 2024 at 1:15 PM Jan Finis wrote:
>
> > Hey Parquet devs,
> >
> > I so far thought that Parquet mandates that records start at page
> > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > places of our engine. That means, there cannot be any data page for a
> > REPEATED column that starts at an r-level > 0, as this would mean that a
> > record would be split between multiple pages.
> >
> > I also found the two comments in parquet.thrift:
> >
> >   /** Number of rows in this data page. which means pages change on
> >   record boundaries (r = 0) **/
> >   3: required i32 num_rows
> >
> >   /**
> >    * Index within the RowGroup of the first row of the page; this means pages
> >    * change on record boundaries (r = 0).
> >    */
> >   3: required i64 first_row_index
> >
> >
> > These comments seem to imply that my understanding is correct. However,
> > they are worded very weakly, not like a mandate but more like a "by the
> > way" comment.
> >
> > I haven't found any other mention of r-levels and page boundaries in the
> > parquet-format repo (maybe I missed them?).
> >
> > I recently noticed that pyarrow.parquet splits repeated fields over
> > multiple pages, so it violates this. This triggers assertions in our
> > engine, so I want to understand what's the right course of action here.
> >
> > So, can we please clarify:
> > *Does Parquet mandate that pages need to start at r-level 0?*
> >
> >    - I.e., is a Parquet file with a page that starts at an r-level > 0
> >    ill-formed? I.e., is this a bug in pyarrow.parquet?
> >    - Or can pages start at an r-level > 0? If so, then what is the
> >    significance of the comments in parquet.thrift?
> >
> >
> > Cheers,
> > Jan
> >
>
