I have created a PR[1] to the spec to try and encode this mailing list
conversation and avoid future confusion.  Please have a look and let me
know if it captures it correctly.
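
For illustration only, the rule under discussion (every data page starts at
repetition level 0, so no record spans two pages) can be sketched as below.
This is a minimal sketch, not code from the PR or from any Parquet library:
pages are modeled as plain lists of repetition levels, and the names
pages_starting_mid_record and split_on_record_boundaries are hypothetical.

```python
from typing import List, Sequence


def pages_starting_mid_record(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return indices of pages whose first repetition level is not 0.

    Assumption: each page is modeled as a list of repetition levels.
    A non-empty result means some record is split across pages.
    """
    return [i for i, levels in enumerate(pages) if levels and levels[0] != 0]


def split_on_record_boundaries(
    rep_levels: Sequence[int], min_values_per_page: int
) -> List[List[int]]:
    """Greedy split that only cuts a page where the r-level is 0.

    Hypothetical writer-side helper: a page is closed once it holds at
    least min_values_per_page values AND the next value starts a new
    record (r-level 0), so pages may grow past the threshold.
    """
    pages: List[List[int]] = []
    current: List[int] = []
    for level in rep_levels:
        if level == 0 and len(current) >= min_values_per_page:
            pages.append(current)
            current = []
        current.append(level)
    if current:
        pages.append(current)
    return pages


# Repetition levels for five records of a REPEATED column.
levels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

# Cutting every 4 values ignores record boundaries.
naive_pages = [levels[i:i + 4] for i in range(0, len(levels), 4)]
aligned_pages = split_on_record_boundaries(levels, 3)

print(pages_starting_mid_record(naive_pages))
print(pages_starting_mid_record(aligned_pages))
```

The fixed-size split produces pages that begin inside a record, which is the
case that would trip a reader relying on the r = 0 assumption; the
boundary-aware split does not.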

Thanks,
Andrew

[1]: https://github.com/apache/parquet-format/pull/244

On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <jul...@apache.org> wrote:

> +1 The semantics of a row group is that it contains rows and therefore
> starts on R=0
> I generally echo Ed's sentiment here.
>
> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <andrewlam...@gmail.com>
> wrote:
>
> > Thank you all -- I have filed
> > https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying the
> > spec and will make a PR shortly
> >
> >
> > On Sun, May 12, 2024 at 12:18 AM wish maple <maplewish...@gmail.com>
> > wrote:
> >
> > > IMO when Page V2 is present or PageIndex is enabled, the boundaries
> > > should be checked [1].
> > >
> > > [1]
> > > https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
> > >
> > >
> > > On Sat, May 11, 2024 at 1:15 AM Jan Finis <jpfi...@gmail.com> wrote:
> > >
> > > > Hey Parquet devs,
> > > >
> > > > > So far I thought that Parquet mandates that pages start at record
> > > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > > > > places of our engine. That means there cannot be any data page for a
> > > > > REPEATED column that starts at an r-level > 0, as this would mean that a
> > > > > record would be split between multiple pages.
> > > >
> > > > I also found the two comments in parquet.thrift:
> > > >
> > > > >   /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/
> > > > >   3: required i32 num_rows
> > > > >
> > > > >   /**
> > > > >    * Index within the RowGroup of the first row of the page; this means
> > > > >    * pages change on record boundaries (r = 0).
> > > > >    */
> > > > >   3: required i64 first_row_index
> > > >
> > > >
> > > > > These comments seem to imply that my understanding is correct. However,
> > > > > they are worded very weakly, not like a mandate but more like a "by the
> > > > > way" comment.
> > > >
> > > > > I haven't found any other mention of r-levels and page boundaries in the
> > > > > parquet-format repo (maybe I missed them?).
> > > >
> > > > > I recently noticed that pyarrow.parquet splits repeated fields over
> > > > > multiple pages, so it violates this. This triggers assertions in our
> > > > > engine, so I want to understand what's the right course of action here.
> > > >
> > > > So, can we please clarify:
> > > > *Does Parquet mandate that pages need to start at r-level 0?*
> > > >
> > > > >    - I.e., is a parquet file with a page that starts at an r-level > 0
> > > > >    ill-formed? I.e., is this a bug in pyarrow.parquet?
> > > > >    - Or can pages start at an r-level > 0? If so, then what is the
> > > > >    significance of the comments in parquet.thrift?
> > > >
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > >
> >
>
