I'm +1 on this. "Offset Index", "Page Index", and "Column Index or Offset
Index" all look good to me.
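
To make the r-level point from the quoted thread concrete, here is a minimal
sketch of the check being discussed. It is not the API of any Parquet library:
pages are modeled as plain lists of repetition levels, and the function names
(pages_start_on_record_boundaries, first_row_indexes) are hypothetical. The
assumption is that a new record begins exactly where the repetition level is 0,
so a per-page first row index is only well defined when every page starts at
r-level 0.

```python
from typing import List, Sequence


def pages_start_on_record_boundaries(pages: Sequence[Sequence[int]]) -> bool:
    """Return True if every non-empty page begins with repetition level 0."""
    return all(page[0] == 0 for page in pages if len(page) > 0)


def first_row_indexes(pages: Sequence[Sequence[int]]) -> List[int]:
    """Derive a first row index per page by counting r-level 0 entries.

    Hypothetical helper: it refuses input where a page starts mid-record,
    because the first row of such a page is ambiguous.
    """
    if not pages_start_on_record_boundaries(pages):
        raise ValueError("a page starts at r-level > 0")
    indexes = []
    rows_seen = 0
    for page in pages:
        indexes.append(rows_seen)
        # Each r-level 0 marks the start of a new record (row).
        rows_seen += sum(1 for r in page if r == 0)
    return indexes


# Three records [a, b], [c], [d, e, f] have r-levels 0 1 | 0 | 0 1 1.
aligned = [[0, 1, 0], [0, 1, 1]]   # pages change on record boundaries
split = [[0, 1, 0, 0], [1, 1]]     # the third record spans two pages
```

With these inputs, the aligned layout passes the check and the split layout
(the kind of file Jan describes) is rejected.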

Best,
Xuwei Fu

Andrew Lamb <[email protected]> wrote on Tue, May 21, 2024 at 18:07:

> mapleFU brought up an excellent question[1].
>
> Upon further research, a "page index" seems to consist of an OffsetIndex
> and ColumnIndex, but some writers may only write OffsetIndex (and not
> ColumnIndex). See discussion on [2]
>
> Thus when we say "repeated fields must start at a page boundary if a page
> index is present OR data-page V2 is present," does that mean:
> 1. an OffsetIndex is present
> 2. both an OffsetIndex and ColumnIndex are present
> 3. Something else
>
> It seems to me that since an OffsetIndex is in terms of numbers of records,
> if it were present that would require repetition_level=0 at page
> boundaries (aka option 1).
>
> Thoughts?
> Andrew
>
>
> [1]
> https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
> [2]: https://github.com/apache/parquet-format/pull/245
>
> On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <[email protected]>
> wrote:
>
> > I have created a PR[1] to the spec to try and encode this mailing list
> > conversation and avoid future confusion.  Please have a look and let me
> > know if it captures it correctly.
> >
> > Thanks,
> > Andrew
> >
> > [1]: https://github.com/apache/parquet-format/pull/244
> >
> > On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <[email protected]> wrote:
> >
> >> +1 The semantics of a row group is that it contains rows and therefore
> >> starts on R=0
> >> I generally echo Ed's sentiment here.
> >>
> >> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <[email protected]>
> >> wrote:
> >>
> >> > Thank you all -- I have filed
> >> > https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying the
> >> > spec and will make a PR shortly
> >> >
> >> >
> >> > On Sun, May 12, 2024 at 12:18 AM wish maple <[email protected]>
> >> > wrote:
> >> >
> >> > > IMO when Page V2 is present or PageIndex is enabled, the boundaries
> >> > > should be checked [1]
> >> > >
> >> > > [1]
> >> > > https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
> >> > >
> >> > >
> >> > > Jan Finis <[email protected]> wrote on Sat, May 11, 2024 at 01:15:
> >> > >
> >> > > > Hey Parquet devs,
> >> > > >
> >> > > > I so far thought that Parquet mandates that records start at page
> >> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> >> > > > places of our engine. That means, there cannot be any data page for a
> >> > > > REPEATED column that starts at an r-level > 0, as this would mean that a
> >> > > > record would be split between multiple pages.
> >> > > >
> >> > > > I also found the two comments in parquet.thrift:
> >> > > >
> >> > > >   /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/
> >> > > >   3: required i32 num_rows
> >> > > >
> >> > > >
> >> > > >   /**
> >> > > >    * Index within the RowGroup of the first row of the page; this means pages
> >> > > >    * change on record boundaries (r = 0).
> >> > > >    */
> >> > > >   3: required i64 first_row_index
> >> > > >
> >> > > >
> >> > > > These comments seem to imply that my understanding is correct. However,
> >> > > > they are worded very weakly, not like a mandate but more like a "by the
> >> > > > way" comment.
> >> > > >
> >> > > > I haven't found any other mention of r-levels and page boundaries in the
> >> > > > parquet-format repo (maybe I missed them?).
> >> > > >
> >> > > > I recently noticed that pyarrow.parquet splits repeated fields over
> >> > > > multiple pages, so it violates this. This triggers assertions in our
> >> > > > engine, so I want to understand what's the right course of action here.
> >> > > >
> >> > > > So, can we please clarify:
> >> > > > *Does Parquet mandate that pages need to start at r-level 0?*
> >> > > >
> >> > > >    - I.e., is a parquet file with a page that starts at an r-level > 0
> >> > > >    ill-formed? I.e., is this a bug in pyarrow.parquet?
> >> > > >    - Or can pages start at r-level > 0? If so, then what is the significance
> >> > > >    of the comments in parquet.thrift?
> >> > > >
> >> > > >
> >> > > > Cheers,
> >> > > > Jan
> >> > > >
> >> > >
> >> >
> >>
> >
>
