mapleFU brought up an excellent question[1].

Upon further research, a "page index" seems to consist of an OffsetIndex
and ColumnIndex, but some writers may only write OffsetIndex (and not
ColumnIndex). See the discussion in [2].

Thus when we say "repeated fields must start at a page boundary if a page
index is present OR data-page V2 is present," does "a page index is
present" mean:
1. An OffsetIndex is present
2. Both an OffsetIndex and a ColumnIndex are present
3. Something else

It seems to me that since an OffsetIndex locates pages in terms of row
numbers (first_row_index), its presence would require repetition_level=0
at page boundaries (i.e., option 1).
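
To make option 1 concrete, here is a minimal sketch in plain Python. It is
my own illustration, not code from any Parquet implementation: the helper
names are hypothetical, and each page is modeled simply as a list of
repetition levels. It shows that a per-page first row index can only be
assigned if every page begins with repetition level 0.

```python
def count_rows(rep_levels):
    """A new row starts at every value whose repetition level is 0."""
    return sum(1 for r in rep_levels if r == 0)


def pages_start_on_record_boundaries(pages):
    """Return True if every non-empty page begins with repetition level 0."""
    return all(page[0] == 0 for page in pages if page)


def first_row_indexes(pages):
    """Hypothetical helper: assign each page the index of its first row,
    in the spirit of first_row_index in the OffsetIndex.
    Raises ValueError if a page starts in the middle of a record."""
    indexes = []
    next_row = 0
    for page_number, page in enumerate(pages):
        if page and page[0] != 0:
            raise ValueError(
                f"page {page_number} starts at r-level {page[0]}, "
                "so it has no well-defined first row"
            )
        indexes.append(next_row)
        next_row += count_rows(page)
    return indexes


# Rows [[a, b], [c], [d, e, f]] have repetition levels 0 1 0 0 1 1.
aligned = [[0, 1, 0], [0, 1, 1]]   # page break between rows 1 and 2
split = [[0, 1, 0, 0], [1, 1]]     # page break inside row 2
```

With these inputs, `first_row_indexes(aligned)` yields one row index per
page, while `first_row_indexes(split)` raises because the second page
starts at r-level 1.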

Thoughts?
Andrew


[1]: https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
[2]: https://github.com/apache/parquet-format/pull/245

On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I have created a PR[1] to the spec to try and encode this mailing list
> conversation and avoid future confusion.  Please have a look and let me
> know if it captures it correctly.
>
> Thanks,
> Andrew
>
> [1]: https://github.com/apache/parquet-format/pull/244
>
> On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <jul...@apache.org> wrote:
>
>> +1 The semantics of a row group is that it contains rows and therefore
>> starts on R=0
>> I generally echo Ed's sentiment here.
>>
>> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <andrewlam...@gmail.com>
>> wrote:
>>
>> > Thank you all -- I have filed
>> > https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying
>> > the spec and will make a PR shortly
>> >
>> >
>> > On Sun, May 12, 2024 at 12:18 AM wish maple <maplewish...@gmail.com>
>> > wrote:
>> >
>> > > IMO when Page V2 is present or PageIndex is enabled, the boundaries
>> > > should be checked [1].
>> > >
>> > > [1]
>> > > https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
>> > >
>> > >
>> > > On Sat, May 11, 2024 at 1:15 AM Jan Finis <jpfi...@gmail.com> wrote:
>> > >
>> > > > Hey Parquet devs,
>> > > >
>> > > > I so far thought that Parquet mandates that records start at page
>> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in
>> > > > some places of our engine. That means, there cannot be any data page
>> > > > for a REPEATED column that starts at an r-level > 0, as this would
>> > > > mean that a record would be split between multiple pages.
>> > > >
>> > > > I also found the two comments in parquet.thrift:
>> > > >
>> > > >   /** Number of rows in this data page. which means pages change on
>> > > >       record boundaries (r = 0) **/
>> > > >   3: required i32 num_rows
>> > > >
>> > > >
>> > > >   /**
>> > > >    * Index within the RowGroup of the first row of the page; this
>> > > >    * means pages change on record boundaries (r = 0).
>> > > >    */
>> > > >   3: required i64 first_row_index
>> > > >
>> > > >
>> > > > These comments seem to imply that my understanding is correct.
>> > > > However, they are worded very weakly, not like a mandate but more
>> > > > like a "by the way" comment.
>> > > >
>> > > > I haven't found any other mention of r-levels and page boundaries in
>> > > > the parquet-format repo (maybe I missed them?).
>> > > >
>> > > > I recently noticed that pyarrow.parquet splits repeated fields over
>> > > > multiple pages, so it violates this. This triggers assertions in our
>> > > > engine, so I want to understand what's the right course of action
>> > > > here.
>> > > >
>> > > > So, can we please clarify:
>> > > > *Does Parquet mandate that pages need to start at r-level 0?*
>> > > >
>> > > >    - I.e., is a parquet file with a page that starts at an r-level > 0
>> > > >    ill-formed? I.e., is this a bug in pyarrow.parquet?
>> > > >    - Or can pages start at r-level > 0? If so, then what is the
>> > > >    significance of the comments in parquet.thrift?
>> > > >
>> > > >
>> > > > Cheers,
>> > > > Jan
>> > > >
>> > >
>> >
>>
>
