BTW, it seems totally valid to create a page index for only a subset of
the columns. Does that mean columns without a page index can have
records that span more than one page?
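
To make the invariant under discussion concrete, here is a minimal sketch in plain Python. It does not use any Parquet library: the function names and the "one list of repetition levels per data page" representation are hypothetical and only illustrate the rule that a page should start at repetition level 0.

```python
from typing import List, Sequence


def pages_not_starting_at_record_boundary(
    pages: Sequence[Sequence[int]],
) -> List[int]:
    """Return the indexes of pages whose first repetition level is not 0.

    `pages` is a hypothetical in-memory stand-in for one column chunk:
    each element holds the repetition levels stored in one data page.
    """
    return [
        i
        for i, rep_levels in enumerate(pages)
        if len(rep_levels) > 0 and rep_levels[0] != 0
    ]


def records_started_per_page(pages: Sequence[Sequence[int]]) -> List[int]:
    """Count how many records begin in each page (entries with r-level 0)."""
    return [sum(1 for r in rep_levels if r == 0) for rep_levels in pages]


# A list column holding the records [[1, 2], [3], [4, 5, 6]] has the
# repetition levels 0, 1, 0, 0, 1, 1.
aligned_pages = [[0, 1, 0], [0, 1, 1]]   # every page starts a new record
split_pages = [[0, 1, 0, 0], [1, 1]]     # the last record spans two pages
```

With `split_pages`, the second page starts at r-level 1, so a per-page row count cannot describe it: the page contains values but no record starts in it.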

Best,
Gang

On Tue, May 21, 2024 at 7:26 PM Gang Wu <[email protected]> wrote:

> I would like to ask if it is valid to create only ColumnIndex but omit
> OffsetIndex?
> My answer is NO according to [1]. If agreed, my inclination is option 1.
>
> [1]
> https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022
>
>
>
> On Tue, May 21, 2024 at 6:31 PM wish maple <[email protected]> wrote:
>
>> I'm +1 on this. "Offset Index", "Page Index", and "Column Index or Offset
>> Index" all look good to me.
>>
>> Best,
>> Xuwei Fu
>>
>> On Tue, May 21, 2024 at 6:07 PM Andrew Lamb <[email protected]> wrote:
>>
>> > mapleFU brought up an excellent question[1].
>> >
>> > Upon further research, a "page index" seems to consist of an OffsetIndex
>> > and ColumnIndex, but some writers may only write OffsetIndex (and not
>> > ColumnIndex). See discussion on [2]
>> >
>> > Thus when we say "repeated fields must start at a page boundary if a page
>> > index is present OR data-page V2 is present," does that mean:
>> > 1. an OffsetIndex is present
>> > 2. both an OffsetIndex and ColumnIndex are present
>> > 3. Something else
>> >
>> > It seems to me that since an OffsetIndex is in terms of numbers of records,
>> > if it were present that would require repetition_level=0 at page
>> > boundaries (aka option 1).
>> >
>> > Thoughts?
>> > Andrew
>> >
>> >
>> > [1]: https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
>> > [2]: https://github.com/apache/parquet-format/pull/245
>> >
>> > On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <[email protected]> wrote:
>> >
>> > > I have created a PR[1] to the spec to try and encode this mailing list
>> > > conversation and avoid future confusion.  Please have a look and let me
>> > > know if it captures it correctly.
>> > >
>> > > Thanks,
>> > > Andrew
>> > >
>> > > [1]: https://github.com/apache/parquet-format/pull/244
>> > >
>> > > On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <[email protected]> wrote:
>> > >
>> > >> +1 The semantics of a row group is that it contains rows and therefore
>> > >> starts on R=0.
>> > >> I generally echo Ed's sentiment here.
>> > >>
>> > >> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <[email protected]> wrote:
>> > >>
>> > >> > Thank you all -- I have filed
>> > >> > https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying
>> > >> > the spec and will make a PR shortly
>> > >> >
>> > >> >
>> > >> > On Sun, May 12, 2024 at 12:18 AM wish maple <[email protected]> wrote:
>> > >> >
>> > >> > > IMO when Page V2 is present or PageIndex is enabled, the boundaries
>> > >> > > should be checked[1]
>> > >> > >
>> > >> > > [1]: https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
>> > >> > >
>> > >> > >
>> > >> > > On Sat, May 11, 2024 at 1:15 AM Jan Finis <[email protected]> wrote:
>> > >> > >
>> > >> > > > Hey Parquet devs,
>> > >> > > >
>> > >> > > > I so far thought that Parquet mandates that records start at page
>> > >> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
>> > >> > > > places of our engine. That means, there cannot be any data page for a
>> > >> > > > REPEATED column that starts at an r-level > 0, as this would mean that a
>> > >> > > > record would be split between multiple pages.
>> > >> > > >
>> > >> > > > I also found the two comments in parquet.thrift:
>> > >> > > >
>> > >> > > > >   /** Number of rows in this data page. which means pages change on
>> > >> > > > >       record boundaries (r = 0) **/
>> > >> > > > >   3: required i32 num_rows
>> > >> > > >
>> > >> > > >
>> > >> > > > >   /**
>> > >> > > > >    * Index within the RowGroup of the first row of the page; this means
>> > >> > > > >    * pages change on record boundaries (r = 0).
>> > >> > > > >    */
>> > >> > > > >   3: required i64 first_row_index
>> > >> > > >
>> > >> > > >
>> > >> > > > These comments seem to imply that my understanding is correct. However,
>> > >> > > > they are worded very weakly, not like a mandate but more like a "by the
>> > >> > > > way" comment.
>> > >> > > >
>> > >> > > > I haven't found any other mention of r-levels and page boundaries in the
>> > >> > > > parquet-format repo (maybe I missed them?).
>> > >> > > >
>> > >> > > > I recently noticed that pyarrow.parquet splits repeated fields over
>> > >> > > > multiple pages, so it violates this. This triggers assertions in our
>> > >> > > > engine, so I want to understand what's the right course of action here.
>> > >> > > >
>> > >> > > > So, can we please clarify:
>> > >> > > > *Does Parquet mandate that pages need to start at r-level 0?*
>> > >> > > >
>> > >> > > >    - I.e., is a parquet file with a page that starts at an
>> > >> > > >    r-level > 0 ill-formed? I.e., is this a bug in pyarrow.parquet?
>> > >> > > >    - Or can pages start at an r-level > 0? If so, then what is the
>> > >> > > >    significance of the comments in parquet.thrift?
>> > >> > > >
>> > >> > > >
>> > >> > > > Cheers,
>> > >> > > > Jan
>> > >> > > >
>> > >> > >
>> > >> >
>> > >>
>> > >
>> >
>>
>
