BTW, it seems totally valid to create a page index for only a subset of the columns. Does that mean columns without a page index can have records spanning more than one page?

Best,
Gang

On Tue, May 21, 2024 at 7:26 PM Gang Wu <[email protected]> wrote:

> I would like to ask if it is valid to create only ColumnIndex but omit
> OffsetIndex?
> My answer is NO according to [1]. If agreed, my inclination is option 1.
>
> [1]
> https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022
>
> On Tue, May 21, 2024 at 6:31 PM wish maple <[email protected]> wrote:
>
>> I'm +1 on this, "Offset Index", "Page Index", "Column Index or Offset
>> Index" all look good to me.
>>
>> Best,
>> Xuwei Fu
>>
>> Andrew Lamb <[email protected]> wrote on Tue, May 21, 2024 at 18:07:
>>
>> > mapleFU brought up an excellent question[1].
>> >
>> > Upon further research, a "page index" seems to consist of an OffsetIndex
>> > and ColumnIndex, but some writers may only write OffsetIndex (and not
>> > ColumnIndex). See discussion on [2]
>> >
>> > Thus when we say "repeated fields must start at a page boundary if a page
>> > index is present OR data-page V2 is present," does that mean:
>> > 1. an OffsetIndex is present
>> > 2. both an OffsetIndex and ColumnIndex are present
>> > 3. Something else
>> >
>> > It seems to me that since an OffsetIndex is in terms of numbers of records,
>> > if it were present that would require repetition_level=0 at page
>> > boundaries (aka option 1).
>> >
>> > Thoughts?
>> > Andrew
>> >
>> > [1] https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
>> > [2]: https://github.com/apache/parquet-format/pull/245
>> >
>> > On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <[email protected]> wrote:
>> >
>> > > I have created a PR[1] to the spec to try and encode this mailing list
>> > > conversation and avoid future confusion. Please have a look and let me
>> > > know if it captures it correctly.
>> > >
>> > > Thanks,
>> > > Andrew
>> > >
>> > > [1]: https://github.com/apache/parquet-format/pull/244
>> > >
>> > > On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <[email protected]> wrote:
>> > >
>> > >> +1 The semantics of a row group is that it contains rows and therefore
>> > >> starts on R=0
>> > >> I generally echo Ed's sentiment here.
>> > >>
>> > >> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <[email protected]> wrote:
>> > >>
>> > >> > Thank you all -- I have filed
>> > >> > https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying
>> > >> > the spec and will make a PR shortly
>> > >> >
>> > >> > On Sun, May 12, 2024 at 12:18 AM wish maple <[email protected]> wrote:
>> > >> >
>> > >> > > IMO when Page V2 is present or PageIndex is enabled, the boundaries
>> > >> > > should be checked[1]
>> > >> > >
>> > >> > > [1]
>> > >> > > https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
>> > >> > >
>> > >> > > Jan Finis <[email protected]> wrote on Sat, May 11, 2024 at 01:15:
>> > >> > >
>> > >> > > > Hey Parquet devs,
>> > >> > > >
>> > >> > > > I so far thought that Parquet mandates that records start at page
>> > >> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
>> > >> > > > places of our engine. That means, there cannot be any data page for a
>> > >> > > > REPEATED column that starts at an r-level > 0, as this would mean that a
>> > >> > > > record would be split between multiple pages.
>> > >> > > >
>> > >> > > > I also found the two comments in parquet.thrift:
>> > >> > > >
>> > >> > > > > /** Number of rows in this data page. which means pages change on
>> > >> > > > > record boundaries (r = 0) **/
>> > >> > > > > 3: required i32 num_rows
>> > >> > > >
>> > >> > > > > /**
>> > >> > > > >  * Index within the RowGroup of the first row of the page; this means
>> > >> > > > >  * pages change on record boundaries (r = 0).
>> > >> > > > >  */
>> > >> > > > > 3: required i64 first_row_index
>> > >> > > >
>> > >> > > > These comments seem to imply that my understanding is correct. However,
>> > >> > > > they are worded very weakly, not like a mandate but more like a "by the
>> > >> > > > way" comment.
>> > >> > > >
>> > >> > > > I haven't found any other mention of r-levels and page boundaries in the
>> > >> > > > parquet-format repo (maybe I missed them?).
>> > >> > > >
>> > >> > > > I recently noticed that pyarrow.parquet splits repeated fields over
>> > >> > > > multiple pages, so it violates this. This triggers assertions in our
>> > >> > > > engine, so I want to understand what's the right course of action here.
>> > >> > > >
>> > >> > > > So, can we please clarify:
>> > >> > > > *Does Parquet mandate that pages need to start at r-level 0?*
>> > >> > > >
>> > >> > > > - I.e., is a parquet file with a page that starts at an r-level > 0 ill
>> > >> > > >   formed? I.e., is this a bug in pyarrow.parquet?
>> > >> > > > - Or can pages start at r-level > 0? If so, then what is the significance
>> > >> > > >   of the comments in parquet.thrift?
>> > >> > > >
>> > >> > > > Cheers,
>> > >> > > > Jan
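
To make the r-level question in this thread concrete, here is a minimal Python sketch. It is not taken from parquet-format, Arrow, or any writer. It assumes each data page is already available as a plain list of decoded repetition levels, and the helper names `pages_splitting_records` and `first_row_indexes` are hypothetical. The point it illustrates is the one raised above: a per-page `first_row_index` (as stored in an OffsetIndex) is only well defined if every page starts at r = 0.

```python
from itertools import accumulate
from typing import List, Sequence


def pages_splitting_records(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return the indexes of pages whose first repetition level is > 0.

    Such a page continues a record that began on an earlier page.
    `pages` is an assumed in-memory form: one list of r-levels per page.
    """
    return [i for i, levels in enumerate(pages) if levels and levels[0] > 0]


def first_row_indexes(pages: Sequence[Sequence[int]]) -> List[int]:
    """Compute one first_row_index per page, in the spirit of an OffsetIndex.

    A new row starts at every r-level of 0, so the row count of a page is
    the number of zeros in it. This raises if any page starts mid-record,
    because then no single row index describes the start of that page.
    """
    bad = pages_splitting_records(pages)
    if bad:
        raise ValueError(f"pages {bad} start at r-level > 0")
    rows_per_page = [sum(1 for r in levels if r == 0) for levels in pages]
    # Running total of rows before each page; drop the final grand total.
    return list(accumulate(rows_per_page, initial=0))[:-1]


# Two pages that both begin a new record (r = 0).
aligned = [[0, 1, 1, 0], [0, 0, 1]]
# The second page begins with r = 1, i.e. in the middle of a record.
split = [[0, 1, 1], [1, 0, 1]]

print(pages_splitting_records(aligned))
print(first_row_indexes(aligned))
print(pages_splitting_records(split))
```

With the `split` input, `first_row_indexes` raises instead of guessing, which mirrors the kind of assertion Jan describes hitting in his engine.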
