+1. The semantics of a row group are that it contains rows, so it starts at R=0. I generally echo Ed's sentiment here.
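
For what it's worth, here is a minimal Python sketch of the invariant under discussion. It is not taken from any writer implementation: the function names and the `max_values` threshold are hypothetical, and repetition levels are modeled as a plain list of ints. It assumes the stream itself starts at r = 0, as a row group does.

```python
from typing import List, Sequence


def page_starts_on_record_boundary(rep_levels: Sequence[int]) -> bool:
    """Return True if the page is empty or its first repetition level is 0."""
    return len(rep_levels) == 0 or rep_levels[0] == 0


def split_on_record_boundaries(
    rep_levels: Sequence[int], max_values: int
) -> List[List[int]]:
    """Cut a stream of repetition levels into pages, closing a page only at r == 0.

    `max_values` is a soft limit (hypothetical parameter): a page may grow past
    it so that a record is never split across two pages.
    """
    pages: List[List[int]] = []
    current: List[int] = []
    for r in rep_levels:
        # A new record begins at r == 0; that is the only legal place to cut.
        if r == 0 and len(current) >= max_values:
            pages.append(current)
            current = []
        current.append(r)
    if current:
        pages.append(current)
    return pages


# Three records of a REPEATED column: [a, b, c], [d], [e, f]
levels = [0, 1, 1, 0, 0, 1]

pages = split_on_record_boundaries(levels, max_values=2)

# A naive fixed-size split ignores record boundaries.
naive_pages = [levels[i:i + 2] for i in range(0, len(levels), 2)]
```

With these inputs, every page from `split_on_record_boundaries` passes `page_starts_on_record_boundary`, while the naive split produces a page that begins at r = 1, which is the case that trips a reader relying on the r = 0 assumption.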

On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <[email protected]> wrote:

> Thank you all -- I have filed
> https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying the
> spec and will make a PR shortly.
>
> On Sun, May 12, 2024 at 12:18 AM wish maple <[email protected]> wrote:
>
> > IMO when Page V2 is present or PageIndex is enabled, the boundaries
> > should be checked [1].
> >
> > [1] https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
> >
> > On Sat, May 11, 2024 at 01:15, Jan Finis <[email protected]> wrote:
> >
> > > Hey Parquet devs,
> > >
> > > Until now I thought that Parquet mandates that records start at page
> > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > > places of our engine. That means there cannot be any data page for a
> > > REPEATED column that starts at an r-level > 0, as this would mean that
> > > a record is split between multiple pages.
> > >
> > > I also found these two comments in parquet.thrift:
> > >
> > > > /** Number of rows in this data page. which means pages change on record
> > > > boundaries (r = 0) **/
> > > > 3: required i32 num_rows
> > >
> > > > /**
> > > >  * Index within the RowGroup of the first row of the page; this means
> > > >  * pages change on record boundaries (r = 0).
> > > >  */
> > > > 3: required i64 first_row_index
> > >
> > > These comments seem to imply that my understanding is correct. However,
> > > they are worded very weakly, not like a mandate but more like a "by the
> > > way" comment.
> > >
> > > I haven't found any other mention of r-levels and page boundaries in the
> > > parquet-format repo (maybe I missed them?).
> > >
> > > I recently noticed that pyarrow.parquet splits repeated fields over
> > > multiple pages, so it violates this. This triggers assertions in our
> > > engine, so I want to understand what the right course of action is here.
> > >
> > > So, can we please clarify:
> > > *Does Parquet mandate that pages need to start at r-level 0?*
> > >
> > >    - I.e., is a Parquet file with a page that starts at an r-level > 0
> > >    ill-formed? I.e., is this a bug in pyarrow.parquet?
> > >    - Or can pages start at an r-level > 0? If so, then what is the
> > >    significance of the comments in parquet.thrift?
> > >
> > > Cheers,
> > > Jan
