Interesting, thanks for the input so far. Since the spec doesn't say this exactly, let me take this one step further:
May *row groups* start with an r-level > 0? Intuitively, I would say "hell no", but nothing in the Parquet spec says that this is forbidden.

On Fri, May 10, 2024 at 12:21 PM Micah Kornfield <[email protected]> wrote:

> > - I.e., is a parquet file with a page that starts at an r-level > 0 ill
> > formed? I.e., is this a bug in pyarrow.parquet?
>
> As noted above, my understanding is that it is only ill-formed if a page
> index is present OR data-page V2 is present. If neither holds, then I think
> it is a valid parquet file.
>
> This was a long standing bug in Arrow for V2 pages which should have been
> fixed in
>
> https://github.com/apache/arrow/commit/b888f4d6c7dc490ce17b9f64d32af23ffc6f4617
>
> On Fri, May 10, 2024 at 11:48 AM Julien Le Dem <[email protected]> wrote:
>
> > Jan, your understanding of the Parquet spec is correct.
> > The semantics of "num_rows" and "first_row_index" do require records to
> > *not* be split across pages.
> > Push downs and page skipping require this to be true.
> > I would consider the behavior of splitting a record across pages as a bug
> > in pyarrow.parquet.
> > I'd support updating the spec to have stronger language if you think it is
> > necessary.
> >
> > On Fri, May 10, 2024 at 11:36 AM Andrew Lamb <[email protected]> wrote:
> >
> > > We encountered a similar question / issue in the Rust parquet
> > > implementation[1].
> > >
> > > Raphael's conclusion was that pages need to start with r-level 0 if using
> > > V2 data pages or if there is a page index. Among other reasons, if this
> > > doesn't hold, it is not possible to do pushdown on nested columns as you
> > > have no idea where the last record actually ends.
> > >
> > > We updated the parquet-rs reader to make this assumption in [2]
> > >
> > > If others on this thread agree I would be happy to draft a spec
> > > clarification on this point
> > >
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/issues/3680
> > > [2] https://github.com/apache/arrow-rs/pull/4943
> > >
> > > On Fri, May 10, 2024 at 1:15 PM Jan Finis <[email protected]> wrote:
> > >
> > > > Hey Parquet devs,
> > > >
> > > > I so far thought that Parquet mandates that records start at page
> > > > boundaries, i.e., at r-level 0, and we have relied on this fact in some
> > > > places of our engine. That means, there cannot be any data page for a
> > > > REPEATED column that starts at an r-level > 0, as this would mean that a
> > > > record would be split between multiple pages.
> > > >
> > > > I also found the two comments in parquet.thrift:
> > > >
> > > > > /** Number of rows in this data page. which means pages change on record
> > > > > boundaries (r = 0) **/
> > > > > 3: required i32 num_rows
> > > > >
> > > > > /**
> > > > >  * Index within the RowGroup of the first row of the page; this means pages
> > > > >  * change on record boundaries (r = 0).
> > > > >  */
> > > > > 3: required i64 first_row_index
> > > >
> > > > These comments seem to imply that my understanding is correct. However,
> > > > they are worded very weakly, not like a mandate but more like a "by the
> > > > way" comment.
> > > >
> > > > I haven't found any other mention of r-levels and page boundaries in the
> > > > parquet-format repo (maybe I missed them?).
> > > >
> > > > I recently noticed that pyarrow.parquet splits repeated fields over
> > > > multiple pages, so it violates this. This triggers assertions in our
> > > > engine, so I want to understand what's the right course of action here.
> > > >
> > > > So, can we please clarify:
> > > > *Does Parquet mandate that pages need to start at r-level 0?*
> > > >
> > > > - I.e., is a parquet file with a page that starts at an r-level > 0 ill
> > > > formed? I.e., is this a bug in pyarrow.parquet?
> > > > - Or can pages start at r-level > 0? If so, then what is the significance
> > > > of the comments in parquet.thrift?
> > > >
> > > > Cheers,
> > > > Jan

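To make the property under discussion concrete, here is a minimal sketch in plain Python. It does not use pyarrow or any real reader API: the function names are hypothetical, and it assumes the repetition levels of each page have already been decoded into lists of integers. It flags pages whose first r-level is > 0 and counts how many records start in each page, which is the only thing a "num_rows"-style field can express.

```python
from typing import List, Sequence


def pages_starting_mid_record(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of pages whose first repetition level is > 0.

    Such a page begins in the middle of a record that started on an
    earlier page. (Hypothetical helper, for illustration only.)
    """
    return [i for i, levels in enumerate(pages) if levels and levels[0] > 0]


def records_started_per_page(pages: Sequence[Sequence[int]]) -> List[int]:
    """Count the values with r-level 0 in each page, i.e. the records that start there."""
    return [sum(1 for r in levels if r == 0) for levels in pages]


# Assumed example: one REPEATED column holding the records [1, 2, 3], [4], [5, 6].
# Its repetition levels are 0 1 1 | 0 | 0 1.
aligned = [[0, 1, 1], [0, 0, 1]]   # every page starts at r-level 0
split = [[0, 1], [1, 0, 0, 1]]     # second page starts at r-level 1

print(pages_starting_mid_record(aligned))
print(pages_starting_mid_record(split))
print(records_started_per_page(split))
```

In the `split` layout the second page starts two records but also carries the tail of the first one, so a per-page row count or first-row index no longer tells a reader where that first record ends. The row-group variant of the question is whether the first page of a row group may fail the same check.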