IMO, when Page V2 is used or PageIndex is enabled, the page boundaries
should be checked [1].

[1]
https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
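
To illustrate what "checking the boundaries" could mean, here is a minimal
sketch in Python. It is not the Arrow implementation linked in [1]; the
function names and the max_values_per_page parameter are hypothetical and
exist only for this example. It assumes a page is modelled as a plain list
of repetition levels, and it only closes a page when the next value has
r-level 0, so no record is split across pages.

```python
from typing import List, Sequence


def split_on_record_boundaries(
    rep_levels: Sequence[int], max_values_per_page: int
) -> List[List[int]]:
    """Split repetition levels into pages that each start at r-level 0.

    Hypothetical helper: a page may exceed max_values_per_page when a
    single record is longer than the limit, because a page is only
    closed right before a value with r-level 0.
    """
    if max_values_per_page < 1:
        raise ValueError("max_values_per_page must be at least 1")
    pages: List[List[int]] = []
    current: List[int] = []
    for level in rep_levels:
        # r-level 0 marks the start of a new record; this is the only
        # position where the current page is allowed to end.
        if level == 0 and len(current) >= max_values_per_page:
            pages.append(current)
            current = []
        current.append(level)
    if current:
        pages.append(current)
    return pages


def pages_start_at_record_boundaries(pages: Sequence[Sequence[int]]) -> bool:
    """Return True if every non-empty page begins with r-level 0."""
    return all(page[0] == 0 for page in pages if page)


# Example input: four records with 3, 2, 1 and 4 values respectively.
example_levels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
aligned_pages = split_on_record_boundaries(example_levels, 3)
# A naive fixed-size split ignores r-levels and can cut a record in two.
naive_pages = [example_levels[i:i + 3] for i in range(0, len(example_levels), 3)]
```

With this input, the fixed-size split puts the tail of the last record on
its own page, which is the situation Jan describes below, while the
boundary-aware split keeps every record on a single page.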


On Sat, May 11, 2024 at 01:15, Jan Finis <[email protected]> wrote:

> Hey Parquet devs,
>
> I so far thought that Parquet mandates that records start at page
> boundaries, i.e., at r-level 0, and we have relied on this fact in some
> places of our engine. That means, there cannot be any data page for a
> REPEATED column that starts at an r-level > 0, as this would mean that a
> record would be split between multiple pages.
>
> I also found the two comments in parquet.thrift:
>
>   /** Number of rows in this data page. which means pages change on record
>       boundaries (r = 0) **/
>   3: required i32 num_rows
>
>
>   /**
>    * Index within the RowGroup of the first row of the page; this means
>    * pages change on record boundaries (r = 0).
>    */
>   3: required i64 first_row_index
>
>
> These comments seem to imply that my understanding is correct. However,
> they are worded very weakly, not like a mandate but more like a "by the
> way" comment.
>
> I haven't found any other mention of r-levels and page boundaries in the
> parquet-format repo (maybe I missed them?).
>
> I recently noticed that pyarrow.parquet splits repeated fields over
> multiple pages, so it violates this. This triggers assertions in our
> engine, so I want to understand what's the right course of action here.
>
> So, can we please clarify:
> *Does Parquet mandate that pages need to start at r-level 0?*
>
>    - I.e., is a parquet file with a page that starts at an r-level > 0 ill
>    formed? I.e., is this a bug in pyarrow.parquet?
>    - Or can pages start at an r-level > 0? If so, then what is the significance
>    of the comments in parquet.thrift?
>
>
> Cheers,
> Jan
>
