Hey Parquet devs,

Until now I assumed that Parquet mandates that pages start at record
boundaries, i.e., at r-level 0, and we have relied on this in some places
of our engine. In other words, no data page for a REPEATED column may
start at an r-level > 0, as this would mean that a record is split across
multiple pages.
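
To make the assumption concrete, here is a minimal sketch of the kind of
check I mean. It is plain Python, not any Parquet library API: representing
each page as a list of its repetition levels, and the function name
pages_starting_mid_record, are assumptions made purely for illustration.

```python
from typing import List, Sequence


def pages_starting_mid_record(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of pages whose first repetition level is > 0.

    Hypothetical helper: each page is modeled as the list of repetition
    levels of its values, in order. A first r-level > 0 means the page
    continues a record that began on an earlier page.
    """
    offenders = []
    for index, rep_levels in enumerate(pages):
        # An empty page holds no values, so it cannot split a record.
        if rep_levels and rep_levels[0] > 0:
            offenders.append(index)
    return offenders


# Three records [a, b], [c], [d, e, f] of a REPEATED column have the
# repetition levels 0 1 | 0 | 0 1 1.
aligned = [[0, 1, 0], [0, 1, 1]]   # every page starts at r = 0
split = [[0, 1, 0, 0], [1, 1]]     # second page starts inside record 3

print(pages_starting_mid_record(aligned))
print(pages_starting_mid_record(split))
```

Under my reading of the format, the second layout would be ill-formed,
because its second page begins at r-level 1.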

I also found these two comments in parquet.thrift:

  /** Number of rows in this data page. which means pages change on record
      boundaries (r = 0) **/
  3: required i32 num_rows


  /**
   * Index within the RowGroup of the first row of the page; this means pages
   * change on record boundaries (r = 0).
   */
  3: required i64 first_row_index


These comments seem to imply that my understanding is correct. However,
they are worded very weakly, not like a mandate but more like a "by the
way" comment.

I haven't found any other mention of r-levels and page boundaries in the
parquet-format repo (maybe I missed them?).

I recently noticed that pyarrow.parquet splits repeated fields over
multiple pages, so it violates this assumption. This triggers assertions
in our engine, so I want to understand what the right course of action is.

So, can we please clarify:
*Does Parquet mandate that pages need to start at r-level 0?*

   - I.e., is a Parquet file with a page that starts at an r-level > 0
   ill-formed, and is this therefore a bug in pyarrow.parquet?
   - Or can pages start at an r-level > 0? If so, what is the significance
   of the comments in parquet.thrift?


Cheers,
Jan
