Hey Parquet devs,

Until now I thought that Parquet mandates that pages start at record boundaries, i.e., at r-level 0, and we have relied on this in some places of our engine. That would mean there cannot be a data page for a REPEATED column that starts at an r-level > 0, as such a page would split a record across multiple pages.
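To make the invariant concrete, here is a minimal sketch of the check I have in mind. It assumes the repetition levels of each data page have already been decoded into plain lists of integers; the function name and the input layout are hypothetical and not taken from any Parquet library or from our engine.

```python
from typing import List, Sequence


def pages_violating_record_boundaries(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return the indices of pages whose first repetition level is > 0.

    `pages` is a hypothetical input: one sequence of already-decoded
    repetition levels per data page of a single column chunk.
    """
    violations = []
    for index, rep_levels in enumerate(pages):
        # An empty page has no first value, so there is nothing to check.
        if rep_levels and rep_levels[0] > 0:
            violations.append(index)
    return violations


# Two records, [1, 2, 3] and [4, 5], in one repeated column.
aligned = [[0, 1, 1], [0, 1]]   # each page starts a new record
split = [[0, 1], [1, 0, 1]]     # the first record spills into page 1

print(pages_violating_record_boundaries(aligned))  # []
print(pages_violating_record_boundaries(split))    # [1]
```

If pages must start at r-level 0, the second layout is ill-formed; if not, a reader has to carry a partially assembled record from one page to the next.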
I also found these two comments in parquet.thrift:

    /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/
    3: required i32 num_rows

    /**
     * Index within the RowGroup of the first row of the page; this means pages
     * change on record boundaries (r = 0).
     */
    3: required i64 first_row_index

These comments seem to imply that my understanding is correct. However, they are worded very weakly, less like a mandate and more like a "by the way" remark. I haven't found any other mention of r-levels and page boundaries in the parquet-format repo (maybe I missed it?).

I recently noticed that pyarrow.parquet splits repeated fields over multiple pages, which violates this assumption. This triggers assertions in our engine, so I want to understand what the right course of action is.

So, can we please clarify: *Does Parquet mandate that pages need to start at r-level 0?*

- I.e., is a Parquet file with a page that starts at an r-level > 0 ill-formed, and is this therefore a bug in pyarrow.parquet?
- Or can pages start at an r-level > 0? If so, what is the significance of the comments in parquet.thrift?

Cheers,
Jan