> I would like to ask if it is valid to create only ColumnIndex but omit
> OffsetIndex? My answer is NO according to [1].

I agree with this interpretation and in fact [1] makes this explicit.

> BTW, it seems totally valid to create page index for a subset of all
> columns. Does it mean columns without page index can have their records
> spanning more than one page?

That would be my interpretation of the current spec as well.

> I'd like to suggest that we recommend writers do not ever split records
> across pages, as frankly it is quite a surprising behavior.

I agree that this would be a good recommendation (likely on the grounds
that not splitting records across pages will maximize compatibility).
Perhaps once we merge the PR that clarifies what the current spec means, I
can create a follow-on PR with proposed recommendation language.

Andrew

[1] https://github.com/apache/parquet-format/pull/245

On Tue, May 21, 2024 at 7:45 AM Raphael Taylor-Davies <[email protected]> wrote:

> I'd like to suggest that we recommend writers do not ever split records
> across pages, as frankly it is quite a surprising behavior. However, as
> this was ambiguous historically, readers should tolerate it in the
> absence of an offset index. This ensures backwards compatibility, whilst
> encouraging writers not to do this, and ensuring that offset indexes can
> be used to prune IO. This is the approach we have taken in parquet-rs [1].
>
> Kind Regards,
>
> Raphael
>
> [1]: https://github.com/apache/arrow-rs/pull/4943
>
> On 21/05/2024 12:31, Gang Wu wrote:
> > BTW, it seems totally valid to create page index for a subset of
> > all columns. Does it mean columns without page index can have
> > their records spanning more than one page?
> >
> > Best,
> > Gang
> >
> > On Tue, May 21, 2024 at 7:26 PM Gang Wu <[email protected]> wrote:
> >
> >> I would like to ask if it is valid to create only ColumnIndex but omit
> >> OffsetIndex?
> >> My answer is NO according to [1]. If agreed, my inclination is option 1.
> >>
> >> [1]
> >> https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L1019-L1022
> >>
> >> On Tue, May 21, 2024 at 6:31 PM wish maple <[email protected]> wrote:
> >>
> >>> I'm +1 on this, "Offset Index", "Page Index", "Column Index or Offset
> >>> Index" all looks good to me.
> >>>
> >>> Best,
> >>> Xuwei Fu
> >>>
> >>> On Tue, May 21, 2024 at 18:07, Andrew Lamb <[email protected]> wrote:
> >>>
> >>>> mapleFU brought up an excellent question[1].
> >>>>
> >>>> Upon further research, a "page index" seems to consist of an OffsetIndex
> >>>> and ColumnIndex, but some writers may only write OffsetIndex (and not
> >>>> ColumnIndex). See discussion on [2]
> >>>>
> >>>> Thus when we say "repeated fields must start at a page boundary if a page
> >>>> index is present OR data-page V2 is present," does that mean:
> >>>> 1. an OffsetIndex is present
> >>>> 2. both an OffsetIndex and ColumnIndex are present
> >>>> 3. Something else
> >>>>
> >>>> It seems to me that since an OffsetIndex is in terms of numbers of records,
> >>>> if it were present that would require repetition_level=0 at page
> >>>> boundaries (aka option 1).
> >>>>
> >>>> Thoughts?
> >>>> Andrew
> >>>>
> >>>> [1] https://github.com/apache/parquet-format/pull/244#discussion_r1607878045
> >>>> [2]: https://github.com/apache/parquet-format/pull/245
> >>>>
> >>>> On Sun, May 19, 2024 at 7:18 AM Andrew Lamb <[email protected]> wrote:
> >>>>
> >>>>> I have created a PR[1] to the spec to try and encode this mailing list
> >>>>> conversation and avoid future confusion. Please have a look and let me
> >>>>> know if it captures it correctly.
> >>>>>
> >>>>> Thanks,
> >>>>> Andrew
> >>>>>
> >>>>> [1]: https://github.com/apache/parquet-format/pull/244
> >>>>>
> >>>>> On Wed, May 15, 2024 at 7:03 PM Julien Le Dem <[email protected]> wrote:
> >>>>>
> >>>>>> +1 The semantics of a row group is that it contains rows and therefore
> >>>>>> starts on R=0
> >>>>>> I generally echo Ed's sentiment here.
> >>>>>>
> >>>>>> On Wed, May 15, 2024 at 8:01 AM Andrew Lamb <[email protected]> wrote:
> >>>>>>
> >>>>>>> Thank you all -- I have filed
> >>>>>>> https://issues.apache.org/jira/browse/PARQUET-2473 to track clarifying
> >>>>>>> the spec and will make a PR shortly
> >>>>>>>
> >>>>>>> On Sun, May 12, 2024 at 12:18 AM wish maple <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> IMO when Page V2 is present or PageIndex is enabled, the boundaries
> >>>>>>>> should be checked [1]
> >>>>>>>>
> >>>>>>>> [1]
> >>>>>>>> https://github.com/apache/arrow/blob/d10ebf055a393c94a693097db1dca08ff86745bd/cpp/src/parquet/column_writer.cc#L1235-L1237
> >>>>>>>>
> >>>>>>>> On Sat, May 11, 2024 at 01:15, Jan Finis <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Parquet devs,
> >>>>>>>>>
> >>>>>>>>> I so far thought that Parquet mandates that records start at page
> >>>>>>>>> boundaries, i.e., at r-level 0, and we have relied on this fact in some
> >>>>>>>>> places of our engine. That means, there cannot be any data page for a
> >>>>>>>>> REPEATED column that starts at an r-level > 0, as this would mean that a
> >>>>>>>>> record would be split between multiple pages.
> >>>>>>>>>
> >>>>>>>>> I also found the two comments in parquet.thrift:
> >>>>>>>>>
> >>>>>>>>>> /** Number of rows in this data page. which means pages change on record
> >>>>>>>>>> boundaries (r = 0) **/
> >>>>>>>>>> 3: required i32 num_rows
> >>>>>>>>>
> >>>>>>>>>> /**
> >>>>>>>>>> * Index within the RowGroup of the first row of the page; this means
> >>>>>>>>>> * pages change on record boundaries (r = 0).
> >>>>>>>>>> */
> >>>>>>>>>> 3: required i64 first_row_index
> >>>>>>>>>
> >>>>>>>>> These comments seem to imply that my understanding is correct. However,
> >>>>>>>>> they are worded very weakly, not like a mandate but more like a "by the
> >>>>>>>>> way" comment.
> >>>>>>>>>
> >>>>>>>>> I haven't found any other mention of r-levels and page boundaries in the
> >>>>>>>>> parquet-format repo (maybe I missed them?).
> >>>>>>>>>
> >>>>>>>>> I recently noticed that pyarrow.parquet splits repeated fields over
> >>>>>>>>> multiple pages, so it violates this. This triggers assertions in our
> >>>>>>>>> engine, so I want to understand what's the right course of action here.
> >>>>>>>>> So, can we please clarify:
> >>>>>>>>> *Does Parquet mandate that pages need to start at r-level 0?*
> >>>>>>>>>
> >>>>>>>>> - I.e., is a parquet file with a page that starts at an r-level > 0
> >>>>>>>>> ill formed? I.e., is this a bug in pyarrow.parquet?
> >>>>>>>>> - Or can pages start at r-level > 0? If so, then what is the
> >>>>>>>>> significance of the comments in parquet.thrift?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Jan
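
For anyone skimming this thread later, the rule under discussion (a data page for a REPEATED column should begin at repetition level 0) can be illustrated with a small, self-contained sketch. This is not code from parquet-format, parquet-rs, or Arrow; the function names and the plain-list representation of repetition levels are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def pages_splitting_records(pages: Sequence[Sequence[int]]) -> List[int]:
    """Return indices of pages whose first repetition level is not 0.

    Each page is modelled (hypothetically) as the list of repetition
    levels of its values. A non-zero first level means the page starts
    in the middle of a record.
    """
    return [i for i, levels in enumerate(pages) if levels and levels[0] != 0]


def paginate_on_record_boundaries(
    rep_levels: Sequence[int], max_values: int
) -> List[List[int]]:
    """Split repetition levels into pages, cutting only where r-level == 0.

    `max_values` is a soft limit: a page may exceed it so that a record
    is never split across two pages.
    """
    pages: List[List[int]] = []
    current: List[int] = []
    for level in rep_levels:
        # Close the current page only at a record boundary (level 0)
        # and only once the soft limit has been reached.
        if level == 0 and len(current) >= max_values:
            pages.append(current)
            current = []
        current.append(level)
    if current:
        pages.append(current)
    return pages


if __name__ == "__main__":
    levels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
    # Naive fixed-size pages ignore record boundaries.
    naive = [levels[i:i + 3] for i in range(0, len(levels), 3)]
    print(pages_splitting_records(naive))
    print(pages_splitting_records(paginate_on_record_boundaries(levels, 3)))
```

A reader following the tolerant approach described above would only treat a non-empty result from the first function as an error when an offset index (or data page V2) is present.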
