I've created https://github.com/apache/parquet-format/pull/163 to try to document these (note I really don't have historical context here so please review carefully).
I would appreciate it if someone could point me to a reference on the
current status of V2. What is left unsettled? When can we start
recommending it for production use?

Thanks,
Micah

On Tue, Oct 13, 2020 at 9:23 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

>> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>
> +1 to the confusion part. The reason why I originally started this thread
> is that none of this is entirely clear to me from existing documentation.
>
> In particular it is confusing to me to say that the V2 Spec is not yet
> finished when it looks like there have been multiple V2 Format releases.
>
> It would be extremely useful to have documentation relating features to:
> 1. The version of the spec they are part of
> 2. Their current status in reference implementations
>
> Thanks,
> Micah
>
> On Tue, Oct 13, 2020 at 1:51 AM Gabor Szadovszky <ga...@apache.org> wrote:
>
>> I am not sure 2.0 means the v2 pages here. I think there was/is a bit of
>> confusion between the v1/v2 pages and the parquet-mr releases. Maybe the
>> parquet-format releases are also part of it.
>> In this table many features are not related to the pages so I don't think
>> the "Expected release" meant the v1/v2 pages. I guess there was an earlier
>> plan to release parquet-mr 2.0 with the v2 pages but then v2 pages were
>> released in a 1.x release while 2.0 is not planned yet. (I was not in the
>> community at that time so I'm only guessing.)
>>
>> It is also worth mentioning that it seems to be not related to the
>> parquet-format releases, which means that based on the spec the
>> implementations were/are not limited by this table.
>>
>> On Mon, Oct 12, 2020 at 6:53 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>> > I remembered that there used to be a table.
>> > Looks like it was removed:
>> >
>> > https://github.com/apache/parquet-mr/commit/aed9097640c7adffe1151b32e86b5efc3702c657?short_path=b335630#diff-04c6e90faac2675aa89e2176d2eec7d8
>> >
>> > The table used to list delta as a 2.0 feature.
>> >
>> > On Mon, Oct 12, 2020 at 1:38 AM Gabor Szadovszky <ga...@apache.org> wrote:
>> >
>> > > That answer I wrote to the other thread was based on the current
>> > > code. So that is how parquet-mr is working now. It does not,
>> > > however, say how it should work or how it works in other
>> > > implementations. Unfortunately, the spec does not say anything about
>> > > v1 and v2 in the context of encodings.
>> > > Meanwhile, enabling the "new" encodings in v1 may generate
>> > > compatibility issues with other implementations. (I am not sure how
>> > > the existing releases of parquet-mr would behave if they had to read
>> > > v1 pages with these encodings, but I believe it would work fine.)
>> > >
>> > > I think it would be a good idea to keep the existing default
>> > > behavior as is but introduce some new flags where the user may
>> > > set/suggest encodings for the different columns. This way the user
>> > > can accept the risk of being potentially incompatible with other
>> > > implementations (for the time being) and also can fine tune the
>> > > encodings for the data. This way we can also introduce some new
>> > > encodings that are better in some cases (e.g. lossy compression for
>> > > floating point numbers).
>> > >
>> > > What do you guys think?
>> > > (I would be happy to help anyone who would like to contribute on
>> > > this topic.)
>> > >
>> > > Cheers,
>> > > Gabor
>> > >
>> > > On Sat, Oct 10, 2020 at 2:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
>> > >
>> > > > Gabor seems to agree that delta is V2 only.
>> > > >
>> > > > > To summarize, no delta encodings are used for V1 pages. They are
>> > > > > available for V2 only.
>> > > >
>> > > > https://www.mail-archive.com/dev@parquet.apache.org/msg11826.html
>> > > >
>> > > > On Fri, Oct 9, 2020 at 5:06 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> > > >
>> > > > > Good point. I had mentally categorized this as V2, not based on
>> > > > > the docs?
>> > > > >
>> > > > > I don't think most tools write this but I can't see anywhere
>> > > > > that it says it is limited to v2 readers/writers. I'm not sure
>> > > > > how many tools vectorize read it versus delegate to the legacy
>> > > > > mr path (at least in java), either.
>> > > > >
>> > > > > On Fri, Oct 9, 2020 at 2:25 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >
>> > > > >>> The big win in v2 pages (if I remember correctly) is that the
>> > > > >>> variable length encoding is no longer interleaved. That would
>> > > > >>> provide a big performance lift when pulling into arrow vectors
>> > > > >>> (and variable length decoding typically dominates total read
>> > > > >>> processing time, on average I've seen 5-10x per cell cpu cost
>> > > > >>> increase for variable reads over scalar reads). AFAIK, there
>> > > > >>> is still no option for that in V1.
>> > > > >>
>> > > > >> For Delta-length byte the documentation [1] states "This
>> > > > >> encoding is always preferred over PLAIN for byte array
>> > > > >> columns.", which makes it sound like this could be part of V1
>> > > > >> or V2. The encoding enum in the Thrift file [2] doesn't seem to
>> > > > >> document this either.
>> > > > >>
>> > > > >> Is there clearer documentation for what encodings are
>> > > > >> considered part of v1 vs v2?
>> > > > >>
>> > > > >> Thanks,
>> > > > >> Micah
>> > > > >>
>> > > > >> [1] https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
>> > > > >> [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407
>> > > > >>
>> > > > >> On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <jacq...@apache.org> wrote:
>> > > > >>
>> > > > >>> The big win in v2 pages (if I remember correctly) is that the
>> > > > >>> variable length encoding is no longer interleaved. That would
>> > > > >>> provide a big performance lift when pulling into arrow vectors
>> > > > >>> (and variable length decoding typically dominates total read
>> > > > >>> processing time, on average I've seen 5-10x per cell cpu cost
>> > > > >>> increase for variable reads over scalar reads). AFAIK, there
>> > > > >>> is still no option for that in V1.
>> > > > >>>
>> > > > >>> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >>>
>> > > > >>>> Thanks for the quick reply Ryan.
>> > > > >>>>
>> > > > >>>> > We only use v1 and it still works well. That said, I'd love
>> > > > >>>> > to make some progress on better encodings and finalizing v2
>> > > > >>>> > so we can use them!
>> > > > >>>>
>> > > > >>>> Are there JIRAs or other documentation that is tracking this
>> > > > >>>> work?
>> > > > >>>>
>> > > > >>>> Thanks,
>> > > > >>>> Micah
>> > > > >>>>
>> > > > >>>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <rb...@netflix.com> wrote:
>> > > > >>>>
>> > > > >>>> > While there isn't anything wrong with it, the same
>> > > > >>>> > challenges have been solved in different ways with v1 pages.
>> > > > >>>> > The main difference is that v2 pages are broken at record
>> > > > >>>> > boundaries, and v1 pages weren't guaranteed to be. But, in
>> > > > >>>> > order to write page indexes near the footer, breaking pages
>> > > > >>>> > at record boundaries is required. So you know if you have
>> > > > >>>> > page indexes, you can actually use them to skip through
>> > > > >>>> > pages safely. That removes much of the need for v2 pages.
>> > > > >>>> >
>> > > > >>>> > The main drawback to using v2 pages is that the v2 spec is
>> > > > >>>> > unfinished, and I don't think there is a way to use just
>> > > > >>>> > the new pages. So you'd possibly end up pulling in other
>> > > > >>>> > beta features that probably shouldn't be used if you want
>> > > > >>>> > to stick with what is required for compatibility across
>> > > > >>>> > implementations.
>> > > > >>>> >
>> > > > >>>> > We only use v1 and it still works well. That said, I'd love
>> > > > >>>> > to make some progress on better encodings and finalizing v2
>> > > > >>>> > so we can use them!
>> > > > >>>> >
>> > > > >>>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >>>> >
>> > > > >>>> >> What is the current status of support for Data Page V2? Is
>> > > > >>>> >> it recommended for production workloads?
>> > > > >>>> >>
>> > > > >>>> >> Thanks,
>> > > > >>>> >> Micah
>> > > > >>>> >
>> > > > >>>> > --
>> > > > >>>> > Ryan Blue
>> > > > >>>> > Software Engineer
>> > > > >>>> > Netflix
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix