Re: Current status of Data Page V2?

Micah Kornfield Fri, 09 Oct 2020 14:25:34 -0700

>
> The big win in v2 pages (if I remember correctly) is that the variable
> length encoding is no longer interleaved. That would provide a big
> performance lift when pulling into arrow vectors (and variable length
> decoding typically dominates total read processing time, on average I've
> seen 5-10x per cell cpu cost increase for variable reads over scalar
> reads). AFAIK, there is still no option for that in V1.



The documentation [1] for delta-length byte array states "This encoding is
always preferred over PLAIN for byte array columns." That makes it sound like
this could be part of V1 or V2. The encoding enum in the Thrift file [2]
doesn't seem to document this either.
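
To make the interleaving point concrete, here is a rough Python sketch of the
two layouts. It is an illustration under assumptions, not the parquet-mr or
Arrow implementation: PLAIN byte arrays store a 4-byte little-endian length in
front of every value, while DELTA_LENGTH_BYTE_ARRAY stores all lengths first
and then the concatenated bytes. The function names are made up for this
example, and the lengths block is written as plain int32 values instead of
DELTA_BINARY_PACKED to keep it short.

```python
import struct
from itertools import accumulate


def encode_plain(values):
    """PLAIN BYTE_ARRAY layout: <len0><bytes0><len1><bytes1>... (interleaved)."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def decode_plain(buf, count):
    """Each length must be read before the next value can be located."""
    offsets, data, pos = [0], bytearray(), 0
    for _ in range(count):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        data += buf[pos:pos + n]  # one small copy per value
        pos += n
        offsets.append(len(data))
    return offsets, bytes(data)


def encode_split(values):
    """Simplified stand-in for DELTA_LENGTH_BYTE_ARRAY: lengths, then bytes.

    Assumption: lengths are stored as plain little-endian int32 here; the
    real encoding uses DELTA_BINARY_PACKED for the lengths block.
    """
    lengths = struct.pack(f"<{len(values)}I", *(len(v) for v in values))
    return lengths + b"".join(values)


def decode_split(buf, count):
    """Lengths decode in one pass; the data block is already contiguous."""
    lengths = struct.unpack_from(f"<{count}I", buf, 0)
    offsets = [0, *accumulate(lengths)]  # Arrow-style offsets buffer
    data = buf[4 * count:]               # single slice, no per-value copy
    return offsets, data


values = [b"Hello", b"World", b"", b"Foobar"]
plain_result = decode_plain(encode_plain(values), len(values))
split_result = decode_split(encode_split(values), len(values))
```

Both decoders produce the same offsets and data buffers, which is the shape an
Arrow variable-length vector wants. The difference is that the interleaved
layout forces a per-value walk, whereas the split layout reduces to a prefix
sum over the lengths plus one contiguous slice.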

Is there clearer documentation for what encodings are considered part of v1
vs v2?

Thanks,
Micah


[1]
https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
[2]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L407

On Fri, Oct 9, 2020 at 11:37 AM Jacques Nadeau <jacq...@apache.org> wrote:

> The big win in v2 pages (if I remember correctly) is that the variable
> length encoding is no longer interleaved. That would provide a big
> performance lift when pulling into arrow vectors (and variable length
> decoding typically dominates total read processing time, on average I've
> seen 5-10x per cell cpu cost increase for variable reads over scalar
> reads). AFAIK, there is still no option for that in V1.
>
> On Thu, Oct 8, 2020 at 12:59 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks for the quick reply Ryan.
>>
>>
>> > We only use v1 and it still works well. That said, I'd love to make some
>> > progress on better encodings and finalizing v2 so we can use them!
>>
>>
>> Are there JIRAs or other documentation that is tracking this work?
>>
>> Thanks,
>> Micah
>>
>>
>> On Thu, Oct 8, 2020 at 12:55 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>> > While there isn't anything wrong with it, the same challenges have been
>> > solved in different ways with v1 pages. The main difference is that v2
>> > pages are broken at record boundaries, and v1 pages weren't guaranteed
>> to
>> > be. But, in order to write page indexes near the footer, breaking pages
>> at
>> > record boundaries is required. So you know if you have page indexes, you
>> > can actually use them to skip through pages safely. That removes much of
>> > the need for v2 pages.
>> >
>> > The main drawback to using v2 pages is that the v2 spec is unfinished,
>> and
>> > I don't think there is a way to use just the new pages. So you'd
>> possibly
>> > end up pulling in other beta features that probably shouldn't be used if
>> > you want to stick with what is required for compatibility across
>> > implementations.
>> >
>> > We only use v1 and it still works well. That said, I'd love to make some
>> > progress on better encodings and finalizing v2 so we can use them!
>> >
>> > On Thu, Oct 8, 2020 at 12:44 PM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> >> What is the current status of support for Data Page V2?  Is it
>> recommended
>> >> for production workloads?
>> >>
>> >> Thanks,
>> >> Micah
>> >>
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>>
>
