Re: [Parquet] ALP Encoding for Floating point data

PRATEEK GAUR Wed, 29 Apr 2026 16:13:13 -0700

Thanks Andrew and Micah for the review feedback on the two PRs:
1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557


I have addressed all comments on the two PRs (unless I missed something).

Best
Prateek

On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:

> Thanks Andrew and Micah.
>
> `fair amount of feedback on at least the implementations`
> For the C++ implementation I have already started addressing the feedback; I should be
> done with that by Monday/Tuesday.
> I think Vinoo too has been making good progress on the Java implementation.
>
> Best
> Prateek
>
> On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
> wrote:
>
>> Got it. Thank you for the clarification -- I will try to look into the
>> spec and the Rust implementation[1] in the next week.
>>
>> [1]: https://github.com/apache/arrow-rs/pull/9372
>>
>> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
>> wrote:
>>
>>> Hi Andrew,
>>> I think there is a fair amount of feedback on at least the
>>> implementations. Typically I think we've waited till they are close to
>>> mergeable before a final vote.  Otherwise I agree we are very close.
>>>
>>> -Micah
>>>
>>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
>>>
>>>> Thanks Prateek,
>>>>
>>>> From this content, it looks to me like we are ready to start a
>>>> vote to explicitly accept ALP into Parquet.
>>>>
>>>> Does anyone know of a reason we should postpone it for longer?
>>>> Perhaps someone needs some more time to review?
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>>
>>>>>
>>>>> Hope everyone is doing well. I got a chance to work through all the
>>>>> remaining feedback and update the spec doc. Here are the new artifacts
>>>>>
>>>>> 1) Spec document :
>>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>>>>>
>>>>> 2) Spec document in parquet format repo :
>>>>> https://github.com/apache/parquet-format/pull/557
>>>>>
>>>>> 3) ALP implementation in arrow c++ repo :
>>>>> https://github.com/apache/arrow/pull/48345/changes
>>>>>
>>>>> 4) ALP implementation in parquet-java repo : Work by Vinoo and Julien
>>>>>  https://github.com/apache/parquet-java/pull/3397
>>>>>
>>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
>>>>> https://github.com/apache/parquet-testing/pull/100
>>>>>
>>>>>
>>>>> And
>>>>>
>>>>>
>>>>>    - Go : Arnav just submitted an in progress implementation in Go.
>>>>>    https://github.com/apache/arrow-go/pull/704 (I haven't started
>>>>>    looking at it yet)
>>>>>    - Rust : I remember Andrew mentioned that this work is also in
>>>>>    progress (So 4 languages!)
>>>>>
>>>>>
>>>>> *Arrow C++ implementation *
>>>>>
>>>>>
>>>>>
>>>>> The PR is out and was also used by Antoine to report the numbers as
>>>>> reported here. Micah and Konstantin have given one round of feedback
>>>>> and I'm addressing it today. Please note that the default
>>>>> optimization flag for compiling is O2 and not O3. I got around 70%
>>>>> performance improvement in the decoding speed when using the O3 flag.
>>>>>
>>>>>
>>>>>
>>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and Cross
>>>>> Language testing*
>>>>>
>>>>>
>>>>>    Let me know if you have any questions or feedback.
>>>>>
>>>>>
>>>>>
>>>>> Now pasting some performance numbers
>>>>>
>>>>>
>>>>>   Table 1: C++ ALP Double Decode, Spotify Columns (Graviton 3, ARM Neoverse V1)
>>>>>
>>>>>   ┌──────────────────┬──────────────┬──────────────┬─────────┐
>>>>>   │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
>>>>>   ├──────────────────┼──────────────┼──────────────┼─────────┤
>>>>>   │ valence          │     3,155    │     5,523    │  1.75x  │
>>>>>   │ danceability     │     3,233    │     5,685    │  1.76x  │
>>>>>   │ energy           │     3,197    │     5,652    │  1.77x  │
>>>>>   │ loudness         │     3,186    │     5,473    │  1.72x  │
>>>>>   └──────────────────┴──────────────┴──────────────┴─────────┘
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> @Micah Kornfield <[email protected]> : Got it.
>>>>>>
>>>>>> @Andrew Lamb <[email protected]>
>>>>>>
>>>>>>
>>>>>>> Do you think it would be good to start moving the spec development into
>>>>>>> markdown format, in preparation for finalizing it?
>>>>>>>
>>>>>>
>>>>>> Yes, I'll update the numbers for some of the examples I have in the spec
>>>>>> based on the updated header size. Then we should be good to go for the
>>>>>> markdown format.
>>>>>>
>>>>>> Thanks everyone!
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> > Hi team,
>>>>>>> >
>>>>>>> > 1) Andrew
>>>>>>> >
>>>>>>> >    - Thanks for working on test files. My PR did add all the test files I
>>>>>>> >    used to benchmark on datasets. Maybe we can club them together. That will
>>>>>>> >    also aid cross-language testing.
>>>>>>> >    - Kosta Tarasov is working on the Rust implementation. This is great. Thanks
>>>>>>> >
>>>>>>> >
>>>>>>> > 2) Antoine
>>>>>>> >
>>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you are
>>>>>>> >    getting 8X the decoding performance of BSS. This is amazing!!
>>>>>>> >    - Thanks for acknowledging the sampling design.
>>>>>>> >    - I agree with you on FastLanes. In some crude experiments I didn't get
>>>>>>> >    a good perf benefit from it on Graviton3 (but maybe there was something
>>>>>>> >    wrong with my implementation).
>>>>>>> >    - Locking the 16-bit exception encoding for the spec in this case.
>>>>>>> >    - Awesome, I think we have resolved all open questions minus the
>>>>>>> >    version byte :). (Will get back on this soon.)
>>>>>>> >
>>>>>>> >
>>>>>>> > 3) Micah
>>>>>>> >
>>>>>>> >    - FastLanes : The current spec does allow for using FastLanes with the
>>>>>>> >    configurable enum value for layout. We should be able to inject any
>>>>>>> >    layout in the current design.
>>>>>>> >
>>>>>>> >
>>>>>>> > Working on resolving all remaining open comments on the spec this week.
>>>>>>> >
>>>>>>> > Best
>>>>>>> > Prateek
>>>>>>> >
>>>>>>> >
>>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <
>>>>>>> [email protected]>
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <
>>>>>>> [email protected]>
>>>>>>> > > wrote:
>>>>>>> > >
>>>>>>> > > >
>>>>>>> > > >
>>>>>>> > > > It looks like the actual issue described for ORC in the paper is that it
>>>>>>> > > > has multiple sub-encodings in a batch.  This is different than the design
>>>>>>> > > > proposed here, where there is still a fixed encoding per page in Parquet.
>>>>>>> > > > Given reasonably sized pages I don't think branch misprediction should be a
>>>>>>> > > > big issue for new encodings.  I agree that we should be conservative in
>>>>>>> > > > general for adding new encodings.
>>>>>>> > > >
>>>>>>> > > >
>>>>>>> > > +1
>>>>>>> > >
>>>>>>> >
>>>>>>>
>>>>>>
