Re: [Parquet] ALP Encoding for Floating point data

PRATEEK GAUR Sat, 25 Apr 2026 13:09:12 -0700

Thanks Andrew and Micah.

`fair amount of feedback on at least the implementations`
For the C++ implementation, I have already started addressing the feedback
and should be done with that by Monday or Tuesday.
I think Vinoo has also been making good progress on the Java implementation.


Best
Prateek

On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]> wrote:

> Got it. Thank you for the clarification -- I will try to look into the
> spec and the Rust implementation[1] this coming week
>
> [1]: https://github.com/apache/arrow-rs/pull/9372
>
> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
> wrote:
>
>> Hi Andrew,
>> I think there is a fair amount of feedback on at least the
>> implementations. Typically, I think we've waited until they are close to
>> mergeable before a final vote.  Otherwise I agree we are very close.
>>
>> -Micah
>>
>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]> wrote:
>>
>>> Thanks Prateek,
>>>
>>> From this content, it looks to me like we are ready to start a
>>> vote to explicitly accept ALP into Parquet.
>>>
>>> Does anyone know of a reason we should postpone it for longer?
>>> Perhaps someone needs some more time to review?
>>>
>>> Andrew
>>>
>>>
>>>
>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]> wrote:
>>>
>>>> Hi team,
>>>>
>>>>
>>>>
>>>> Hope everyone is doing well. I got a chance to work through all the
>>>> remaining feedback and update the spec doc. Here are the new artifacts:
>>>>
>>>> 1) Spec document :
>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
>>>>
>>>> 2) Spec document in parquet format repo :
>>>> https://github.com/apache/parquet-format/pull/557
>>>>
>>>> 3) ALP implementation in arrow c++ repo :
>>>> https://github.com/apache/arrow/pull/48345/changes
>>>>
>>>> 4) ALP implementation in parquet-java repo : Work by Vinoo and Julien
>>>> https://github.com/apache/parquet-java/pull/3397
>>>>
>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
>>>> https://github.com/apache/parquet-testing/pull/100
>>>>
>>>>
>>>> And
>>>>
>>>>
>>>>    - Go : Arnav just submitted an in-progress implementation in Go.
>>>>    https://github.com/apache/arrow-go/pull/704 (I haven't started
>>>>    looking at it yet)
>>>>    - Rust : I remember Andrew mentioned that this work is also in
>>>>    progress (So 4 languages!)
>>>>
>>>>
>>>> *Arrow C++ implementation*
>>>>
>>>>
>>>>
>>>> The PR is out and was also used by Antoine to report the numbers as
>>>> reported here. Micah and Konstantin have given one round of feedback, and
>>>> I'm addressing it today. Please note that the default optimization
>>>> flag for compiling is O2 and not O3. I got around 70% performance
>>>> improvement in the decoding speed when using the O3 flag.
>>>>
>>>>
>>>>
>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and Cross
>>>> Language testing*
>>>>
>>>>
>>>>    Let me know if you have any questions or feedback.
>>>>
>>>>
>>>>
>>>> Now pasting some performance numbers
>>>>
>>>>
>>>>   Table 1: C++ ALP Double Decode, Spotify Columns (Graviton 3, ARM Neoverse V1)
>>>>
>>>>   ┌──────────────────┬──────────────┬──────────────┬─────────┐
>>>>   │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
>>>>   ├──────────────────┼──────────────┼──────────────┼─────────┤
>>>>   │ valence          │     3,155    │     5,523    │  1.75x  │
>>>>   │ danceability     │     3,233    │     5,685    │  1.76x  │
>>>>   │ energy           │     3,197    │     5,652    │  1.77x  │
>>>>   │ loudness         │     3,186    │     5,473    │  1.72x  │
>>>>   └──────────────────┴──────────────┴──────────────┴─────────┘
>>>>
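As a quick sanity check on Table 1, the following minimal sketch recomputes the Speedup column from the -O2 and -O3 throughput figures and expresses each as a percent improvement. The helper names (`speedup`, `percent_improvement`) and the `TABLE_1` list are illustrative only and are not part of any ALP implementation; the numbers are copied from the table as posted.

```python
# Throughput figures (MB/s) copied from Table 1: (column, -O2, -O3).
TABLE_1 = [
    ("valence", 3155, 5523),
    ("danceability", 3233, 5685),
    ("energy", 3197, 5652),
    ("loudness", 3186, 5473),
]


def speedup(o2_mbps, o3_mbps):
    """Return the ratio of -O3 throughput to -O2 throughput."""
    return o3_mbps / o2_mbps


def percent_improvement(o2_mbps, o3_mbps):
    """Return the -O3 gain over -O2 as a percentage."""
    return (speedup(o2_mbps, o3_mbps) - 1.0) * 100.0


if __name__ == "__main__":
    for name, o2, o3 in TABLE_1:
        ratio = speedup(o2, o3)
        gain = percent_improvement(o2, o3)
        print(f"{name:<14} {ratio:.2f}x  (+{gain:.0f}%)")
```

Under these figures every column lands between roughly 72% and 77% improvement, which is consistent with the "around 70%" quoted in the message.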
>>>>
>>>>
>>>>
>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
>>>> wrote:
>>>>
>>>>> @Micah Kornfield <[email protected]> : Got it.
>>>>>
>>>>> @Andrew Lamb <[email protected]>
>>>>>
>>>>>
>>>>>> Do you think it would be good to start moving the spec development
>>>>>> into
>>>>>> markdown format, in preparation for finalizing it?
>>>>>>
>>>>>
>>>>> Yes I'll update the numbers for some of the examples I have in the
>>>>> spec based
>>>>> on the updated header size. Then we should be good to go for the
>>>>> markdown format.
>>>>>
>>>>> Thanks everyone!
>>>>>
>>>>>
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> > Hi team,
>>>>>> >
>>>>>> > 1) Andrew
>>>>>> >
>>>>>> >    - Thanks for working on test files. My PR did add all the test
>>>>>> files I
>>>>>> >    used to benchmark on datasets. Maybe we can club it together.
>>>>>> Will also
>>>>>> > aid
>>>>>> >    cross language testing
>>>>>> >    -  Kosta Tarasov working on Rust implementation. This is great.
>>>>>> Thanks
>>>>>> >
>>>>>> >
>>>>>> > 2) Antoine
>>>>>> >
>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you
>>>>>> are
>>>>>> >    getting 8X the decoding performance of BSS. This is amazing!!
>>>>>> >    - Thanks for acknowledging the sampling design.
>>>>>> >    - I agree with you on Fastlanes. In some crude experiments I
>>>>>> didn't get
>>>>>> >    a good perf benefit from it on Graviton3 (but maybe there was
>>>>>> something
>>>>>> >    wrong with my implementation).
>>>>>> >    - Locking the 16bit exception encoding for the spec in this case.
>>>>>> >    - Awesome I think we have solved for all open questions minus the
>>>>>> >    version byte :). (will get back on this soon)
>>>>>> >
>>>>>> >
>>>>>> > 3) Micah
>>>>>> >
>>>>>> >    - FastLanes : The current spec does allow for using FastLanes
>>>>>> with the
>>>>>> >    configurable enum value for layout. We should be able to inject
>>>>>> any
>>>>>> > layout
>>>>>> >    in the current design.
>>>>>> >
>>>>>> >
>>>>>> > Working on resolving all remaining open comments on the spec this
>>>>>> week.
>>>>>> >
>>>>>> > Best
>>>>>> > Prateek
>>>>>> >
>>>>>> >
>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]
>>>>>> >
>>>>>> > wrote:
>>>>>> >
>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <
>>>>>> [email protected]>
>>>>>> > > wrote:
>>>>>> > >
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > It looks like the actual issue described for ORC in the paper
>>>>>> is that
>>>>>> > it
>>>>>> > > > has multiple sub-encodings in a batch.  This is different than
>>>>>> the
>>>>>> > design
>>>>>> > > > proposed here where there is still fixed encoding per page in
>>>>>> parquet.
>>>>>> > > > Given reasonably sized pages I don't think branch misprediction
>>>>>> should
>>>>>> > > be a
>>>>>> > > > big issue for new encodings.  I agree that we should be
>>>>>> conservative in
>>>>>> > > > general for adding new encodings.
>>>>>> > > >
>>>>>> > > >
>>>>>> > > +1
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>
