Re: [Parquet] ALP Encoding for Floating point data

PRATEEK GAUR Wed, 22 Apr 2026 10:00:45 -0700

Hi team,



Hope everyone is doing well. I got a chance to work through all the
remaining feedback and update the spec doc. Here are the new artifacts:

1) Spec document :
https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit

2) Spec document in parquet format repo :
https://github.com/apache/parquet-format/pull/557

3) ALP implementation in arrow c++ repo :
https://github.com/apache/arrow/pull/48345/changes

4) ALP implementation in parquet-java repo (work with Vinoo and Julien) :
https://github.com/apache/parquet-java/pull/3397

5) PR with test and benchmarking artifacts in parquet-testing repo :
https://github.com/apache/parquet-testing/pull/100


And


   - Go : Arnav just submitted an in-progress implementation in Go.
   https://github.com/apache/arrow-go/pull/704 (I haven't started looking
   at it yet)
   - Rust : I remember Andrew mentioning that this work is also in progress
   (so 4 languages!)


*Arrow C++ implementation*



The PR is out, and Antoine also used it to produce the numbers he reported.
Micah and Konstantin have given one round of feedback, which I'm addressing
today. Please note that the default optimization flag for compiling is -O2
and not -O3. I got around a 70% improvement in decoding speed when using
the -O3 flag.



*Parquet-MR Java implementation (working with Vinoo and Julien) and
cross-language testing*


   Let me know if you have any questions or feedback.



Here are some performance numbers:


  Table 1: C++ ALP Double Decode, Spotify Columns (Graviton 3, ARM Neoverse V1)

  ┌──────────────────┬──────────────┬──────────────┬─────────┐
  │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
  ├──────────────────┼──────────────┼──────────────┼─────────┤
  │ valence          │     3,155    │     5,523    │  1.75x  │
  │ danceability     │     3,233    │     5,685    │  1.76x  │
  │ energy           │     3,197    │     5,652    │  1.77x  │
  │ loudness         │     3,186    │     5,473    │  1.72x  │
  └──────────────────┴──────────────┴──────────────┴─────────┘
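
For anyone who wants to sanity-check the Speedup column, here is a minimal
Python sketch that recomputes it from the two throughput columns of Table 1.
The helper names (speedup, improvement_percent) are made up for this
illustration and are not part of any of the PRs; the only inputs are the
MB/s figures copied from the table.

```python
# Decode throughput in MB/s copied from Table 1: (column, -O2, -O3).
TABLE_1 = [
    ("valence", 3155, 5523),
    ("danceability", 3233, 5685),
    ("energy", 3197, 5652),
    ("loudness", 3186, 5473),
]


def speedup(o2_mbps, o3_mbps):
    """Return the -O3 / -O2 throughput ratio (hypothetical helper)."""
    return o3_mbps / o2_mbps


def improvement_percent(o2_mbps, o3_mbps):
    """Return the relative gain of -O3 over -O2 as a percentage."""
    return (o3_mbps / o2_mbps - 1.0) * 100.0


if __name__ == "__main__":
    for column, o2, o3 in TABLE_1:
        ratio = speedup(o2, o3)
        gain = improvement_percent(o2, o3)
        print(f"{column:<14}{ratio:.2f}x  (+{gain:.0f}%)")
```

Rounded to two decimals, the ratios agree with the Speedup column, and every
column gains between 70% and 80%.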




On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]> wrote:

> @Micah Kornfield <[email protected]> : Got it.
>
> @Andrew Lamb <[email protected]>
>
>
>> Do you think it would be good to start moving the spec development into
>> markdown format, in preparation for finalizing it?
>>
>
> Yes, I'll update the numbers for some of the examples I have in the spec
> based on the updated header size. Then we should be good to go for the
> markdown format.
>
> Thanks everyone!
>
>
>>
>> Andrew
>>
>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]> wrote:
>>
>> > Hi team,
>> >
>> > 1) Andrew
>> >
>> >    - Thanks for working on test files. My PR did add all the test files I
>> >    used to benchmark on datasets. Maybe we can club them together. It will
>> >    also aid cross-language testing.
>> >    - Kosta Tarasov is working on the Rust implementation. This is great.
>> >    Thanks!
>> >
>> >
>> > 2) Antoine
>> >
>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you are
>> >    getting 8X the decoding performance of BSS. This is amazing!
>> >    - Thanks for acknowledging the sampling design.
>> >    - I agree with you on FastLanes. In some crude experiments I didn't get
>> >    a good perf benefit from it on Graviton3 (but maybe there was something
>> >    wrong with my implementation).
>> >    - Locking the 16-bit exception encoding for the spec in this case.
>> >    - Awesome, I think we have solved all open questions except the
>> >    version byte :). (Will get back on this soon.)
>> >
>> >
>> > 3) Micah
>> >
>> >    - FastLanes : The current spec does allow for using FastLanes with the
>> >    configurable enum value for layout. We should be able to inject any
>> >    layout in the current design.
>> >
>> >
>> > Working on resolving all remaining open comments on the spec this week.
>> >
>> > Best
>> > Prateek
>> >
>> >
>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
>> > wrote:
>> >
>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
>> > > wrote:
>> > >
>> > > >
>> > > >
>> > > > It looks like the actual issue described for ORC in the paper is that
>> > > > it has multiple sub-encodings in a batch.  This is different than the
>> > > > design proposed here, where there is still a fixed encoding per page in
>> > > > Parquet. Given reasonably sized pages I don't think branch
>> > > > misprediction should be a big issue for new encodings.  I agree that
>> > > > we should be conservative in general for adding new encodings.
>> > > >
>> > > >
>> > > +1
>> > >
>> >
>>
>
