Re: [Parquet] ALP Encoding for Floating point data

Curt Hagenlocher Wed, 29 Apr 2026 16:20:31 -0700

As part of the process of amending the Parquet format, perhaps it would be
a good idea for early implementations to generate sample files and commit
them to apache/parquet-testing
<https://github.com/apache/parquet-testing> for other implementations to
leverage?


-Curt

On Wed, Apr 29, 2026 at 4:11 PM PRATEEK GAUR <[email protected]> wrote:

> Thanks Andrew and Micah for the review feedback on the two PRs:
> 1) (c++ arrow repo) https://github.com/apache/arrow/pull/48345/changes
> 2) (parquet-format repo) https://github.com/apache/parquet-format/pull/557
>
> I have addressed all comments on the two PRs (unless I missed something).
>
> Best
> Prateek
>
> On Sat, Apr 25, 2026 at 1:08 PM PRATEEK GAUR <[email protected]> wrote:
>
> > Thanks Andrew and Micah.
> >
> > `fair amount of feedback on at least the implementations`
> > For the C++ implementation, I have already started addressing the
> > feedback; I should be done with that Monday/Tuesday.
> > I think Vinoo has also been making good progress on the Java
> > implementation.
> >
> > Best
> > Prateek
> >
> > On Sat, Apr 25, 2026 at 12:55 PM Andrew Lamb <[email protected]>
> > wrote:
> >
> >> Got it. Thank you for the clarification -- I will try to look into the
> >> spec and the Rust implementation[1] in the next week.
> >>
> >> [1]: https://github.com/apache/arrow-rs/pull/9372
> >>
> >> On Sat, Apr 25, 2026 at 12:01 PM Micah Kornfield <[email protected]>
> >> wrote:
> >>
> >>> Hi Andrew,
> >>> I think there is a fair amount of feedback on at least the
> >>> implementations. Typically, I think we've waited until they are close
> >>> to mergeable before a final vote.  Otherwise I agree we are very close.
> >>>
> >>> -Micah
> >>>
> >>> On Saturday, April 25, 2026, Andrew Lamb <[email protected]>
> >>> wrote:
> >>>
> >>>> Thanks Prateek,
> >>>>
> >>>> From this content, it looks to me like we are ready to start a
> >>>> vote to explicitly accept ALP into Parquet.
> >>>>
> >>>> Does anyone know of a reason we should postpone it for longer?
> >>>> Perhaps someone needs some more time to review?
> >>>>
> >>>> Andrew
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Apr 22, 2026 at 1:00 PM PRATEEK GAUR <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi team,
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hope everyone is doing well. I got a chance to work through all the
> >>>>> remaining feedback and update the spec doc. Here are the new artifacts:
> >>>>>
> >>>>> 1) Spec document :
> >>>>> https://docs.google.com/document/d/1xz2cudDpN2Y1ImFcTXh15s-3fPtD_aWt/edit
> >>>>>
> >>>>> 2) Spec document in parquet format repo :
> >>>>> https://github.com/apache/parquet-format/pull/557
> >>>>>
> >>>>> 3) Alp implementation in arrow c++ repo :
> >>>>> https://github.com/apache/arrow/pull/48345/changes
> >>>>>
> >>>>> 4) Alp implementation in parquet-java repo : Work for Vinoo and Julien
> >>>>> https://github.com/apache/parquet-java/pull/3397
> >>>>>
> >>>>> 5) PR with test and benchmarking artifacts in parquet-testing repo :
> >>>>> https://github.com/apache/parquet-testing/pull/100
> >>>>>
> >>>>>
> >>>>> And
> >>>>>
> >>>>>
> >>>>>    - Go : Arnav just submitted an in-progress implementation in Go.
> >>>>>    https://github.com/apache/arrow-go/pull/704 (I haven't started
> >>>>>    looking at it yet)
> >>>>>    - Rust : I remember Andrew mentioned that this work is also in
> >>>>>    progress (So 4 languages!)
> >>>>>
> >>>>>
> >>>>> *Arrow C++ implementation*
> >>>>>
> >>>>>
> >>>>>
> >>>>> The PR is out and was also used by Antoine to report the numbers as
> >>>>> reported here. Micah and Konstantin have given one round of feedback
> >>>>> and I'm addressing it today. Please note that the default
> >>>>> optimization flag for compiling is O2 and not O3. I got around 70%
> >>>>> performance improvement in the decoding speed when using the O3 flag.
> >>>>>
> >>>>>
> >>>>>
> >>>>> *Parquet-MR Java implementation (working with Vinoo and Julien) and
> >>>>> Cross Language testing*
> >>>>>
> >>>>>
> >>>>>    Let me know if you have any questions or feedback.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Now pasting some performance numbers
> >>>>>
> >>>>>
> >>>>>   Table 1: C++ ALP Double Decode — Spotify Columns (Graviton 3, ARM
> >>>>> Neoverse V1)
> >>>>>
> >>>>>   ┌──────────────────┬──────────────┬──────────────┬─────────┐
> >>>>>   │ Column           │  -O2 (MB/s)  │  -O3 (MB/s)  │ Speedup │
> >>>>>   ├──────────────────┼──────────────┼──────────────┼─────────┤
> >>>>>   │ valence          │     3,155    │     5,523    │  1.75x  │
> >>>>>   │ danceability     │     3,233    │     5,685    │  1.76x  │
> >>>>>   │ energy           │     3,197    │     5,652    │  1.77x  │
> >>>>>   │ loudness         │     3,186    │     5,473    │  1.72x  │
> >>>>>   └──────────────────┴──────────────┴──────────────┴─────────┘
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 25, 2026 at 9:49 AM PRATEEK GAUR <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> @Micah Kornfield <[email protected]> : Got it.
> >>>>>>
> >>>>>> @Andrew Lamb <[email protected]>
> >>>>>>
> >>>>>>
> >>>>>>> Do you think it would be good to start moving the spec development
> >>>>>>> into markdown format, in preparation for finalizing it?
> >>>>>>>
> >>>>>>
> >>>>>> Yes, I'll update the numbers for some of the examples I have in the
> >>>>>> spec based on the updated header size. Then we should be good to go
> >>>>>> for the markdown format.
> >>>>>>
> >>>>>> Thanks everyone!
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Andrew
> >>>>>>>
> >>>>>>> On Tue, Feb 17, 2026 at 7:28 PM PRATEEK GAUR <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> > Hi team,
> >>>>>>> >
> >>>>>>> > 1) Andrew
> >>>>>>> >
> >>>>>>> >    - Thanks for working on test files. My PR did add all the test
> >>>>>>> >    files I used to benchmark on datasets. Maybe we can combine them.
> >>>>>>> >    That will also aid cross-language testing.
> >>>>>>> >    - Kosta Tarasov is working on the Rust implementation. This is
> >>>>>>> >    great. Thanks!
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > 2) Antoine
> >>>>>>> >
> >>>>>>> >    - Thanks a lot for reporting the numbers on AMD. Looks like you
> >>>>>>> >    are getting 8X the decoding performance of BSS. This is amazing!
> >>>>>>> >    - Thanks for acknowledging the sampling design.
> >>>>>>> >    - I agree with you on FastLanes. In some crude experiments I
> >>>>>>> >    didn't get a good perf benefit from it on Graviton3 (but maybe
> >>>>>>> >    there was something wrong with my implementation).
> >>>>>>> >    - Locking the 16-bit exception encoding for the spec in this
> >>>>>>> >    case.
> >>>>>>> >    - Awesome, I think we have solved all open questions except the
> >>>>>>> >    version byte :). (Will get back on this soon.)
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > 3) Micah
> >>>>>>> >
> >>>>>>> >    - FastLanes : The current spec does allow for using FastLanes
> >>>>>>> >    with the configurable enum value for layout. We should be able
> >>>>>>> >    to inject any layout in the current design.
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > Working on resolving all remaining open comments on the spec this
> >>>>>>> > week.
> >>>>>>> >
> >>>>>>> > Best
> >>>>>>> > Prateek
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Tue, Feb 10, 2026 at 3:37 AM Steve Loughran <[email protected]>
> >>>>>>> > wrote:
> >>>>>>> >
> >>>>>>> > > On Sun, 8 Feb 2026 at 18:12, Micah Kornfield <[email protected]>
> >>>>>>> > > wrote:
> >>>>>>> > >
> >>>>>>> > > >
> >>>>>>> > > >
> >>>>>>> > > > It looks like the actual issue described for ORC in the paper
> >>>>>>> > > > is that it has multiple sub-encodings in a batch.  This is
> >>>>>>> > > > different than the design proposed here, where there is still
> >>>>>>> > > > a fixed encoding per page in Parquet. Given reasonably sized
> >>>>>>> > > > pages, I don't think branch misprediction should be a big
> >>>>>>> > > > issue for new encodings.  I agree that we should be
> >>>>>>> > > > conservative in general about adding new encodings.
> >>>>>>> > > >
> >>>>>>> > > >
> >>>>>>> > > +1
> >>>>>>> > >
> >>>>>>> >
> >>>>>>>
> >>>>>>
>
