Re: [Parquet] ALP Encoding for Floating point data

Antoine Pitrou Mon, 24 Nov 2025 01:18:07 -0800


I would recommend to not get carried away with AVX512, as it's still
missing from many recent Intel CPUs. AVX2 is the current sweet spot for
SIMD on x86, IMHO.


Regards

Antoine.


On Sat, 22 Nov 2025 12:49:15 +0000
Steve Loughran <[email protected]> wrote:
> First, sorry: I think I accidentally marked as done the comment in the doc
> about x86 performance.
> 
> Those x86 numbers are critical, especially AVX512 in a recent intel part.
> There's a notorious feature in the early ones where the cores would reduce
> frequency after you used the opcodes as a way of managing die temperature (
> https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency
> ); the later ones and AMD models are the ones to worry about.
> 
> FWIW in hadoop we are starting to see RISC-V PRs for CRC performance, which
> boosts throughput reading data from hdfs or even locally if you haven't
> turned crc checks off. I wouldn't worry about RISC-V for parquet FP *yet*,
> but it's interesting to see that work appearing, especially in the context
> of the EU's active development of a sovereign cloud (i.e. one the US govt
> can't disable on an order from their president)
> https://cordis.europa.eu/project/id/101092993
> 
> 
> 
> On Sat, 22 Nov 2025 at 04:15, Prateek Gaur via dev 
> <[email protected]>
> wrote:
> 
> > Hi team,
> >
> > *ALP ---> ALP PseudoDecimal*
> >
> > As is visible from the numbers above, and as the paper also states, it is
> > very difficult to get a good compression ratio for real double values,
> > i.e. high-precision values.
> >
> > This combined with the fact that we want to keep the spec/implementation
> > simpler, quoting Antoine directly here:
> >
> > `*2. Do not include the ALPrd fallback which is a homegrown dictionary
> > encoding without dictionary reuse across pages, and instead rely on a
> > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
> >
> > Also based on some discussion I had with Julien in person and the biweekly
> > meeting with a number of you.
> >
> > We'll be going with ALPpd (pseudo decimal) as the first
> > implementation, relying on the query engine to decide, based on its own
> > heuristics, on the right fallback to BYTE_STREAM_SPLIT or ZSTD.
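
For readers skimming the thread, the pseudo-decimal idea can be sketched
roughly as follows. This is a simplified illustration under stated
assumptions, not the proposed Parquet encoding: it tries a single decimal
exponent per vector (the ALP paper uses an exponent/factor pair followed by
frame-of-reference bit-packing), and the names alp_pd_encode / alp_pd_decode
are hypothetical. Values that do not round-trip bit-exactly, including NaN,
+/-Inf and -0.0, are kept aside as exceptions.

```python
import math
import struct


def _bits(x):
    # Raw IEEE-754 bytes, so that -0.0 and NaN are compared exactly.
    return struct.pack("<d", x)


def alp_pd_encode(values, max_exponent=18):
    """Return (exponent, ints, exceptions) for the exponent with fewest exceptions."""
    best = None
    for e in range(max_exponent + 1):
        scale = 10.0 ** e
        ints, exceptions = [], {}
        for i, v in enumerate(values):
            scaled = v * scale
            if math.isfinite(v) and abs(scaled) < 2 ** 63:
                d = round(scaled)
                # Keep the integer only if decoding restores the exact double.
                if _bits(d / scale) == _bits(v):
                    ints.append(d)
                    continue
            ints.append(0)        # placeholder, patched on decode
            exceptions[i] = v     # stored verbatim (position -> value)
        if best is None or len(exceptions) < len(best[2]):
            best = (e, ints, exceptions)
    return best


def alp_pd_decode(exponent, ints, exceptions):
    scale = 10.0 ** exponent
    out = [d / scale for d in ints]
    for i, v in exceptions.items():
        out[i] = v
    return out
```

A writer could, for instance, look at the share of exceptions for a page and
pick a different encoding when it is high; that threshold is a heuristic left
to the engine and is not specified anywhere in this thread.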
> >
> > Best
> > Prateek
> >
> >
> >
> > On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <[email protected]>
> > wrote:
> >  
> > > Sheet with numbers
> > > <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>.
> > >
> > > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR 
> > > <[email protected]> wrote:
> > >  
> > >> Hi team,
> > >>
> > >> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
> > >> remember correctly, to perform the experiment on some of the papers that
> > >> talked about BYTE_STREAM_SPLIT for completeness.
> > >> I wanted to share the numbers for the same in this sheet. At this point
> > >> we have numbers on a wide variety of data.
> > >> (Will have to share the sheet from my snowflake account as our laptops
> > >> have fair bit of restriction with respect to copy paste permissions :) )
> > >>
> > >> Best
> > >> Prateek
> > >>
> > >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]>  
> > wrote:  
> > >>  
> > >>> Hi Julien,
> > >>>
> > >>> Yes based on
> > >>>
> > >>>    - Numbers presented
> > >>>    - Discussions over the doc and
> > >>>    - Multiple discussions in the biweekly meeting
> > >>>
> > >>> We are in a stage where we agree this is the right encoding to add and
> > >>> we can move to the DRAFT/POC stage from DISCUSS stage.
> > >>> Will start working on the PR for the same.
> > >>>
> > >>> Thanks for bringing this up.
> > >>> Prateek
> > >>>
> > >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]>  
> > wrote:  
> > >>>  
> > >>>> @PRATEEK GAUR <[email protected]> : Would you agree that we are past
> > >>>> the DISCUSS step and into the DRAFT/POC phase according to the
> > >>>> proposals process
> > >>>> <https://github.com/apache/parquet-format/tree/master/proposals>?
> > >>>> If yes, could you open a PR on this page to add this proposal to the
> > >>>> list?
> > >>>> https://github.com/apache/parquet-format/tree/master/proposals
> > >>>> Thank you!
> > >>>>
> > >>>>
> > >>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]>
> > >>>> wrote:
> > >>>>  
> > >>>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
> > >>>> > Rust Parquet reader if anyone is interested
> > >>>> >
> > >>>> > Andrew
> > >>>> >
> > >>>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
> > >>>> >
> > >>>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <  
> > >>>> [email protected]>  
> > >>>> > wrote:
> > >>>> >  
> > >>>> > > >
> > >>>> > > > C++, Java and Rust support them for sure. I feel like we should
> > >>>> > > > probably default to V2 at some point.  
> > >>>> > >
> > >>>> > >
> > >>>> > > I seem to recall, some of the vectorized java readers (Iceberg,
> > >>>> > > Spark) might not support V2 data pages (but I might be confusing
> > >>>> > > this with encodings).  But this is only a vague recollection.
> > >>>> > >
> > >>>> > >
> > >>>> > >
> > >>>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <  
> > [email protected]  
> > >>>> >  
> > >>>> > > wrote:
> > >>>> > >  
> > >>>> > > > > Someone has to add V2 data pages to
> > >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > >>>> > > > > :)
> > >>>> > > >
> > >>>> > > > Your wish is my command:
> > >>>> > > > https://github.com/apache/parquet-site/pull/124
> > >>>> > > >
> > >>>> > > > As the format grows in popularity and momentum builds to evolve, I
> > >>>> > > > feel the content on the parquet.apache.org site could use
> > >>>> > > > refreshing / updating. So, while I had the site open, I made some
> > >>>> > > > other PRs to scratch various itches
> > >>>> > > >
> > >>>> > > > (I am absolutely 🎣 for someone to please review 🙏):
> > >>>> > > >
> > >>>> > > > 1. Add Variant/Geometry/Geography types to implementation status
> > >>>> > > > matrix: https://github.com/apache/parquet-site/pull/123
> > >>>> > > > 2. Improve introduction / overview, add more links to spec and
> > >>>> > > > implementation status: https://github.com/apache/parquet-site/pull/125
> > >>>> > > >
> > >>>> > > >
> > >>>> > > > Thanks,
> > >>>> > > > Andrew
> > >>>> > > >
> > >>>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <  
> > >>>> [email protected]>  
> > >>>> > > wrote:  
> > >>>> > > >  
> > >>>> > > > >
> > >>>> > > > > Hi Julien, hi all,
> > >>>> > > > >
> > >>>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
> > >>>> > > > > Julien Le Dem <[email protected]> 
> > >>>> > > > > wrote:  
> > >>>> > > > > >
> > >>>> > > > > > Another question from me:
> > >>>> > > > > >
> > >>>> > > > > > Since the goal is to not use compression at all in this case (no
> > >>>> > > > > > ZSTD) I'm assuming we would be using either:
> > >>>> > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> > >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
> > >>>> > > > > > field.
> > >>>> > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> > >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
> > >>>> > > > > > field
> > >>>> > > > > > The second helping decide if we can selectively compress some
> > >>>> > > > > > pages if they are less compressed by the
> > >>>> > > > > > A few years ago there was a question on the support of the
> > >>>> > > > > > DATA_PAGE_V2 and I was curious to hear a refresh on how that's
> > >>>> > > > > > generally supported in Parquet implementations. The is_compressed
> > >>>> > > > > > field was exactly intended to avoid block compression when the
> > >>>> > > > > > encoding itself is good enough.
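
A rough sketch of the per-page decision that is_compressed allows, assuming a
writer that only keeps the compressed form when it saves a minimum fraction
of the bytes. The function name encode_page_values and the min_saving
threshold are hypothetical, and zlib merely stands in for whatever codec the
column chunk declares (e.g. ZSTD); a real writer would also have to fill in
the page header sizes.

```python
import zlib


def encode_page_values(encoded_values, min_saving=0.05):
    """Return (payload, is_compressed) for the values section of a V2 data page.

    `encoded_values` is the already-encoded values section (bytes). In a V2
    data page only this section is subject to block compression; repetition
    and definition levels are stored uncompressed either way.
    """
    compressed = zlib.compress(encoded_values)  # stand-in for the column codec
    if len(compressed) <= len(encoded_values) * (1.0 - min_saving):
        return compressed, True
    # The encoding was already good enough: skip block compression for this page.
    return encoded_values, False
```

With this approach a column chunk can declare a codec once while individual
pages opt out of it, which is the selective behaviour discussed above.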
> > >>>> > > > >
> > >>>> > > > > Someone has to add V2 data pages to
> > >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> > >>>> > > > > :)
> > >>>> > > > >
> > >>>> > > > > C++, Java and Rust support them for sure. I feel like we should
> > >>>> > > > > probably default to V2 at some point.
> > >>>> > > > >
> > >>>> > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> > >>>> > > > >
> > >>>> > > > > Regards
> > >>>> > > > >
> > >>>> > > > > Antoine.
> > >>>> > > > >
> > >>>> > > > >  
> > >>>> > > > > >
> > >>>> > > > > > Julien
> > >>>> > > > > >
> > >>>> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb  
> > >>>> > > > > <andrewlamb11-Re5JQEeQqe8-XMD5yJDbdMReXY1tMh2IBgC/[email protected]>
> > >>>> > > > >  wrote:  
> > >>>> > > > > >  
> > >>>> > > > > > > Thanks again Prateek and co for pushing this along!
> > >>>> > > > > > >
> > >>>> > > > > > >  
> > >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
> > >>>> > > > > > > > implementations know exactly how to encode and represent data
> > >>>> > > > > > >
> > >>>> > > > > > > 100% agree with this (similar to what was done for ParquetVariant)
> > >>>> > > > > > >
> > >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
> > >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
> > >>>> > > > > > >
> > >>>> > > > > > > I think they are handled via the "Exception" mechanism. Vortex's
> > >>>> > > > > > > ALP implementation (below) does appear to handle finite numbers[2]
> > >>>> > > > > > >
> > >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
> > >>>> > > > > > > > published together with the paper. It is not obvious that it
> > >>>> > > > > > > > will be maintained in the future, and reusing it is probably
> > >>>> > > > > > > > not an option for non-C++ Parquet implementations
> > >>>> > > > > > >
> > >>>> > > > > > > My understanding from the call was that Prateek and team
> > >>>> > > > > > > re-implemented ALP (did not use the implementation from CWI[3])
> > >>>> > > > > > > but that would be good to confirm.
> > >>>> > > > > > >
> > >>>> > > > > > > There is also a Rust implementation of ALP[1] that is part of the
> > >>>> > > > > > > Vortex file format implementation. I have not reviewed it to see
> > >>>> > > > > > > if it deviates from the algorithm presented in the paper.
> > >>>> > > > > > >
> > >>>> > > > > > > Andrew
> > >>>> > > > > > >
> > >>>> > > > > > > [1]:
> > >>>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > >>>> > > > > > > [2]:
> > >>>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > >>>> > > > > > > [3]: https://github.com/cwida/ALP
> > >>>> > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou  
> > >>>> > > > > <antoine-+ZN9ApsXKcFQFI55V6+gNQ-XMD5yJDbdMReXY1tMh2IBti2O/[email protected]>
> > >>>> > > > >  wrote:  
> > >>>> > > > > > >  
> > >>>> > > > > > > >
> > >>>> > > > > > > > Hello,
> > >>>> > > > > > > >
> > >>>> > > > > > > > Thanks for doing this and I agree the numbers look impressive.
> > >>>> > > > > > > >
> > >>>> > > > > > > > I would ask if possible for more data points:
> > >>>> > > > > > > >
> > >>>> > > > > > > > 1. More datasets: you could for example look at the datasets
> > >>>> > > > > > > > that were used to originally evaluate BYTE_STREAM_SPLIT (see
> > >>>> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and
> > >>>> > > > > > > > specifically the Google Doc linked there)
> > >>>> > > > > > > >
> > >>>> > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT +
> > >>>> > > > > > > > ZSTD
> > >>>> > > > > > > >
> > >>>> > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that
> > >>>> > > > > > > > ALP will remain very good there as well
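
As a reminder of what the BYTE_STREAM_SPLIT baseline does before LZ4 or ZSTD
is applied: for 8-byte doubles it scatters byte k of every value into stream
k and concatenates the 8 streams. The sketch below is only an illustration
of that transform; the helper names are made up and it ignores page framing.

```python
import struct


def byte_stream_split(values):
    """Split little-endian doubles into 8 byte streams, concatenated."""
    raw = struct.pack(f"<{len(values)}d", *values)
    # Stream k holds byte k of every value, in value order.
    return b"".join(raw[k::8] for k in range(8))


def byte_stream_unsplit(data):
    """Inverse transform: interleave the 8 streams back into doubles."""
    n = len(data) // 8
    raw = bytearray(len(data))
    for k in range(8):
        raw[k::8] = data[k * n:(k + 1) * n]
    return list(struct.unpack(f"<{n}d", bytes(raw)))
```

The transform itself saves no space; it groups sign/exponent bytes and
mantissa bytes together so that a general-purpose compressor applied
afterwards tends to find longer repeats.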
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > I also have the following reservations towards ALP:
> > >>>> > > > > > > >
> > >>>> > > > > > > > 1. There is no published official spec AFAICT, just a research
> > >>>> > > > > > > > paper.
> > >>>> > > > > > > >
> > >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
> > >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
> > >>>> > > > > > > >
> > >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
> > >>>> > > > > > > > published together with the paper. It is not obvious that it
> > >>>> > > > > > > > will be maintained in the future, and reusing it is probably
> > >>>> > > > > > > > not an option for non-C++ Parquet implementations
> > >>>> > > > > > > >
> > >>>> > > > > > > > 4. The encoding itself is complex, since it involves a fallback
> > >>>> > > > > > > > on another encoding if the primary encoding (which constitutes
> > >>>> > > > > > > > the real innovation) doesn't work out on a piece of data.
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > Based on this, I would say that if we think ALP is attractive
> > >>>> > > > > > > > for us, we may want to incorporate our own version of ALP with
> > >>>> > > > > > > > the following changes:
> > >>>> > > > > > > >
> > >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
> > >>>> > > > > > > > implementations know exactly how to encode and represent data
> > >>>> > > > > > > >
> > >>>> > > > > > > > 2. Do not include the ALPrd fallback which is a homegrown
> > >>>> > > > > > > > dictionary encoding without dictionary reuse across pages, and
> > >>>> > > > > > > > instead rely on a well-known Parquet encoding (such as
> > >>>> > > > > > > > BYTE_STREAM_SPLIT?)
> > >>>> > > > > > > >
> > >>>> > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at
> > >>>> > > > > > > > compressing integers efficiently, with our own
> > >>>> > > > > > > > DELTA_BINARY_PACKED (which has the same qualities and is
> > >>>> > > > > > > > already available in Parquet implementations)
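
To make the FOR versus DELTA_BINARY_PACKED comparison concrete, here is a
small sketch that only computes the bit width each scheme would need for one
run of integers. It is a simplification under stated assumptions: the
function names are hypothetical, and the real DELTA_BINARY_PACKED encoding
works on blocks and miniblocks with a separate width per miniblock.

```python
def for_layout(ints):
    """Frame of reference: store min(ints) once, then (v - min) per value."""
    base = min(ints)
    width = max(v - base for v in ints).bit_length()
    return base, width


def delta_layout(ints):
    """Delta packing: store the first value and the minimum delta once,
    then (delta - min_delta) per remaining value."""
    deltas = [b - a for a, b in zip(ints, ints[1:])]
    if not deltas:
        return ints[0], 0, 0
    min_delta = min(deltas)
    width = max(d - min_delta for d in deltas).bit_length()
    return ints[0], min_delta, width
```

For steadily increasing integers the delta form needs fewer bits per value,
while for values that jump around inside a narrow range the two end up
similar; which case dominates for ALP-encoded integers is an empirical
question that this sketch does not answer.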
> > >>>> > > > > > > >
> > >>>> > > > > > > > Regards
> > >>>> > > > > > > >
> > >>>> > > > > > > > Antoine.
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
> > >>>> > > > > > > > PRATEEK GAUR 
> > >>>> > > > > > > > <[email protected]> 
> > >>>> > > > > > > > wrote:  
> > >>>> > > > > > > > > Hi team,
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > We spent some time evaluating ALP compression and decompression
> > >>>> > > > > > > > > compared to other encoding alternatives like CHIMP/GORILLA and
> > >>>> > > > > > > > > compression techniques like SNAPPY/LZ4/ZSTD. We presented these
> > >>>> > > > > > > > > numbers to the community members on October 15th in the
> > >>>> > > > > > > > > biweekly parquet meeting. ( I can't seem to access the
> > >>>> > > > > > > > > recording, so please let me know what access rules I need to
> > >>>> > > > > > > > > get to be able to view it )
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > We did this evaluation over some datasets pointed to by the ALP
> > >>>> > > > > > > > > paper and some pointed to by the parquet community.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > The results are available in the following document
> > >>>> > > > > > > > > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
> > >>>> > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Based on the numbers we see
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of compression
> > >>>> > > > > > > > >    ratio and much better compared to other schemes. (numbers in
> > >>>> > > > > > > > >    the sheet are bytes needed to encode each value )
> > >>>> > > > > > > > >    - ALP is doing quite well in terms of decompression speed
> > >>>> > > > > > > > >    (numbers in the sheet are bytes decompressed per second)
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > As next steps we will
> > >>>> > > > > > > > >
> > >>>> > > > > > > > >    - Get the numbers for compression on top of byte stream split.
> > >>>> > > > > > > > >    - Evaluate the algorithm over a few more datasets.
> > >>>> > > > > > > > >    - Have an implementation in the arrow-parquet repo.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Looking forward to feedback from the community.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Best
> > >>>> > > > > > > > > Prateek and Dhirhan
> > >>>> > > > > > > > >  
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > > >  
> > >>>> > > > > > >  
> > >>>> > > > > >  
> > >>>> > > > >
> > >>>> > > > >
> > >>>> > > > >
> > >>>> > > > >  
> > >>>> > > >  
> > >>>> > >  
> > >>>> >  
> > >>>>  
> > >>>  
> >  
>
