Re: [Parquet] ALP Encoding for Floating point data

Robert Kruszewski Tue, 21 Oct 2025 00:51:26 -0700

The ALP implementation in Vortex behaves the same logically. However, since we 
support stacked encodings, we don't need to handle the full compression flow, 
just the Float -> Integer transformation that ALP offers.
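
For readers following along, here is a minimal sketch of that Float -> Integer
step, written from the description in the ALP paper and not taken from the
Vortex code. Each value is scaled by 10^e * 10^-f and rounded, and any value
that does not decode back to the identical bits is set aside as an exception.
The names (alp_encode, alp_decode, pow10, bits), the dict used for exceptions
and the 0 placeholder are assumptions made for illustration only.

```python
import math
import struct

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def pow10(k):
    # Correctly rounded power of ten (hypothetical helper; a lookup table
    # would be the usual choice in a real encoder).
    return float(f"1e{k}")

def bits(x):
    # Raw IEEE-754 bytes, so that -0.0 and 0.0 are told apart.
    return struct.pack("<d", x)

def decode_one(n, e, f):
    return n * pow10(f) * pow10(-e)

def alp_encode(values, e, f):
    """Return (ints, exceptions); exceptions maps position -> original float."""
    ints, exceptions = [], {}
    for i, v in enumerate(values):
        scaled = v * pow10(e) * pow10(-f)
        if math.isfinite(scaled):
            n = round(scaled)
            # Keep the integer only if it fits in 64 bits and round-trips exactly.
            if INT64_MIN <= n <= INT64_MAX and bits(decode_one(n, e, f)) == bits(v):
                ints.append(n)
                continue
        ints.append(0)       # placeholder slot, patched on decode
        exceptions[i] = v
    return ints, exceptions

def alp_decode(ints, exceptions, e, f):
    out = [decode_one(n, e, f) for n in ints]
    for i, v in exceptions.items():
        out[i] = v           # restore values that did not round-trip
    return out
```

In a stacked setup, the `ints` output would then be handed to a separate
integer encoding, and only the exceptions need to be stored verbatim.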


The paper indeed doesn't mention how special float values are handled, but the 
logic exists in the published C++ implementation. Vortex leverages differences 
in cast semantics between Rust and C++ to make encoding faster:
https://spiraldb.com/post/alp-rust-is-faster-than-c
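
To make the cast point concrete: a float-to-integer `as` cast in Rust is
defined for every input (NaN becomes 0, out-of-range values saturate), while
the same conversion in C++ is undefined behaviour when the value does not fit.
One way such a difference can help is that the encoder can cast first and let
the round-trip comparison flag NaN and +/-Inf as exceptions, with no separate
special-value branch. The Python below only emulates that idea; it is a sketch,
not the Vortex code, and rust_like_cast, is_exception and the fixed exponent
are illustrative assumptions.

```python
import math
import struct

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
MAGIC = 2.0**52 + 2.0**51  # adding then subtracting rounds a small double to an integer

def rust_like_cast(x):
    # Emulates Rust's `x as i64` for an f64: NaN -> 0, out-of-range values
    # saturate, everything else truncates toward zero.
    if math.isnan(x):
        return 0
    if x >= 2.0**63:
        return INT64_MAX
    if x <= -(2.0**63):
        return INT64_MIN
    return int(x)

def is_exception(v, e=2):
    # No explicit NaN/Inf check: special values simply fail the round trip.
    factor = float(f"1e{e}")
    inverse = float(f"1e-{e}")
    rounded = (v * factor + MAGIC) - MAGIC   # NaN and Inf pass through unchanged
    n = rust_like_cast(rounded)
    decoded = n * inverse
    return struct.pack("<d", decoded) != struct.pack("<d", v)
```

Here `is_exception(float("nan"))` and `is_exception(float("inf"))` are both
true because the saturated or zeroed integer decodes to an ordinary finite
number, while a value such as 1.5 round-trips and is kept as an integer.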

On Mon, 20 Oct 2025, at 19:56, Andrew Lamb wrote:
> Thanks again Prateek and co for pushing this along!
> 
> 
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
> 
> 100% agree with this (similar to what was done for ParquetVariant)
> 
> > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> 
> I think they are handled via the "Exception" mechanism. Vortex's ALP
> implementation (below) does appear to handle non-finite numbers[2]
> 
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
> 
> My understanding from the call was that Prateek and team re-implemented
> ALP (did not use the implementation from CWI[3]), but that would be good to
> confirm.
> 
> There is also a Rust implementation of ALP[1] that is part of the Vortex
> file format implementation. I have not reviewed it to see if it deviates
> from the algorithm presented in the paper.
> 
> Andrew
> 
> [1]:
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> [2]:
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> [3]: https://github.com/cwida/ALP
> 
> 
> On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:
> 
> >
> > Hello,
> >
> > Thanks for doing this and I agree the numbers look impressive.
> >
> > I would ask if possible for more data points:
> >
> > 1. More datasets: you could for example look at the datasets that were
> > used to originally evaluate BYTE_STREAM_SPLIT (see
> > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > the Google Doc linked there)
> >
> > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> >
> > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > remain very good there as well
> >
> >
> > I also have the following reservations towards ALP:
> >
> > 1. There is no published official spec AFAICT, just a research paper.
> >
> > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> >
> > 3. It seems there is a single implementation, which is the one published
> > together with the paper. It is not obvious that it will be
> > maintained in the future, and reusing it is probably not an option for
> > non-C++ Parquet implementations
> >
> > 4. The encoding itself is complex, since it involves a fallback on
> > another encoding if the primary encoding (which constitutes the real
> > innovation) doesn't work out on a piece of data.
> >
> >
> > Based on this, I would say that if we think ALP is attractive for us,
> > we may want to incorporate our own version of ALP with the following
> > changes:
> >
> > 1. Design and write our own Parquet-ALP spec so that implementations
> > know exactly how to encode and represent data
> >
> > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > encoding without dictionary reuse across pages, and instead rely on a
> > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> >
> > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > same qualities and is already available in Parquet implementations)
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
> > On Thu, 16 Oct 2025 14:47:33 -0700
> > PRATEEK GAUR <[email protected]> wrote:
> > > Hi team,
> > >
> > > We spent some time evaluating ALP compression and decompression compared to
> > > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > > on October 15th in the biweekly parquet meeting. (I can't seem to access
> > > the recording, so please let me know what access rules I need to get to be
> > > able to view it.)
> > >
> > > We did this evaluation over some datasets pointed to by the ALP paper and
> > > some pointed to by the parquet community.
> > >
> > > The results are available in the following document:
> > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > >
> > > Based on the numbers we see:
> > >
> > >    - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
> > >    much better compared to other schemes (numbers in the sheet are bytes
> > >    needed to encode each value).
> > >    - ALP does quite well in terms of decompression speed (numbers in the
> > >    sheet are bytes decompressed per second).
> > >
> > > As next steps we will
> > >
> > >    - Get the numbers for compression on top of byte stream split.
> > >    - Evaluate the algorithm over a few more datasets.
> > >    - Have an implementation in the arrow-parquet repo.
> > >
> > > Looking forward to feedback from the community.
> > >
> > > Best
> > > Prateek and Dhirhan
> > >
> >
> >
> >
> >
>
