Thanks again Prateek and co for pushing this along!

> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data

100% agree with this (similar to what was done for ParquetVariant)

> 2. I may be missing something, but the paper doesn't seem to mention
non-finite values (such as +/-Inf and NaNs).

I think they are handled via ALP's "exception" mechanism: values that do
not survive the round trip through the decimal encoding (which includes
NaN and +/-Inf) are stored verbatim at their positions and patched back
in on decode. Vortex's ALP implementation (below) does appear to handle
non-finite values this way [2]
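For context, a minimal Python sketch of that idea as I understand it from the paper (illustrative only -- the function names and the `(e, f)` exponent/factor parameters are my own shorthand, not the paper's code, and real implementations pick `(e, f)` per vector by sampling and then FOR/bit-pack the integers):

```python
import math

def alp_encode(values, e, f):
    """Illustrative sketch of ALP's exception mechanism (not the paper's code).

    Each value is scaled to an integer using the per-vector parameters
    (e, f); any value that does not survive the round trip -- including
    NaN and +/-Inf, which are never finite decimals -- is recorded as an
    exception (position, original value) and patched back on decode.
    """
    encoded, exceptions = [], []
    for i, v in enumerate(values):
        if math.isfinite(v):
            digits = round(v * 10.0 ** e * 10.0 ** -f)
            # Real implementations compare bit patterns; == is enough
            # for this sketch (it lets -0.0 slip through, for example).
            if digits * 10.0 ** f * 10.0 ** -e == v:
                encoded.append(digits)
                continue
        encoded.append(0)          # placeholder, overwritten on decode
        exceptions.append((i, v))  # non-finite or lossy value
    return encoded, exceptions

def alp_decode(encoded, exceptions, e, f):
    out = [digits * 10.0 ** f * 10.0 ** -e for digits in encoded]
    for i, v in exceptions:        # patch exceptions back in
        out[i] = v
    return out
```

So a Parquet-ALP spec would mainly need to pin down how the exception positions and values are laid out on the wire, rather than invent new handling for non-finite data.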

> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations

My understanding from the call was that Prateek and team re-implemented
ALP (rather than reusing the implementation from CWI [3]), but it would
be good to confirm.

There is also a Rust implementation of ALP[1] that is part of the Vortex
file format implementation. I have not reviewed it to see if it deviates
from the algorithm presented in the paper.

Andrew

[1]:
https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
[2]:
https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
[3]: https://github.com/cwida/ALP


On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:

>
> Hello,
>
> Thanks for doing this and I agree the numbers look impressive.
>
> I would ask if possible for more data points:
>
> 1. More datasets: you could for example look at the datasets that were
> used to originally evaluate BYTE_STREAM_SPLIT (see
> https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> the Google Doc linked there)
>
> 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
>
> 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> remain very good there as well
>
>
> I also have the following reservations towards ALP:
>
> 1. There is no published official spec AFAICT, just a research paper.
>
> 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).
>
> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations
>
> 4. The encoding itself is complex, since it involves a fallback on
> another encoding if the primary encoding (which constitutes the real
> innovation) doesn't work out on a piece of data.
>
>
> Based on this, I would say that if we think ALP is attractive for us,
> we may want to incorporate our own version of ALP with the following
> changes:
>
> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data
>
> 2. Do not include the ALPrd fallback which is a homegrown dictionary
> encoding without dictionary reuse across pages, and instead rely on a
> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
>
> 3. Replace the FOR encoding inside ALP, which aims at compressing
> integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> same qualities and is already available in Parquet implementations)
>
> Regards
>
> Antoine.
>
>
>
> On Thu, 16 Oct 2025 14:47:33 -0700
> PRATEEK GAUR <[email protected]> wrote:
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression compared
> to
> > other encoding alternatives like CHIMP/GORILLA and compression techniques
> > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> > on October 15th in the biweekly Parquet meeting. (I can't seem to access
> > the recording, so please let me know what access rules I need to be
> > able to view it.)
> >
> > We did this evaluation over some datasets pointed to by the ALP paper
> > and some pointed to by the Parquet community.
> >
> > The results are available in the following document
> > <
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> >
> > :
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >
> > Based on the numbers we see
> >
> >    -  ALP is comparable to ZSTD(level=1) in terms of compression ratio
> and
> >    much better compared to other schemes. (numbers in the sheet are bytes
> >    needed to encode each value )
> >    - ALP does quite well in terms of decompression speed (numbers in the
> >    sheet are bytes decompressed per second)
> >
> > As next steps we will
> >
> >    - Get the numbers for compression on top of byte stream split.
> >    - Evaluate the algorithm over a few more datasets.
> >    - Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best
> > Prateek and Dhirhan
> >
>
>
>
>
