Hello,

Thanks for doing this, and I agree the numbers look impressive.

If possible, I would ask for a few more data points:

1. More datasets: you could, for example, look at the datasets that were
used to originally evaluate BYTE_STREAM_SPLIT (see
https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
the Google Doc linked there)
 
2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
(see the sketch after this list)

3. Optionally, some perf numbers on x86 too, but I expect that ALP will
remain very good there as well
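
For point 2, here is a rough pyarrow sketch of how such numbers could be
gathered. The data, column name and file names below are just
placeholders; the actual measurements should of course run over the
benchmark datasets:

    import os
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Placeholder data: 1M random float64 values in a single column "x".
    table = pa.table({"x": np.random.normal(size=1_000_000)})

    for codec in ("NONE", "LZ4", "ZSTD"):
        path = f"bss_{codec.lower()}.parquet"
        pq.write_table(
            table,
            path,
            compression=codec,
            use_dictionary=False,         # so BYTE_STREAM_SPLIT is used
            use_byte_stream_split=["x"],  # apply the encoding to this column
        )
        print(codec, os.path.getsize(path) / len(table), "bytes per value")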
   
   
I also have the following reservations about ALP:

1. There is no published official spec AFAICT, just a research paper.

2. I may be missing something, but the paper doesn't seem to mention
non-finite values such as +/-Inf and NaN (see the short sketch after
this list for why this matters).

3. It seems there is a single implementation, the one published together
with the paper. It is not obvious that it will be maintained in the
future, and reusing it is probably not an option for non-C++ Parquet
implementations.
   
4. The encoding itself is complex, since it involves a fallback to
another encoding when the primary encoding (which constitutes the real
innovation) doesn't work out on a piece of data.
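
To illustrate point 2: as I understand the scheme from the paper, ALP
keeps a value only if scaling it by a power of ten, rounding to an
integer and decoding reproduces the exact original double; everything
else becomes an exception. A deliberately simplified sketch of that
check (the exponent is hard-coded here, whereas ALP searches for it):

    import math

    def alp_roundtrips(value, e):
        # Simplified version of the ALP round-trip check: scale by a power
        # of ten, round to an integer, and require that decoding gives back
        # the exact original double.
        scaled = value * 10.0 ** e
        if not math.isfinite(scaled):
            return False   # +/-Inf and NaN can never be integerized
        return int(round(scaled)) / 10.0 ** e == value

    for v in (1.25, 0.1, float("inf"), float("nan")):
        print(v, alp_roundtrips(v, e=2))

Non-finite values therefore always land in the exception path, and the
paper does not say how those exceptions should be represented.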


Based on this, I would say that if we think ALP is attractive for us,
we may want to incorporate our own variant of ALP with the following
changes (a rough sketch of the resulting encoder is given after the
list):

1. Design and write our own Parquet-ALP spec so that implementations
know exactly how to encode and represent data

2. Do not include the ALPrd fallback, which is a homegrown dictionary
encoding without dictionary reuse across pages, and instead rely on a
well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)

3. Replace the FOR (frame-of-reference) encoding inside ALP, which aims
at compressing the resulting integers efficiently, with our own
DELTA_BINARY_PACKED (which has the same qualities and is already
available in Parquet implementations)
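
To make this a bit more concrete, here is a rough page-level sketch of
what such an encoder could look like. The exponent search, the exception
threshold and the exception format are placeholders of mine, not part of
any spec:

    import math

    MAX_EXCEPTION_RATE = 0.05  # assumed threshold, not taken from the paper

    def integerize(values, e):
        # Same round-trip idea as in the earlier sketch, over a whole page.
        ints, exceptions = [], []
        for i, v in enumerate(values):
            scaled = v * 10.0 ** e
            ok = math.isfinite(scaled) and int(round(scaled)) / 10.0 ** e == v
            if ok:
                ints.append(int(round(scaled)))
            else:
                ints.append(0)          # placeholder, patched on decode
                exceptions.append((i, v))
        return ints, exceptions

    def encode_page(values):
        # Naive exponent search, just for the sketch.
        e, ints, exc = min(
            ((e, *integerize(values, e)) for e in range(16)),
            key=lambda t: len(t[2]),
        )
        if len(exc) <= MAX_EXCEPTION_RATE * len(values):
            # The integers would then go through DELTA_BINARY_PACKED,
            # with the exceptions stored alongside.
            return ("ALP + DELTA_BINARY_PACKED", e, ints, exc)
        # Too many exceptions: fall back to a well-known encoding.
        return ("BYTE_STREAM_SPLIT", None, values, [])

    print(encode_page([1.25, 2.5, 0.1, 3.75])[0])  # -> ALP path
    print(encode_page([float("nan")] * 4)[0])      # -> BYTE_STREAM_SPLIT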

Regards

Antoine.



On Thu, 16 Oct 2025 14:47:33 -0700
PRATEEK GAUR <[email protected]> wrote:
> Hi team,
> 
> We spent some time evaluating ALP compression and decompression compared to
> other encoding alternatives like CHIMP/GORILLA and compression techniques
> like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> on October 15th in the biweekly Parquet meeting. (I can't seem to access
> the recording, so please let me know what access I need in order to view
> it.)
> 
> We did this evaluation over some datasets pointed to by the ALP paper and
> some pointed to by the Parquet community.
> 
> The results are available in the following document:
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> 
> Based on the numbers we see
> 
>    - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
>    much better than other schemes. (Numbers in the sheet are bytes needed
>    to encode each value.)
>    - ALP does quite well in terms of decompression speed. (Numbers in the
>    sheet are bytes decompressed per second.)
> 
> As next steps we will
> 
>    - Get the numbers for compression on top of byte stream split.
>    - Evaluate the algorithm over a few more datasets.
>    - Have an implementation in the arrow-parquet repo.
> 
> Looking forward to feedback from the community.
> 
> Best
> Prateek and Dhirhan
> 


