Hello,
Thanks for doing this, and I agree the numbers look impressive. If possible, I would ask for more data points:

1. More datasets: you could, for example, look at the datasets that were used to originally evaluate BYTE_STREAM_SPLIT (see https://issues.apache.org/jira/browse/PARQUET-1622 and specifically the Google Doc linked there).
2. A comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD.
3. Optionally, some perf numbers on x86 too, but I expect that ALP will remain very good there as well.

I also have the following reservations about ALP:

1. There is no published official spec AFAICT, just a research paper.
2. I may be missing something, but the paper doesn't seem to mention non-finite values (such as +/-Inf and NaNs).
3. There seems to be a single implementation, which is the one published together with the paper. It is not obvious that it will be maintained in the future, and reusing it is probably not an option for non-C++ Parquet implementations.
4. The encoding itself is complex, since it involves a fallback on another encoding whenever the primary encoding (which constitutes the real innovation) doesn't work out on a piece of data.

Based on this, I would say that if we think ALP is attractive for us, we may want to incorporate our own version of ALP with the following changes:

1. Design and write our own Parquet-ALP spec so that implementations know exactly how to encode and represent data.
2. Do not include the ALPrd fallback, which is a homegrown dictionary encoding without dictionary reuse across pages, and instead rely on a well-known Parquet encoding (such as BYTE_STREAM_SPLIT?).
3. Replace the FOR encoding inside ALP, which aims at compressing integers efficiently, with our own DELTA_BINARY_PACKED (which has the same qualities and is already available in Parquet implementations). A rough sketch of how these pieces could fit together follows below.
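
To make that concrete, here is a minimal sketch (Python, for illustration only) of what the primary encoding could look like with those substitutions. The function names, the choice of (e, f), and the exception layout are all hypothetical; ALP itself additionally uses a sampling-based search for the best (e, f) pair and a fast rounding trick, both omitted here:

    import math

    def alp_style_encode(values, e, f):
        """Scale doubles to integers by 10**e / 10**f; values that do not
        round-trip exactly (or are non-finite) become patched exceptions."""
        ints, exceptions = [], []
        for i, v in enumerate(values):
            if not math.isfinite(v):           # reservation 2: +/-Inf, NaN
                ints.append(0)                 # placeholder, patched on decode
                exceptions.append((i, v))
                continue
            enc = round(v * 10.0**e * 10.0**-f)
            if enc * 10.0**f * 10.0**-e == v:  # exact round-trip check
                ints.append(enc)
            else:
                ints.append(0)                 # this value is an exception
                exceptions.append((i, v))
        # `ints` would then go through DELTA_BINARY_PACKED (instead of ALP's
        # FOR + bit-packing), and the exception values through a standard
        # encoding such as BYTE_STREAM_SPLIT (instead of ALPrd).
        return ints, exceptions

    def alp_style_decode(ints, exceptions, e, f):
        out = [enc * 10.0**f * 10.0**-e for enc in ints]
        for i, v in exceptions:                # patch exceptions back in
            out[i] = v
        return out

The part a Parquet-ALP spec would really have to pin down is the page layout of the (e, f) metadata and of the exception positions/values, which the paper leaves to its reference implementation.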
Regards

Antoine.

On Thu, 16 Oct 2025 14:47:33 -0700
PRATEEK GAUR <[email protected]> wrote:

> Hi team,
> 
> We spent some time evaluating ALP compression and decompression compared
> to other encoding alternatives like CHIMP/GORILLA and compression
> techniques like SNAPPY/LZ4/ZSTD. We presented these numbers to the
> community members on October 15th in the biweekly parquet meeting. (I
> can't seem to access the recording, so please let me know what access
> rules I need to get to be able to view it.)
> 
> We did this evaluation over some datasets pointed to by the ALP paper and
> some pointed to by the parquet community.
> 
> The results are available in the following document:
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> 
> Based on the numbers we see:
> 
> - ALP is comparable to ZSTD (level=1) in terms of compression ratio and
>   much better than the other schemes (the numbers in the sheet are bytes
>   needed to encode each value).
> - ALP does quite well in terms of decompression speed (the numbers in the
>   sheet are bytes decompressed per second).
> 
> As next steps we will:
> 
> - Get the numbers for compression on top of byte stream split.
> - Evaluate the algorithm over a few more datasets.
> - Have an implementation in the arrow-parquet repo.
> 
> Looking forward to feedback from the community.
> 
> Best,
> Prateek and Dhirhan