Thanks again Prateek and co for pushing this along!

> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data

100% agree with this (similar to what was done for ParquetVariant)

> 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).

I think they are handled via the "Exception" mechanism. Vortex's ALP
implementation (below) does appear to handle finite numbers[2]

> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations

My understanding from the call was that Prateek and team re-implemented
ALP (did not use the implementation from CWI[3]) but that would be good
to confirm.

There is also a Rust implementation of ALP[1] that is part of the Vortex
file format implementation. I have not reviewed it to see if it deviates
from the algorithm presented in the paper.

Andrew

[1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
[2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
[3]: https://github.com/cwida/ALP

On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:

> Hello,
>
> Thanks for doing this and I agree the numbers look impressive.
>
> I would ask if possible for more data points:
>
> 1. More datasets: you could for example look at the datasets that were
> used to originally evaluate BYTE_STREAM_SPLIT (see
> https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> the Google Doc linked there)
>
> 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
>
> 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> remain very good there as well
>
>
> I also have the following reservations towards ALP:
>
> 1. There is no published official spec AFAICT, just a research paper.
>
> 2. I may be missing something, but the paper doesn't seem to mention
> non-finite values (such as +/-Inf and NaNs).
>
> 3. It seems there is a single implementation, which is the one published
> together with the paper. It is not obvious that it will be
> maintained in the future, and reusing it is probably not an option for
> non-C++ Parquet implementations
>
> 4. The encoding itself is complex, since it involves a fallback on
> another encoding if the primary encoding (which constitutes the real
> innovation) doesn't work out on a piece of data.
>
>
> Based on this, I would say that if we think ALP is attractive for us,
> we may want to incorporate our own version of ALP with the following
> changes:
>
> 1. Design and write our own Parquet-ALP spec so that implementations
> know exactly how to encode and represent data
>
> 2. Do not include the ALPrd fallback which is a homegrown dictionary
> encoding without dictionary reuse across pages, and instead rely on a
> well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
>
> 3. Replace the FOR encoding inside ALP, which aims at compressing
> integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> same qualities and is already available in Parquet implementations)
>
> Regards
>
> Antoine.
>
>
> On Thu, 16 Oct 2025 14:47:33 -0700
> PRATEEK GAUR <[email protected]> wrote:
> > Hi team,
> >
> > We spent some time evaluating ALP compression and decompression
> > compared to other encoding alternatives like CHIMP/GORILLA and
> > compression techniques like SNAPPY/LZ4/ZSTD. We presented these
> > numbers to the community members on October 15th in the biweekly
> > parquet meeting. (I can't seem to access the recording, so please
> > let me know what access rules I need to get to be able to view it.)
> >
> > We did this evaluation over some datasets pointed to by the ALP
> > paper and some pointed to by the parquet community.
> >
> > The results are available in the following document
> > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
> > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >
> > Based on the numbers we see
> >
> > - ALP is comparable to ZSTD(level=1) in terms of compression ratio
> > and much better compared to other schemes. (numbers in the sheet
> > are bytes needed to encode each value)
> > - ALP is doing quite well in terms of decompression speed (numbers
> > in the sheet are bytes decompressed per second)
> >
> > As next steps we will
> >
> > - Get the numbers for compression on top of byte stream split.
> > - Evaluate the algorithm over a few more datasets.
> > - Have an implementation in the arrow-parquet repo.
> >
> > Looking forward to feedback from the community.
> >
> > Best
> > Prateek and Dhirhan
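
On the non-finite values question raised in the thread, here is a minimal Python sketch of how an ALP-style exception mechanism could catch NaN and +/-Inf. It is not taken from the paper's code, from Vortex, or from any Parquet proposal. The names `alp_encode`, `alp_decode` and `bits`, the fixed exponent/factor pair `e`/`f`, the placeholder 0 stored at exception positions, and the 2**53 range guard are all assumptions made for illustration. The idea is only that any value which does not survive the decimal round trip bit-for-bit is stored verbatim with its position.

```python
import math
import struct


def bits(x: float) -> int:
    """Raw IEEE 754 bit pattern, so NaN and -0.0 can be compared exactly."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def alp_encode(values, e, f):
    """Encode doubles as integers d = round(v * 10^e / 10^f).

    Values that do not decode back to the same bit pattern are kept
    verbatim in `exceptions` (position -> original double).
    """
    ints, exceptions = [], {}
    for i, v in enumerate(values):
        scaled = v * 10.0**e / 10.0**f
        d = 0  # hypothetical placeholder at exception positions
        ok = False
        # NaN and +/-Inf fail this guard and therefore become exceptions.
        if math.isfinite(scaled) and abs(scaled) < 2**53:
            d_try = round(scaled)
            ok = bits(d_try * 10.0**f / 10.0**e) == bits(v)
            if ok:
                d = d_try
        if not ok:
            exceptions[i] = v
        ints.append(d)
    return ints, exceptions


def alp_decode(ints, exceptions, e, f):
    """Invert alp_encode, then patch the exception positions."""
    out = [d * 10.0**f / 10.0**e for d in ints]
    for i, v in exceptions.items():
        out[i] = v
    return out


values = [1.5, 20.25, float("nan"), float("inf"), float("-inf"), -0.0]
ints, exceptions = alp_encode(values, e=2, f=0)
decoded = alp_decode(ints, exceptions, e=2, f=0)
```

In this sketch the bitwise comparison also routes -0.0 to the exception list, since it would otherwise decode as +0.0. Whether a Parquet-ALP spec should mandate this behavior is exactly the kind of detail a written spec would need to pin down.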
