Thanks everyone.

I'm gonna try and answer as many questions as I can in this reply.


   - [Antoine] More datasets : Thanks we'll look into this.
   - [Antoine + Adam] ByteStreamSplit + ZSTD : Added another column to this
   sheet
   
<https://docs.google.com/spreadsheets/d/1H1EPRobuzyELPLIcd__db6aC2Hhnr2ux/edit?gid=2806261#gid=2806261>
   with this info.
   - [Antoine] Numbers on x86 : Ack, i'll run the same experiment on an
   intel machine (will try to get to it before EOW)


More queries

   - [Antoine + Andew + Julien] Official spec : Yes I'll try and start one
   and then I'm gonna request help from everyone to make it perfect :).
   - [Antoine + Andrew] Non-finite values : One tags them as exceptions.
   - [Antoine] ALPrd -> ByteStreamSplit : Actually I really like this idea.
   I did not see a major perf difference or compression ratio improvement when
   one has to use the ALPrd route. What do you think @andrew.lamb, @Julien
   Le Dem <[email protected]>, @[email protected]
   <[email protected]> , @[email protected]
   <[email protected]>, @adam.reeve
   - [Antoine] FOR -> DELTA_BINARY_PACKED : So DELTA is significantly
   slower than FOR based on numbers I have seen. FOR is very simple and can be
   further improved in decompression speed. I like the FOR approach.
   - [Julien] BtrBlocks : I really like the flexibility that this brings in
   but we'll have to come up with a good spec for it to be broadly applicable.
   We can discuss more.


Best
Prateek

On Mon, Oct 20, 2025 at 3:15 PM Julien Le Dem <[email protected]> wrote:

> >
> >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> >
> > 100% agree with this (similar to what was done for ParquetVariant)
> >
> Seconded!
>
> > 4. The encoding itself is complex, since it involves a fallback on
> > another encoding if the primary encoding (which constitutes the real
> > innovation) doesn't work out on a piece of data.
>
> We had a discussion on how to layer/chain/stack encodings in the call. A la
> BTRblocks. We could have a general mechanism to clarify how to reuse an
> encoding and make it generally flexible to compose encodings in ways that
> would be a bit more flexible that the current way (instead of
> being constrained  to an enum for the encoding, we could allow a bit more
> metadata). For example it has been discussed to use FastLanes to store the
> integers produced by ALP (the current prototype uses bitpacking). I
> understand, this is what Vortex does.
>
> >
> >
> >
>
> Another question from me:
>
> Since the goal is to not use compression at all in this case (no ZSTD)
> I'm assuming we would be using either:
> - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887
> >
> field.
> - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> <
> https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746
> >
> field
> The second helping decide if we can selectively compress some pages if they
> are less compressed by the
> A few years ago there was a question on the support of the DATA_PAGE_V2 and
> I was curious to hear a refresh on how that's generally supported in
> Parquet implementations. The is_compressed field was exactly intended to
> avoid block compression when the encoding itself is good enough.
>
> Julien
>
> On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]>
> wrote:
>
> > Thanks again Prateek and co for pushing this along!
> >
> >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> >
> > 100% agree with this (similar to what was done for ParquetVariant)
> >
> > > 2. I may be missing something, but the paper doesn't seem to mention
> > non-finite values (such as +/-Inf and NaNs).
> >
> > I think they are handled via the "Exception" mechanism. Vortex's ALP
> > implementation (below) does appear to handle finite numbers[2]
> >
> > > 3. It seems there is a single implementation, which is the one
> published
> > > together with the paper. It is not obvious that it will be
> > > maintained in the future, and reusing it is probably not an option for
> > > non-C++ Parquet implementations
> >
> > My understanding from the call was that Prateek and team re-implemented
> > ALP  (did not use the implementation from CWI[3]) but that would be good
> to
> > confirm.
> >
> > There is also a Rust implementation of ALP[1] that is part of the Vortex
> > file format implementation. I have not reviewed it to see if it deviates
> > from the algorithm presented in the paper.
> >
> > Andrew
> >
> > [1]:
> >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> > [2]:
> >
> >
> https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> > [3]: https://github.com/cwida/ALP
> >
> >
> > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]>
> wrote:
> >
> > >
> > > Hello,
> > >
> > > Thanks for doing this and I agree the numbers look impressive.
> > >
> > > I would ask if possible for more data points:
> > >
> > > 1. More datasets: you could for example look at the datasets that were
> > > used to originally evalute BYTE_STREAM_SPLIT (see
> > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> > > the Google Doc linked there)
> > >
> > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> > >
> > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
> > > remain very good there as well
> > >
> > >
> > > I also have the following reservations towards ALP:
> > >
> > > 1. There is no published official spec AFAICT, just a research paper.
> > >
> > > 2. I may be missing something, but the paper doesn't seem to mention
> > > non-finite values (such as +/-Inf and NaNs).
> > >
> > > 3. It seems there is a single implementation, which is the one
> published
> > > together with the paper. It is not obvious that it will be
> > > maintained in the future, and reusing it is probably not an option for
> > > non-C++ Parquet implementations
> > >
> > > 4. The encoding itself is complex, since it involves a fallback on
> > > another encoding if the primary encoding (which constitutes the real
> > > innovation) doesn't work out on a piece of data.
> > >
> > >
> > > Based on this, I would say that if we think ALP is attractive for us,
> > > we may want to incorporate our own version of ALP with the following
> > > changes:
> > >
> > > 1. Design and write our own Parquet-ALP spec so that implementations
> > > know exactly how to encode and represent data
> > >
> > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> > > encoding without dictionary reuse accross pages, and instead rely on a
> > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> > >
> > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> > > same qualities and is already available in Parquet implementations)
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On Thu, 16 Oct 2025 14:47:33 -0700
> > > PRATEEK GAUR <[email protected]> wrote:
> > > > Hi team,
> > > >
> > > > We spent some time evaluating ALP compression and decompression
> > compared
> > > to
> > > > other encoding alternatives like CHIMP/GORILLA and compression
> > techniques
> > > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community
> > members
> > > > on October 15th in the biweekly parquet meeting. ( I can't seem to
> > access
> > > > the recording, so please let me know what access rules I need to get
> to
> > > be
> > > > able to view it )
> > > >
> > > > We did this evaluation over some datasets pointed by the ALP paper
> and
> > > some
> > > > pointed by the parquet community.
> > > >
> > > > The results are available in the following document
> > > > <
> > >
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0
> > > >
> > > > :
> > > >
> > >
> >
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> > > >
> > > > Based on the numbers we see
> > > >
> > > >    -  ALP is comparable to ZSTD(level=1) in terms of compression
> ratio
> > > and
> > > >    much better compared to other schemes. (numbers in the sheet are
> > bytes
> > > >    needed to encode each value )
> > > >    - ALP going quite well in terms of decompression speed (numbers in
> > the
> > > >    sheet are bytes decompressed per second)
> > > >
> > > > As next steps we will
> > > >
> > > >    - Get the numbers for compression on top of byte stream split.
> > > >    - Evaluate the algorithm over a few more datasets.
> > > >    - Have an implementation in the arrow-parquet repo.
> > > >
> > > > Looking forward to feedback from the community.
> > > >
> > > > Best
> > > > Prateek and Dhirhan
> > > >
> > >
> > >
> > >
> > >
> >
>

Reply via email to