Hi team,

A few folks, Antoine Pitrou and Adam Reeve if I remember correctly, asked us
to run the experiment on datasets from some of the papers that discuss
BYTE_STREAM_SPLIT, for completeness.
I wanted to share those numbers in this sheet. At this point we have numbers
on a wide variety of data.
(I will have to share the sheet from my Snowflake account, as our laptops have
a fair number of restrictions on copy/paste permissions :) )

Best
Prateek

On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]> wrote:

> Hi Julien,
>
> Yes, based on
>
>    - the numbers presented,
>    - the discussions over the doc, and
>    - multiple discussions in the biweekly meeting,
>
> we are at a stage where we agree this is the right encoding to add, and we
> can move from the DISCUSS stage to the DRAFT/POC stage.
> I will start working on the PR for this.
>
> Thanks for bringing this up.
> Prateek
>
> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]> wrote:
>
>> @PRATEEK GAUR <[email protected]> : Would you agree that we are past
>> the DISCUSS step and into the DRAFT/POC phase according to the proposals
>> process <https://github.com/apache/parquet-format/tree/master/proposals>?
>> If yes, could you open a PR on this page to add this proposal to the list?
>> https://github.com/apache/parquet-format/tree/master/proposals
>> Thank you!
>>
>>
>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]>
>> wrote:
>>
>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the Rust
>> > Parquet reader if anyone is interested
>> >
>> > Andrew
>> >
>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
>> >
>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <[email protected]>
>> > wrote:
>> >
>> > > >
>> > > > C++, Java and Rust support them for sure. I feel like we should
>> > > > probably default to V2 at some point.
>> > >
>> > >
>> > > I seem to recall, some of the vectorized java readers (Iceberg, Spark)
>> > > might not support V2 data pages (but I might be confusing this with
>> > > encodings).  But this is only a vague recollection.
>> > >
>> > >
>> > >
>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]>
>> > > wrote:
>> > >
>> > > > > Someone has to add V2 data pages to
>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> > > > > :)
>> > > >
>> > > > Your wish is my command:
>> > > > https://github.com/apache/parquet-site/pull/124
>> > > >
>> > > > As the format grows in popularity and momentum builds to evolve, I feel
>> > > > the content on the parquet.apache.org site could use refreshing /
>> > > > updating. So, while I had the site open, I made some other PRs to
>> > > > scratch various itches
>> > > >
>> > > > (I am absolutely 🎣 for someone to please review 🙏):
>> > > >
>> > > > 1. Add Variant/Geometry/Geography types to implementation status matrix:
>> > > > https://github.com/apache/parquet-site/pull/123
>> > > > 2. Improve introduction / overview, add more links to spec and
>> > > > implementation status:
>> > > > https://github.com/apache/parquet-site/pull/125
>> > > >
>> > > >
>> > > > Thanks,
>> > > > Andrew
>> > > >
>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]>
>> > > wrote:
>> > > >
>> > > > >
>> > > > > Hi Julien, hi all,
>> > > > >
>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
>> > > > > Julien Le Dem <[email protected]> wrote:
>> > > > > >
>> > > > > > Another question from me:
>> > > > > >
>> > > > > > Since the goal is to not use compression at all in this case (no ZSTD)
>> > > > > > I'm assuming we would be using either:
>> > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
>> > > > > > field.
>> > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
>> > > > > > field
>> > > > > > The second helping decide if we can selectively compress some pages if
>> > > > > > they are less compressed by the
>> > > > > > A few years ago there was a question on the support of the DATA_PAGE_V2
>> > > > > > and I was curious to hear a refresh on how that's generally supported in
>> > > > > > Parquet implementations. The is_compressed field was exactly intended to
>> > > > > > avoid block compression when the encoding itself is good enough.
>> > > > >
>> > > > > Someone has to add V2 data pages to
>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
>> > > > > :)
>> > > > >
>> > > > > C++, Java and Rust support them for sure. I feel like we should
>> > > > > probably default to V2 at some point.
>> > > > >
>> > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
>> > > > >
>> > > > > Regards
>> > > > >
>> > > > > Antoine.
>> > > > >
>> > > > >
>> > > > > >
>> > > > > > Julien
>> > > > > >
>> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb
>> > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > Thanks again Prateek and co for pushing this along!
>> > > > > > >
>> > > > > > >
>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
>> > > > > > > > know exactly how to encode and represent data
>> > > > > > >
>> > > > > > > 100% agree with this (similar to what was done for ParquetVariant)
>> > > > > > >
>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
>> > > > > > > > non-finite values (such as +/-Inf and NaNs).
>> > > > > > >
>> > > > > > > I think they are handled via the "Exception" mechanism. Vortex's ALP
>> > > > > > > implementation (below) does appear to handle finite numbers[2]
>> > > > > > >
>> > > > > > > > 3. It seems there is a single implementation, which is the one published
>> > > > > > > > together with the paper. It is not obvious that it will be
>> > > > > > > > maintained in the future, and reusing it is probably not an option for
>> > > > > > > > non-C++ Parquet implementations
>> > > > > > >
>> > > > > > > My understanding from the call was that Prateek and team re-implemented
>> > > > > > > ALP (did not use the implementation from CWI[3]) but that would be good to
>> > > > > > > confirm.
>> > > > > > >
>> > > > > > > There is also a Rust implementation of ALP[1] that is part of the Vortex
>> > > > > > > file format implementation. I have not reviewed it to see if it deviates
>> > > > > > > from the algorithm presented in the paper.
>> > > > > > >
>> > > > > > > Andrew
>> > > > > > >
>> > > > > > > [1]:
>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
>> > > > > > > [2]:
>> > > > > > > https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
>> > > > > > > [3]: https://github.com/cwida/ALP
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou
>> > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > Thanks for doing this and I agree the numbers look impressive.
>> > > > > > > >
>> > > > > > > > I would ask if possible for more data points:
>> > > > > > > >
>> > > > > > > > 1. More datasets: you could for example look at the datasets that were
>> > > > > > > > used to originally evaluate BYTE_STREAM_SPLIT (see
>> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
>> > > > > > > > the Google Doc linked there)
>> > > > > > > >
>> > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
>> > > > > > > >
>> > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP will
>> > > > > > > > remain very good there as well
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I also have the following reservations towards ALP:
>> > > > > > > >
>> > > > > > > > 1. There is no published official spec AFAICT, just a research paper.
>> > > > > > > >
>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to mention
>> > > > > > > > non-finite values (such as +/-Inf and NaNs).
>> > > > > > > >
>> > > > > > > > 3. It seems there is a single implementation, which is the one published
>> > > > > > > > together with the paper. It is not obvious that it will be
>> > > > > > > > maintained in the future, and reusing it is probably not an option for
>> > > > > > > > non-C++ Parquet implementations
>> > > > > > > >
>> > > > > > > > 4. The encoding itself is complex, since it involves a fallback on
>> > > > > > > > another encoding if the primary encoding (which constitutes the real
>> > > > > > > > innovation) doesn't work out on a piece of data.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Based on this, I would say that if we think ALP is attractive for us,
>> > > > > > > > we may want to incorporate our own version of ALP with the following
>> > > > > > > > changes:
>> > > > > > > >
>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that implementations
>> > > > > > > > know exactly how to encode and represent data
>> > > > > > > >
>> > > > > > > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
>> > > > > > > > encoding without dictionary reuse across pages, and instead rely on a
>> > > > > > > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
>> > > > > > > >
>> > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
>> > > > > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
>> > > > > > > > same qualities and is already available in Parquet implementations)
>> > > > > > > >
>> > > > > > > > Regards
>> > > > > > > >
>> > > > > > > > Antoine.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
>> > > > > > > > PRATEEK GAUR <[email protected]> wrote:
>> > > > > > > > > Hi team,
>> > > > > > > > >
>> > > > > > > > > We spent some time evaluating ALP compression and decompression compared to
>> > > > > > > > > other encoding alternatives like CHIMP/GORILLA and compression techniques
>> > > > > > > > > like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
>> > > > > > > > > on October 15th in the biweekly parquet meeting. (I can't seem to access
>> > > > > > > > > the recording, so please let me know what access rules I need to get to be
>> > > > > > > > > able to view it.)
>> > > > > > > > >
>> > > > > > > > > We did this evaluation over some datasets pointed to by the ALP paper and
>> > > > > > > > > some pointed to by the parquet community.
>> > > > > > > > >
>> > > > > > > > > The results are available in the following document
>> > > > > > > > > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
>> > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
>> > > > > > > > >
>> > > > > > > > > Based on the numbers we see
>> > > > > > > > >
>> > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of compression ratio and
>> > > > > > > > >    much better compared to other schemes. (Numbers in the sheet are bytes
>> > > > > > > > >    needed to encode each value.)
>> > > > > > > > >    - ALP is doing quite well in terms of decompression speed. (Numbers in
>> > > > > > > > >    the sheet are bytes decompressed per second.)
>> > > > > > > > >
>> > > > > > > > > As next steps we will
>> > > > > > > > >
>> > > > > > > > >    - Get the numbers for compression on top of byte stream split.
>> > > > > > > > >    - Evaluate the algorithm over a few more datasets.
>> > > > > > > > >    - Have an implementation in the arrow-parquet repo.
>> > > > > > > > >
>> > > > > > > > > Looking forward to feedback from the community.
>> > > > > > > > >
>> > > > > > > > > Best
>> > > > > > > > > Prateek and Dhirhan
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
