Hi everyone,

Azim here, first author of the ALP paper.

Great to see ALP moving into Parquet. I wanted to share one
recommendation that may help when using BYTE_STREAM_SPLIT with ZSTD on
real double-precision data, based on what we learned while designing ALP_RD.

Recommendation:
When using BYTE_STREAM_SPLIT as the fallback for real double floating-point
columns, consider a design where, after BYTE_STREAM_SPLIT produces the byte
streams, only the two streams carrying the high-order bytes of each value
are compressed with ZSTD, rather than applying ZSTD to all byte streams.
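
To make this concrete, here is a minimal sketch in Python (illustrative
only, not a reference implementation; 'zstandard' is the pip package, and
the stream indices assume Parquet's little-endian byte split, which places
the two high-order bytes in the last two streams):

    import struct
    import zstandard

    def split_byte_streams(doubles):
        # BYTE_STREAM_SPLIT: byte k of every little-endian float64 goes
        # into stream k, so stream 7 holds the most significant bytes.
        raw = struct.pack(f"<{len(doubles)}d", *doubles)
        return [raw[k::8] for k in range(8)]

    def encode_selective(doubles):
        streams = split_byte_streams(doubles)
        cctx = zstandard.ZstdCompressor(level=1)
        encoded = []
        for k, stream in enumerate(streams):
            if k >= 6:  # sign/exponent/upper-mantissa bytes: compress
                encoded.append(cctx.compress(stream))
            else:       # lower mantissa bytes: leave uncompressed
                encoded.append(stream)
        return encoded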

Rationale:
During the design of ALP_RD we found that only the high-order bytes contain
meaningful, repeatable patterns. ALP_RD therefore focuses on the first 16
bits (2 bytes) and applies a specialized dictionary-style encoding that
captures redundancy in the sign, exponent, and upper mantissa bits. These
are the parts of a floating-point number where we consistently observed
compressible structure.
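
For intuition: the leading 16 bits of an IEEE 754 double are the sign bit,
the 11 exponent bits, and the top 4 mantissa bits, so values drawn from a
similar range tend to share them. A quick way to inspect this (illustrative
Python, hypothetical helper name):

    import struct

    def top16(x: float) -> int:
        # Big-endian packing puts the sign/exponent/upper-mantissa first.
        hi, = struct.unpack_from(">H", struct.pack(">d", x))
        return hi

    assert top16(17.34) == top16(17.99)  # identical leading 16 bits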

From this experience, I would expect that applying ZSTD only to those two
BYTE_STREAM_SPLIT streams would achieve compression ratios similar to (or
sometimes better than) ALP_RD, while avoiding compression of the remaining
byte streams (the lower mantissa bytes), which are effectively high-entropy
noise. ZSTD generally cannot compress these streams, and in some cases
compressing them actually increases the encoded size. Leaving those byte
streams uncompressed also improves decompression speed.
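
This is easy to verify per stream on your own data before committing to a
design. A sketch, reusing split_byte_streams from above and assuming the
column's doubles are in a Python list named values; ratios near or above
1.0 on the low mantissa streams mean ZSTD is wasted (or harmful) there:

    for k, stream in enumerate(split_byte_streams(values)):
        ratio = len(zstandard.compress(stream, 1)) / len(stream)
        print(f"stream {k}: compressed/raw = {ratio:.2f}")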

By focusing compression only on those two byte streams, you retain almost
all of the benefit that ALP_RD provided while keeping the fallback much
simpler and avoiding negative compression on the noisy byte streams.

Technical note:
Since ZSTD is a page-level compression codec and BYTE_STREAM_SPLIT is an
encoding, this selective approach cannot be expressed with the current
layering. However, if you consider introducing a new encoding (for example,
something like BYTE_STREAM_SPLIT_ZSTD), that encoding could internally
apply ZSTD only to the two high-order byte streams and leave the remaining
streams uncompressed.
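
Purely as an illustration of what such an encoding's page body could look
like (a hypothetical layout under the assumptions above, not a proposal for
the exact wire format):

    def encode_bss_zstd_page(doubles):
        # [len(stream 0)][stream 0, raw] ... [len(stream 7)][stream 7, ZSTD]
        # Length prefixes are needed because the compressed streams no
        # longer have a size derivable from the value count.
        out = bytearray()
        for k, stream in enumerate(split_byte_streams(doubles)):
            body = zstandard.compress(stream, 1) if k >= 6 else stream
            out += struct.pack("<I", len(body)) + body
        return bytes(out)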

Happy to share more details if useful.

Best,
Azim

On 2025/11/21 01:21:26 Prateek Gaur via dev wrote:
> Hi team,
>
> *ALP ---> ALP PseudoDecimal*
>
> As is visible from the numbers above, and as stated in the paper too, for
> real double values, i.e. the values with high precision points, it is very
> difficult to get a good compression ratio.
>
> This, combined with the fact that we want to keep the spec/implementation
> simpler, quoting Antoine directly here
>
> `*2. Do not include the ALPrd fallback which is a homegrown dictionary*
> *encoding without dictionary reuse across pages, and instead rely on a*
> *well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)*`
>
> Also based on some discussion I had with Julien in person and the biweekly
> meeting with a number of you.
>
> We'll be going with ALPpd (pseudo decimal) as the first
> implementation, relying on the query engine and its own heuristics to
> decide on the right fallback to BYTE_STREAM_SPLIT or ZSTD.
>
> Best
> Prateek
>
>
>
> On Thu, Nov 20, 2025 at 5:09 PM Prateek Gaur <[email protected]>
> wrote:
>
> > Sheet with numbers
> > <https://docs.google.com/spreadsheets/d/1NmCg0WZKeZUc6vNXXD8M3GIyNqF_H3goj6mVbT8at7A/edit?gid=1351944517#gid=1351944517>.
> >
> > On Thu, Nov 20, 2025 at 5:09 PM PRATEEK GAUR <[email protected]> wrote:
> >
> >> Hi team,
> >>
> >> There was a request from a few folks, Antoine Pitrou and Adam Reeve if I
> >> remember correctly, to perform the experiment on some of the papers that
> >> talked about BYTE_STREAM_SPLIT for completeness.
> >> I wanted to share the numbers for the same in this sheet. At this point
> >> we have numbers on a wide variety of data.
> >> (Will have to share the sheet from my snowflake account as our laptops
> >> have a fair bit of restriction with respect to copy-paste permissions :) )
> >>
> >> Best
> >> Prateek
> >>
> >> On Thu, Nov 20, 2025 at 2:25 PM PRATEEK GAUR <[email protected]> wrote:
> >>
> >>> Hi Julien,
> >>>
> >>> Yes based on
> >>>
> >>>    - Numbers presented
> >>>    - Discussions over the doc and
> >>>    - Multiple discussions in the biweekly meeting
> >>>
> >>> We are in a stage where we agree this is the right encoding to add and
> >>> we can move to the DRAFT/POC stage from DISCUSS stage.
> >>> Will start working on the PR for the same.
> >>>
> >>> Thanks for bringing this up.
> >>> Prateek
> >>>
> >>> On Thu, Nov 20, 2025 at 8:16 AM Julien Le Dem <[email protected]> wrote:
> >>>
> >>>> @PRATEEK GAUR <[email protected]> : Would you agree that we are past
> >>>> the DISCUSS step and into the DRAFT/POC phase according to the proposals
> >>>> process <https://github.com/apache/parquet-format/tree/master/proposals>?
> >>>> If yes, could you open a PR on this page to add this proposal to the
> >>>> list?
> >>>> https://github.com/apache/parquet-format/tree/master/proposals
> >>>> Thank you!
> >>>>
> >>>>
> >>>> On Thu, Oct 30, 2025 at 2:38 PM Andrew Lamb <[email protected]>
> >>>> wrote:
> >>>>
> >>>> > I have filed a ticket[1] in arrow-rs to track prototyping ALP in the
> >>>> > Rust Parquet reader if anyone is interested
> >>>> >
> >>>> > Andrew
> >>>> >
> >>>> > [1]:  https://github.com/apache/arrow-rs/issues/8748
> >>>> >
> >>>> > On Wed, Oct 22, 2025 at 1:33 PM Micah Kornfield <[email protected]>
> >>>> > wrote:
> >>>> >
> >>>> > > >
> >>>> > > > C++, Java and Rust support them for sure. I feel like we should
> >>>> > > > probably default to V2 at some point.
> >>>> > >
> >>>> > >
> >>>> > > I seem to recall, some of the vectorized java readers (Iceberg, Spark)
> >>>> > > might not support V2 data pages (but I might be confusing this with
> >>>> > > encodings).  But this is only a vague recollection.
> >>>> > >
> >>>> > >
> >>>> > >
> >>>> > > On Wed, Oct 22, 2025 at 6:38 AM Andrew Lamb <[email protected]>
> >>>> > > wrote:
> >>>> > >
> >>>> > > > > Someone has to add V2 data pages to
> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> >>>> > > > > :)
> >>>> > > >
> >>>> > > > Your wish is my command:
> >>>> > > > https://github.com/apache/parquet-site/pull/124
> >>>> > > >
> >>>> > > > As the format grows in popularity and momentum builds to evolve, I feel
> >>>> > > > the content on the parquet.apache.org site could use refreshing / updating.
> >>>> > > > So, while I had the site open, I made some other PRs to scratch various
> >>>> > > > itches
> >>>> > > >
> >>>> > > > (I am absolutely 🎣 for someone to please review 🙏):
> >>>> > > >
> >>>> > > > 1. Add Variant/Geometry/Geography types to implementation status
> >>>> > > > matrix: https://github.com/apache/parquet-site/pull/123
> >>>> > > > 2. Improve introduction / overview, add more links to spec and
> >>>> > > > implementation status: https://github.com/apache/parquet-site/pull/125
> >>>> > > >
> >>>> > > >
> >>>> > > > Thanks,
> >>>> > > > Andrew
> >>>> > > >
> >>>> > > > On Wed, Oct 22, 2025 at 4:09 AM Antoine Pitrou <[email protected]>
> >>>> > > > wrote:
> >>>> > > >
> >>>> > > > >
> >>>> > > > > Hi Julien, hi all,
> >>>> > > > >
> >>>> > > > > On Mon, 20 Oct 2025 15:14:58 -0700
> >>>> > > > > Julien Le Dem <[email protected]> wrote:
> >>>> > > > > >
> >>>> > > > > > Another question from me:
> >>>> > > > > >
> >>>> > > > > > Since the goal is to not use compression at all in this case (no ZSTD)
> >>>> > > > > > I'm assuming we would be using either:
> >>>> > > > > > - the Data Page V1 with UNCOMPRESSED in the ColumnMetadata.column
> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L887>
> >>>> > > > > > field.
> >>>> > > > > > - the Data Page V2 with false in the DataPageHeaderV2.is_compressed
> >>>> > > > > > <https://github.com/apache/parquet-format/blob/786142e26740487930ddc3ec5e39d780bd930907/src/main/thrift/parquet.thrift#L746>
> >>>> > > > > > field
> >>>> > > > > > The second helping decide if we can selectively compress some pages if
> >>>> > > > > > they are less compressed by the
> >>>> > > > > > A few years ago there was a question on the support of the DATA_PAGE_V2 and
> >>>> > > > > > I was curious to hear a refresh on how that's generally supported in
> >>>> > > > > > Parquet implementations. The is_compressed field was exactly intended to
> >>>> > > > > > avoid block compression when the encoding itself is good enough.
> >>>> > > > >
> >>>> > > > > Someone has to add V2 data pages to
> >>>> > > > > https://github.com/apache/parquet-site/blob/production/content/en/docs/File%20Format/implementationstatus.md
> >>>> > > > > :)
> >>>> > > > >
> >>>> > > > > C++, Java and Rust support them for sure. I feel like we should
> >>>> > > > > probably default to V2 at some point.
> >>>> > > > >
> >>>> > > > > Also see https://github.com/apache/parquet-java/issues/3344 for Java.
> >>>> > > > >
> >>>> > > > > Regards
> >>>> > > > >
> >>>> > > > > Antoine.
> >>>> > > > >
> >>>> > > > >
> >>>> > > > > >
> >>>> > > > > > Julien
> >>>> > > > > >
> >>>> > > > > > On Mon, Oct 20, 2025 at 11:57 AM Andrew Lamb <[email protected]> wrote:
> >>>> > > > > >
> >>>> > > > > > > Thanks again Prateek and co for pushing this along!
> >>>> > > > > > >
> >>>> > > > > > >
> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
> >>>> > > > > > > > implementations know exactly how to encode and represent data
> >>>> > > > > > >
> >>>> > > > > > > 100% agree with this (similar to what was done for ParquetVariant)
> >>>> > > > > > >
> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
> >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
> >>>> > > > > > >
> >>>> > > > > > > I think they are handled via the "Exception" mechanism. Vortex's
> >>>> > > > > > > ALP implementation (below) does appear to handle non-finite numbers[2]
> >>>> > > > > > >
> >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
> >>>> > > > > > > > published together with the paper. It is not obvious that it will be
> >>>> > > > > > > > maintained in the future, and reusing it is probably not an option for
> >>>> > > > > > > > non-C++ Parquet implementations
> >>>> > > > > > >
> >>>> > > > > > > My understanding from the call was that Prateek and team
> >>>> > > > > > > re-implemented ALP (did not use the implementation from CWI[3]) but
> >>>> > > > > > > that would be good to confirm.
> >>>> > > > > > >
> >>>> > > > > > > There is also a Rust implementation of ALP[1] that is part of the
> >>>> > > > > > > Vortex file format implementation. I have not reviewed it to see if it
> >>>> > > > > > > deviates from the algorithm presented in the paper.
> >>>> > > > > > >
> >>>> > > > > > > Andrew
> >>>> > > > > > >
> >>>> > > > > > > [1]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/lib.rs
> >>>> > > > > > > [2]: https://github.com/vortex-data/vortex/blob/534821969201b91985a8735b23fc0c415a425a56/encodings/alp/src/alp/compress.rs#L266-L281
> >>>> > > > > > > [3]: https://github.com/cwida/ALP
> >>>> > > > > > >
> >>>> > > > > > >
> >>>> > > > > > > On Mon, Oct 20, 2025 at 4:47 AM Antoine Pitrou <[email protected]> wrote:
> >>>> > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > Hello,
> >>>> > > > > > > >
> >>>> > > > > > > > Thanks for doing this and I agree the numbers look impressive.
> >>>> > > > > > > >
> >>>> > > > > > > > I would ask if possible for more data points:
> >>>> > > > > > > >
> >>>> > > > > > > > 1. More datasets: you could for example look at the datasets that were
> >>>> > > > > > > > used to originally evaluate BYTE_STREAM_SPLIT (see
> >>>> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1622 and specifically
> >>>> > > > > > > > the Google Doc linked there)
> >>>> > > > > > > >
> >>>> > > > > > > > 2. Comparison to BYTE_STREAM_SPLIT + LZ4 and BYTE_STREAM_SPLIT + ZSTD
> >>>> > > > > > > >
> >>>> > > > > > > > 3. Optionally, some perf numbers on x86 too, but I expect that ALP
> >>>> > > > > > > > will remain very good there as well
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > I also have the following reservations towards ALP:
> >>>> > > > > > > >
> >>>> > > > > > > > 1. There is no published official spec AFAICT, just a research paper.
> >>>> > > > > > > >
> >>>> > > > > > > > 2. I may be missing something, but the paper doesn't seem to
> >>>> > > > > > > > mention non-finite values (such as +/-Inf and NaNs).
> >>>> > > > > > > >
> >>>> > > > > > > > 3. It seems there is a single implementation, which is the one
> >>>> > > > > > > > published together with the paper. It is not obvious that it will be
> >>>> > > > > > > > maintained in the future, and reusing it is probably not an option for
> >>>> > > > > > > > non-C++ Parquet implementations
> >>>> > > > > > > >
> >>>> > > > > > > > 4. The encoding itself is complex, since it involves a fallback on
> >>>> > > > > > > > another encoding if the primary encoding (which constitutes the real
> >>>> > > > > > > > innovation) doesn't work out on a piece of data.
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > Based on this, I would say that if we think ALP is attractive for us,
> >>>> > > > > > > > we may want to incorporate our own version of ALP with the following
> >>>> > > > > > > > changes:
> >>>> > > > > > > >
> >>>> > > > > > > > 1. Design and write our own Parquet-ALP spec so that
> >>>> > > > > > > > implementations know exactly how to encode and represent data
> >>>> > > > > > > >
> >>>> > > > > > > > 2. Do not include the ALPrd fallback which is a homegrown dictionary
> >>>> > > > > > > > encoding without dictionary reuse across pages, and instead rely on a
> >>>> > > > > > > > well-known Parquet encoding (such as BYTE_STREAM_SPLIT?)
> >>>> > > > > > > >
> >>>> > > > > > > > 3. Replace the FOR encoding inside ALP, which aims at compressing
> >>>> > > > > > > > integers efficiently, with our own DELTA_BINARY_PACKED (which has the
> >>>> > > > > > > > same qualities and is already available in Parquet implementations)
> >>>> > > > > > > >
> >>>> > > > > > > > Regards
> >>>> > > > > > > >
> >>>> > > > > > > > Antoine.
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > >
> >>>> > > > > > > > On Thu, 16 Oct 2025 14:47:33 -0700
> >>>> > > > > > > > PRATEEK GAUR <[email protected]> wrote:
> >>>> > > > > > > > > Hi team,
> >>>> > > > > > > > >
> >>>> > > > > > > > > We spent some time evaluating ALP compression and decompression
> >>>> > > > > > > > > compared to other encoding alternatives like CHIMP/GORILLA and
> >>>> > > > > > > > > compression techniques like SNAPPY/LZ4/ZSTD. We presented these
> >>>> > > > > > > > > numbers to the community members on October 15th in the biweekly
> >>>> > > > > > > > > parquet meeting. ( I can't seem to access the recording, so please
> >>>> > > > > > > > > let me know what access rules I need to get to be able to view it )
> >>>> > > > > > > > >
> >>>> > > > > > > > > We did this evaluation over some datasets pointed to by the ALP
> >>>> > > > > > > > > paper and some pointed to by the parquet community.
> >>>> > > > > > > > >
> >>>> > > > > > > > > The results are available in the following document
> >>>> > > > > > > > > <https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg/edit?tab=t.0>:
> >>>> > > > > > > > > https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
> >>>> > > > > > > > >
> >>>> > > > > > > > > Based on the numbers we see
> >>>> > > > > > > > >
> >>>> > > > > > > > >    - ALP is comparable to ZSTD(level=1) in terms of compression
> >>>> > > > > > > > >    ratio and much better compared to other schemes. (numbers in the
> >>>> > > > > > > > >    sheet are bytes needed to encode each value )
> >>>> > > > > > > > >    - ALP doing quite well in terms of decompression speed (numbers
> >>>> > > > > > > > >    in the sheet are bytes decompressed per second)
> >>>> > > > > > > > >
> >>>> > >
[message truncated...]
