Wing Yew mentioned compression today - should we switch the default from
snappy to zstd?

Just an FYI: the Parquet community is looking at a different encoding for
float data, Adaptive Lossless Floating Point (ALP) compression.
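
For background, ALP's core bet is that many stored doubles started life as
decimals: scale by a power of ten, round to an integer, and keep the value
only if it decodes back to the exact same bits; everything else is stored
verbatim as an exception, and the integers are then bit-packed. Below is a
minimal Python sketch of that idea, with illustrative helper names (real
ALP also samples per vector, uses a second factor to strip trailing zeros,
and decodes via multiplication rather than division):

    def alp_encode(values, max_exp=18, max_exc_frac=0.05):
        # Try successively larger powers of ten until (almost) every
        # value survives an exact round trip through integer space.
        for e in range(max_exp + 1):
            scale = 10.0 ** e          # exact as a double for e <= 22
            ints, exceptions = [], {}
            for pos, v in enumerate(values):
                n = round(v * scale)
                if n / scale == v:     # bit-exact round-trip check
                    ints.append(n)
                else:                  # keep the raw value as an exception
                    ints.append(0)
                    exceptions[pos] = v
            if len(exceptions) <= max_exc_frac * len(values):
                return e, ints, exceptions
        return None  # no exponent works; real ALP falls back to ALP_RD

    def alp_decode(e, ints, exceptions):
        scale = 10.0 ** e              # must mirror the encoder's check
        out = [n / scale for n in ints]
        for pos, v in exceptions.items():
            out[pos] = v
        return out

    data = [1.23, 4.56, 7.89, 1000.01]
    e, ints, exc = alp_encode(data)
    assert alp_decode(e, ints, exc) == data   # lossless by construction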

AI datasets are really encouraging a focus here; these results don't
immediately argue for adopting it and switching the default... it depends
on the decode performance.

---------- Forwarded message ---------
From: PRATEEK GAUR <[email protected]>
Date: Thu, 16 Oct 2025 at 22:49
Subject: [Parquet] ALP Encoding for Floating point data
To: <[email protected]>


Hi team,

We spent some time evaluating ALP compression and decompression against
other encoding alternatives like CHIMP/GORILLA and compression techniques
like SNAPPY/LZ4/ZSTD. We presented these numbers to community members in
the biweekly Parquet meeting on October 15th. (I can't seem to access the
recording, so please let me know what access I need to be able to view
it.)

We ran this evaluation over some datasets pointed to by the ALP paper and
some suggested by the Parquet community.

The results are available in the following document:
https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers, we see:

   - ALP is comparable to ZSTD (level=1) in terms of compression ratio and
   much better than the other schemes (numbers in the sheet are bytes
   needed to encode each value).
   - ALP does quite well in terms of decompression speed (numbers in the
   sheet are bytes decompressed per second); both metrics are sketched in
   code below.
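
For anyone wanting to reproduce the two metrics for one of the
general-purpose codecs, here is a minimal sketch using the python
zstandard bindings; the random stand-in data is illustrative only (it
barely compresses), whereas our numbers come from the datasets listed in
the document above:

    import time
    import numpy as np
    import zstandard as zstd   # pip install numpy zstandard

    values = np.random.default_rng(0).normal(size=1_000_000)  # stand-in
    raw = values.tobytes()

    compressed = zstd.ZstdCompressor(level=1).compress(raw)
    print("bytes per value:", len(compressed) / len(values))  # metric 1

    dctx = zstd.ZstdDecompressor()
    start = time.perf_counter()
    out = dctx.decompress(compressed)
    elapsed = time.perf_counter() - start
    assert out == raw
    print("decompressed bytes/sec:", len(raw) / elapsed)      # metric 2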

As next steps we will:

   - Get the numbers for compression on top of byte stream split (a sketch
   of the transform follows this list).
   - Evaluate the algorithm over a few more datasets.
   - Contribute an implementation to the arrow-parquet repo.
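
For context, Parquet's BYTE_STREAM_SPLIT encoding regroups the bytes of
each float so a downstream codec sees the highly similar sign/exponent
bytes next to each other. A minimal numpy sketch of the transform (the
helper names are ours, not the spec's):

    import numpy as np

    def byte_stream_split(values: np.ndarray) -> bytes:
        # Emit byte 0 of every value, then byte 1, and so on, so the
        # similar exponent bytes of nearby floats end up adjacent.
        b = values.view(np.uint8).reshape(len(values), values.itemsize)
        return b.T.tobytes()

    def byte_stream_unsplit(data: bytes, dtype=np.float64) -> np.ndarray:
        itemsize = np.dtype(dtype).itemsize
        n = len(data) // itemsize
        b = np.frombuffer(data, dtype=np.uint8).reshape(itemsize, n)
        return np.ascontiguousarray(b.T).view(dtype).reshape(n)

    x = np.array([1.5, 2.5, 3.5, 1000.01])
    assert np.array_equal(byte_stream_unsplit(byte_stream_split(x)), x)

The plan is then to compress the split streams with the codecs above and
compare against ALP.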

Looking forward to feedback from the community.

Best
Prateek and Dhirhan
