Hey Steve,

Regarding

>Wing Yew mentioned compression today - should we switch the default from
>snappy to zstd?

New tables created with the Iceberg reference implementation actually
default to ZSTD already, as of 1.4.0. I'd need to double-check what the
other implementations do, but I'd be surprised if they're any different.
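
If you want to pin the codec explicitly rather than rely on whichever
default a given writer ships, it's a one-line table property change.
Here's a minimal sketch against the Iceberg Java API (the catalog and
table identifier are placeholders you'd supply;
TableProperties.PARQUET_COMPRESSION is the constant behind
write.parquet.compression-codec):

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Print the table's codec property, then pin it to ZSTD explicitly.
static void pinZstd(Catalog catalog, TableIdentifier id) {
    Table table = catalog.loadTable(id);

    // Tables created before 1.4.0 may carry no explicit property and
    // fall back to whatever default the writing engine ships with.
    String codec = table.properties()
            .getOrDefault(TableProperties.PARQUET_COMPRESSION, "<unset>");
    System.out.println("write.parquet.compression-codec = " + codec);

    // Setting the property removes any ambiguity across implementations.
    table.updateProperties()
            .set(TableProperties.PARQUET_COMPRESSION, "zstd")
            .commit();
}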
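
For anyone skimming the forwarded ALP thread below: the core trick, as I
read the paper (this is a paraphrase, not the proposed Parquet
implementation), is a bit-exact round-trip test. A double is encoded as
a small integer only if scaling by a power of ten and rounding can be
exactly inverted; everything else is stored verbatim as an exception.
Roughly:

import java.util.OptionalLong;

// Returns the integer encoding of "value" at the given power-of-ten
// exponent, or empty if the value must be stored as an exception.
static OptionalLong alpDigits(double value, int exponent) {
    double scale = Math.pow(10, exponent);   // e.g. 10^2 for two decimals
    double scaled = value * scale;

    // Reject NaN/infinity and anything outside the exact-integer
    // range of a double (the !(...) form is deliberately NaN-safe).
    if (!(Math.abs(scaled) < (double) (1L << 52))) {
        return OptionalLong.empty();
    }

    long digits = Math.round(scaled);

    // Bit-exact round-trip check: only losslessly decodable values
    // are encoded; everything else becomes an exception.
    double decoded = (double) digits / scale;
    return Double.doubleToRawLongBits(decoded)
                    == Double.doubleToRawLongBits(value)
            ? OptionalLong.of(digits)
            : OptionalLong.empty();
}

The real algorithm samples each vector to choose the best
exponent/factor pair and then frame-of-reference encodes and bit-packs
the integers, which is presumably where the decode-speed numbers below
come from.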

Thanks,
Amogh Jahagirdar

On Fri, Jan 30, 2026 at 12:42 PM Steve Loughran <[email protected]> wrote:

>
> Wing Yew mentioned compression today - should we switch the default from
> snappy to zstd?
>
> Just an FYI, Parquet are looking at a different encoding for float
> data: Adaptive Lossless Floating Point Compression (ALP).
>
> AI datasets are really encouraging a focus here; these results don't
> immediately argue for adoption and a switch...it depends on the decode
> performance.
>
> ---------- Forwarded message ---------
> From: PRATEEK GAUR <[email protected]>
> Date: Thu, 16 Oct 2025 at 22:49
> Subject: [Parquet] ALP Encoding for Floating point data
> To: <[email protected]>
>
>
> Hi team,
>
> We spent some time evaluating ALP compression and decompression against
> other encoding alternatives like CHIMP/GORILLA and compression techniques
> like SNAPPY/LZ4/ZSTD. We presented these numbers to the community members
> on October 15th in the biweekly Parquet meeting. (I can't seem to access
> the recording, so please let me know what access I need to view it.)
>
> We did this evaluation over datasets referenced in the ALP paper and some
> suggested by the Parquet community.
>
> The results are available in the following document:
>
> https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg
>
> Based on the numbers we see:
>
>    - ALP is comparable to ZSTD (level=1) in terms of compression ratio
>    and much better than the other schemes (numbers in the sheet are
>    bytes needed to encode each value).
>    - ALP does quite well in terms of decompression speed (numbers in
>    the sheet are bytes decompressed per second).
>
> As next steps we will:
>
>    - Get the numbers for compression on top of byte-stream-split encoding.
>    - Evaluate the algorithm over a few more datasets.
>    - Provide an implementation in the arrow-parquet repo.
>
> Looking forward to feedback from the community.
>
> Best
> Prateek and Dhirhan
>
