Wing Yew mentioned compression today - should we switch the default from snappy to zstd?
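If we want numbers before touching the default, this is cheap to trial per writer. A minimal sketch with pyarrow (the toy column and file names are placeholders, not a real benchmark):

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy float64 column; use a representative sample of real data instead.
    table = pa.table({"values": pa.array(range(1_000_000), type=pa.float64())})

    # Current default in many writers: snappy.
    pq.write_table(table, "sample_snappy.parquet", compression="snappy")

    # Candidate default: zstd. Level 1 keeps encode cost low; higher
    # levels trade write speed for compression ratio.
    pq.write_table(table, "sample_zstd.parquet",
                   compression="zstd", compression_level=1)

    for path in ("sample_snappy.parquet", "sample_zstd.parquet"):
        print(path, os.path.getsize(path), "bytes")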
Just an FYI: the Parquet community is looking at a different encoding for float data, Adaptive Lossless Floating-Point compression (ALP). AI datasets are really encouraging a focus here. These results don't immediately argue for adopting it and switching; it depends on the decode performance.

---------- Forwarded message ---------
From: PRATEEK GAUR <[email protected]>
Date: Thu, 16 Oct 2025 at 22:49
Subject: [Parquet] ALP Encoding for Floating point data
To: <[email protected]>

Hi team,

We spent some time evaluating ALP compression and decompression against other encoding alternatives such as CHIMP/GORILLA and compression techniques such as SNAPPY/LZ4/ZSTD. We presented these numbers to the community members in the biweekly Parquet meeting on October 15th. (I can't seem to access the recording, so please let me know what access rules I need in order to view it.)

We ran this evaluation over some datasets pointed to by the ALP paper and some suggested by the Parquet community. The results are available in the following document: https://docs.google.com/document/d/1PlyUSfqCqPVwNt8XA-CfRqsbc0NKRG0Kk1FigEm3JOg

Based on the numbers we see:
- ALP is comparable to ZSTD (level=1) in compression ratio and much better than the other schemes (numbers in the sheet are bytes needed to encode each value).
- ALP does quite well in decompression speed (numbers in the sheet are bytes decompressed per second).

As next steps we will:
- Get the numbers for compression on top of byte stream split.
- Evaluate the algorithm over a few more datasets.
- Provide an implementation in the arrow-parquet repo.

Looking forward to feedback from the community.

Best,
Prateek and Dhirhan
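For anyone who hasn't read the ALP paper, the core trick is that most real-world doubles are decimals in disguise: scale by a power of ten, round to an integer, and keep the value only if it reconstructs bit-exactly; anything that never round-trips is stored verbatim as an exception. Here is a toy Python sketch of just that round-trip test (heavily simplified; real ALP searches exponents in a vectorized pass, frame-of-reference-encodes and bit-packs the integers, and decodes with multiplications, none of which is shown here):

    # Toy illustration of ALP's core idea, not the real implementation.
    def alp_encode(values, max_exp=14):
        encoded = []      # (exponent, scaled integer) per encodable value
        exceptions = {}   # index -> raw double for values that never round-trip
        for i, v in enumerate(values):
            for e in range(max_exp + 1):
                n = round(v * 10 ** e)
                if n / 10 ** e == v:   # bit-exact reconstruction?
                    encoded.append((e, n))
                    break
            else:
                exceptions[i] = v      # e.g. a genuinely random double
                encoded.append(None)
        return encoded, exceptions

    def alp_decode(encoded, exceptions):
        return [exceptions[i] if item is None else item[1] / 10 ** item[0]
                for i, item in enumerate(encoded)]

    vals = [1.25, 0.1, 300.0, 3.141592653589793]
    enc, exc = alp_encode(vals)
    assert alp_decode(enc, exc) == vals   # lossless either way

The exception path is what keeps the scheme lossless on arbitrary data; decimal-heavy columns just end up with few exceptions and compress well.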

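On their byte-stream-split next step: that baseline is reproducible today, since pyarrow already exposes BYTE_STREAM_SPLIT as a column encoding. A rough sketch for getting bytes-per-value numbers in the same units as their sheet (synthetic data and file names are placeholders):

    import os
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Synthetic float64 column; swap in one of the evaluation datasets
    # for numbers that actually mean something.
    table = pa.table({"values": np.random.default_rng(0).normal(size=1_000_000)})

    variants = {
        # plain pages + zstd, roughly the ZSTD(level=1) row in the sheet
        "zstd": dict(compression="zstd", compression_level=1),
        # byte-stream-split the floats first, then zstd the shuffled bytes
        "bss_zstd": dict(compression="zstd", compression_level=1,
                         use_dictionary=False,
                         column_encoding={"values": "BYTE_STREAM_SPLIT"}),
    }

    for name, kwargs in variants.items():
        path = f"bench_{name}.parquet"
        pq.write_table(table, path, **kwargs)
        print(f"{name}: {os.path.getsize(path) / table.num_rows:.2f} bytes/value")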