Dear all,

Colleagues at the Technical University of Munich and I are interested in adding a new lossy compression algorithm to the Parquet format to support the compression of floating-point data. This is a continuation of the work by Martin Radev. Here are some related links:
Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv

That work ultimately resulted in the addition of the BYTE_STREAM_SPLIT encoding, which allows the lossless compression algorithms to compress floating-point data more effectively. Martin's report also investigated lossy compressors that can be supplied with an error bound and, depending on that bound, deliver much higher compression ratios for similar computation time. The SZ compression library was found to be quite promising, but it was ruled out at the time because of issues with thread safety and an immature API. In the meantime these issues have largely been resolved, and it is now possible to use SZ with HDF5 (see the link below). I would therefore like to reconsider adding it (or another, similar algorithm) to Parquet.

https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3

Whichever lossy compression method we choose, it would probably have to be implemented as a Parquet encoding rather than a compression, for a couple of reasons:

1) The algorithm can only compress a flat buffer of floating-point data. It therefore cannot be used for whole-file compression and must be applied to individual columns only.
2) If it were implemented as a compression, it would be applied after the underlying encoding step, which would leave the floating-point values unreadable to the algorithm.

Note that introducing lossy compression could lead to situations where values such as the min and max in the statistics page are not found in the decompressed data, since statistics computed from the original values may not survive the lossy round trip exactly (a small toy example illustrating this is included as a postscript below). There are probably other considerations here that I have missed.

I look forward to reading your responses.

Best regards,
Michael Bernardi
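
P.S. To make the error-bound point concrete, below is a small, self-contained C++ toy. It is not SZ itself, just uniform scalar quantization to a user-supplied absolute error bound (SZ adds prediction and entropy coding on top, but its absolute-error mode gives the same kind of per-value guarantee). It also shows why the min/max recorded in column statistics may not literally reappear in the decompressed data. The bound of 1e-2 and the sample values are arbitrary.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Toy error-bounded "compressor": round each value to the nearest
    // multiple of 2*bound, so that |decoded[i] - original[i]| <= bound.
    std::vector<int64_t> quantize(const std::vector<double> &v, double bound) {
        std::vector<int64_t> q(v.size());
        for (size_t i = 0; i < v.size(); ++i)
            q[i] = static_cast<int64_t>(std::llround(v[i] / (2.0 * bound)));
        return q;
    }

    std::vector<double> dequantize(const std::vector<int64_t> &q, double bound) {
        std::vector<double> v(q.size());
        for (size_t i = 0; i < q.size(); ++i)
            v[i] = static_cast<double>(q[i]) * (2.0 * bound);
        return v;
    }

    int main() {
        const double bound = 1e-2;  // user-supplied absolute error bound
        std::vector<double> column = {3.14159, 2.71828, 1.41421, 0.57721};

        auto decoded = dequantize(quantize(column, bound), bound);

        // Every decoded value is within the bound of the original...
        for (size_t i = 0; i < column.size(); ++i)
            std::cout << column[i] << " -> " << decoded[i]
                      << " (err " << std::abs(column[i] - decoded[i]) << ")\n";

        // ...but the min recorded in column statistics (computed on the
        // originals) need not appear among the decompressed values.
        double orig_min = *std::min_element(column.begin(), column.end());
        double dec_min  = *std::min_element(decoded.begin(), decoded.end());
        std::cout << "stats min " << orig_min
                  << " vs decoded min " << dec_min << "\n";
        return 0;
    }

With bound = 1e-2, for example, 0.57721 decodes to 0.58, so a statistics min of 0.57721 would not be found among the decompressed values even though every value is within the requested error bound.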
