Dear all,

Colleagues and I at the Technical University of Munich are interested in adding 
a new lossy compression algorithm to the Parquet format to support the 
compression of floating-point data. This is a continuation of the work by 
Martin Radev. Here are some related links:

Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv

This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT 
encoding, which allows the lossless compression algorithms to compress 
floating-point data more effectively.

Martin's report also investigated lossy compressors that accept an error bound 
and, depending on that bound, deliver much higher compression ratios for 
similar compute time. The SZ compression library was found to be quite 
promising, but it was set aside at the time due to issues with thread safety 
and an immature API. These issues have since largely been resolved, and it is 
now possible to use SZ with HDF5 (see the link below). I would therefore like 
to reconsider adding it (or a similar algorithm) to Parquet.

https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
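For the sake of discussion, the core idea behind error-bounded lossy 
compression can be reduced to a quantizer like the toy sketch below. (This is 
greatly simplified relative to what SZ actually does, which adds prediction 
and entropy coding on top; the function names here are made up for 
illustration.)

```python
def abs_error_quantize(values, abs_bound):
    """Toy absolute-error-bounded quantizer (not SZ itself): map each
    value to an integer bucket of width 2*abs_bound. The small integers
    compress far better than raw floats."""
    step = 2.0 * abs_bound
    return [round(v / step) for v in values]

def abs_error_reconstruct(codes, abs_bound):
    """Reconstruct bucket centers; each result is within abs_bound of
    the original value (up to floating-point rounding)."""
    step = 2.0 * abs_bound
    return [c * step for c in codes]
```

The user-supplied error bound is what lets these algorithms trade accuracy 
for compression ratio, which lossless codecs fundamentally cannot do.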

Whatever lossy compression method we choose, it would probably have to be 
implemented as a Parquet encoding rather than a compression codec, for a 
couple of reasons:

1) The algorithm can only compress a flat buffer of floating-point data. It 
therefore cannot be used for whole-file compression and must be applied to 
individual columns.
2) If it were implemented as a compression codec, it would conflict with the 
underlying encodings, which would render the floating-point values unreadable 
to the algorithm.

Note that introducing lossy compression could create a situation where values 
like min and max in the statistics page are not found in the decompressed 
data. There are probably other considerations here that I've missed.
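To make the statistics concern concrete, here is a toy illustration (using a 
stand-in quantizer, not SZ) where the stored min, computed on the original 
values, no longer appears in the decoded data:

```python
def lossy_roundtrip(values, abs_bound):
    # Toy stand-in for an error-bounded codec: snap each value to the
    # nearest multiple of 2*abs_bound (stays within abs_bound of the
    # original, up to floating-point rounding).
    step = 2.0 * abs_bound
    return [round(v / step) * step for v in values]

page = [0.37, 1.29, 2.84]
stats_min, stats_max = min(page), max(page)  # computed pre-compression
decoded = lossy_roundtrip(page, abs_bound=0.05)
# stats_min (0.37) is absent from decoded: readers that rely on the
# statistics page would need to tolerate this.
```

Either the statistics would have to be computed on the reconstructed values, 
or readers would need to treat them as approximate bounds.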

I look forward to reading your response.

Best regards,
Michael Bernardi
