Dear all,

Colleagues at the Technical University of Munich and I are interested in adding a new lossy compression algorithm to the Parquet format to support the compression of floating-point data. This is a continuation of the work by Martin Radev. Here are some related links:
Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv

That work ultimately resulted in the addition of the BYTE_STREAM_SPLIT encoding, which allows the lossless compression algorithms to compress floating-point data more effectively. Martin's report also investigated lossy compressors that can be supplied with an error bound and, depending on that bound, deliver much higher compression ratios for similar computation time. The SZ compression library was found to be quite promising, but it was ruled out at the time because of issues with thread safety and an immature API. In the meantime these issues have largely been resolved, and it is now possible to use SZ with HDF5 (see the link below). I would therefore like to reconsider adding it (or another, similar algorithm) to Parquet.

https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3

Whichever lossy compression method we choose, it would probably have to be implemented as a Parquet encoding rather than a compression, for a couple of reasons:

1) The algorithm can only compress a flat buffer of floating-point data. It therefore cannot be used for whole-file compression and must be applied to individual columns only.
2) If it were implemented as a compression, it would be applied after the underlying encoding step, which would leave the floating-point values unreadable to the algorithm.

Note that introducing lossy compression could lead to situations where values such as the min and max in the statistics page are not found in the decompressed data, since statistics computed from the original values may not survive the lossy round trip exactly (a small toy example illustrating this is included as a postscript below). There are probably other considerations here that I have missed.

I look forward to reading your responses.

Best regards,
Michael Bernardi
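
P.S. To make the error-bound point concrete, below is a small, self-contained C++ toy. It is not SZ itself, just uniform scalar quantization to a user-supplied absolute error bound (SZ adds prediction and entropy coding on top, but its absolute-error mode gives the same kind of per-value guarantee). It also shows why the min/max recorded in column statistics may not literally reappear in the decompressed data. The bound of 1e-2 and the sample values are arbitrary.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Toy error-bounded "compressor": round each value to the nearest
    // multiple of 2*bound, so that |decoded[i] - original[i]| <= bound.
    std::vector<int64_t> quantize(const std::vector<double> &v, double bound) {
        std::vector<int64_t> q(v.size());
        for (size_t i = 0; i < v.size(); ++i)
            q[i] = static_cast<int64_t>(std::llround(v[i] / (2.0 * bound)));
        return q;
    }

    std::vector<double> dequantize(const std::vector<int64_t> &q, double bound) {
        std::vector<double> v(q.size());
        for (size_t i = 0; i < q.size(); ++i)
            v[i] = static_cast<double>(q[i]) * (2.0 * bound);
        return v;
    }

    int main() {
        const double bound = 1e-2;  // user-supplied absolute error bound
        std::vector<double> column = {3.14159, 2.71828, 1.41421, 0.57721};

        auto decoded = dequantize(quantize(column, bound), bound);

        // Every decoded value is within the bound of the original...
        for (size_t i = 0; i < column.size(); ++i)
            std::cout << column[i] << " -> " << decoded[i]
                      << " (err " << std::abs(column[i] - decoded[i]) << ")\n";

        // ...but the min recorded in column statistics (computed on the
        // originals) need not appear among the decompressed values.
        double orig_min = *std::min_element(column.begin(), column.end());
        double dec_min  = *std::min_element(decoded.begin(), decoded.end());
        std::cout << "stats min " << orig_min
                  << " vs decoded min " << dec_min << "\n";
        return 0;
    }

With bound = 1e-2, for example, 0.57721 decodes to 0.58, so a statistics min of 0.57721 would not be found among the decompressed values even though every value is within the requested error bound.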
