Hi Michael,

Taking a quick scan of the repo, it seems that there is only a C++ implementation of SZ3? If so, I don't think this is a good candidate unless the algorithm is easy to port to other languages.
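For context, the interface there is templated C++ over raw pointers. Going by the README, usage looks roughly like the sketch below; I haven't compiled it, and the identifiers are my paraphrase of their docs, so treat them as assumptions to verify against the repo:

    // Rough sketch of SZ3's C++ API, paraphrased from its README. Untested;
    // the identifiers (SZ3::Config, SZ3::EB_ABS, SZ_compress, SZ_decompress)
    // are assumptions to check against the repo.
    #include "SZ3/api/sz.hpp"
    #include <vector>

    void roundtrip(const std::vector<float> &column) {
        SZ3::Config conf(column.size());    // 1-D flat buffer of floats
        conf.errorBoundMode = SZ3::EB_ABS;  // absolute error-bound mode
        conf.absErrorBound = 1e-4;          // user-supplied bound

        size_t cmpSize = 0;
        char *cmp = SZ_compress(conf, column.data(), cmpSize);

        float *dec = new float[column.size()];
        SZ_decompress(conf, cmp, cmpSize, dec);
        // Each dec[i] is within 1e-4 of column[i], but not bit-identical,
        // which is what makes this lossy rather than lossless.

        delete[] cmp;
        delete[] dec;
    }

Porting that (plus the prediction and quantization machinery behind it) to the other Parquet implementations is the part I'd want to see scoped out before it goes into the format.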
As a secondary concern, I don't think BYTE_STREAM_SPLIT has been widely adopted yet, because it requires user action to enable. Before adding even more encoding options, I think we as a community need to figure out a policy for enabling encodings that we believe are broadly useful. I worry about the utility-versus-maintainability trade-off of adding a lossy compression encoding: this is something we could NEVER reasonably enable by default, so it isn't clear how much adoption it would gain.

Also, it is a little hard to compare the lossy encodings to the lossless ones since they are in different tables, but if I am reading the report correctly, one of the main advantages is substantially better compression? (I might have missed it, but I didn't see speed numbers reported for the lossy compressors.) If so, I wonder whether there are other lossless techniques that could help close the gap for the target use case: e.g., compression techniques better suited to bit streams, or perhaps the lossy values would compress better under BYTE_STREAM_SPLIT, in which case the lossy transform could be a preprocessing step instead of a new encoding. A rough sketch of that last idea is below.
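Concretely, something like this toy illustration (untested; a uniform scalar quantizer stands in for a real error-bounded coder like SZ, and the byte split is hand-rolled rather than the actual Parquet encoder):

    // Toy sketch: bound the error first, then shuffle bytes so byte k of
    // every value lands in stream k (the BYTE_STREAM_SPLIT idea), which
    // general-purpose codecs such as zstd tend to compress much better.
    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Snap each value to a grid of width 2*errorBound, so the reconstruction
    // error stays within the bound. A real coder (SZ et al.) would add
    // prediction and entropy coding on top of this.
    std::vector<float> quantize(const std::vector<float> &v, float errorBound) {
        const float step = 2.0f * errorBound;
        std::vector<float> out(v.size());
        for (size_t i = 0; i < v.size(); ++i)
            out[i] = std::round(v[i] / step) * step;  // |out[i]-v[i]| <= errorBound
        return out;
    }

    // BYTE_STREAM_SPLIT-style shuffle of a flat float buffer.
    std::vector<uint8_t> byteStreamSplit(const std::vector<float> &v) {
        const size_t n = v.size();
        std::vector<uint8_t> out(n * sizeof(float));
        for (size_t i = 0; i < n; ++i) {
            uint8_t b[sizeof(float)];
            std::memcpy(b, &v[i], sizeof(float));
            for (size_t k = 0; k < sizeof(float); ++k)
                out[k * n + i] = b[k];  // stream k gets byte k of every value
        }
        return out;
    }

Feeding byteStreamSplit(quantize(column, bound)) to a general-purpose codec would tell us how much of SZ's advantage survives a much simpler pipeline that needs no new encoding.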
> Note that introducing lossy compression could create a situation where
> values like min and max in the statistics page might not be found in the
> decompressed data. There are probably other considerations here that I've
> missed.

This is already the case today for other types (mainly byte_array) because of potential truncation.

Thanks,
Micah

On Fri, Nov 3, 2023 at 8:06 AM Michael Bernardi <[email protected]> wrote:

> Dear all,
>
> Myself and others at the Technical University of Munich are interested in
> adding a new lossy compression algorithm to the Parquet format, to support
> the compression of floating-point data. This is a continuation of the work
> by Martin Radev. Here are some related links:
>
> Email thread:
> https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> encoding, which allows the lossless compression algorithms to better
> compress floating-point data.
>
> Martin's report also investigated lossy compressors, which can be supplied
> with an error bound and, depending on that bound, deliver much higher
> compression ratios for similar computing time. The SZ compression library
> was found to be quite promising, but it was discounted at the time due to
> issues with thread safety and an immature API. In the meantime these
> issues have largely been resolved, and it is now possible to use SZ with
> HDF5 (see the link below). Therefore I'd like to reconsider adding it (or
> another similar algorithm) to Parquet.
>
> https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
>
> Whatever lossy compression method we choose, it would probably have to be
> implemented as a Parquet encoding rather than a compression, for a couple
> of reasons:
>
> 1) The algorithm can only compress a flat buffer of floating-point data.
> It is therefore not possible to use it for whole-file compression; it must
> be applied to individual columns.
> 2) If it were implemented as a compression, it would conflict with the
> underlying encodings, which would make the floating-point values
> unreadable to the algorithm.
>
> Note that introducing lossy compression could create a situation where
> values like min and max in the statistics page might not be found in the
> decompressed data. There are probably other considerations here that I've
> missed.
>
> I look forward to reading your response.
>
> Best regards,
> Michael Bernardi
