Hello all,
I agree with the sentiments expressed by Micah:

* a lossy algorithm is more difficult to reason about for users;
* it cannot be enabled by default, for obvious reasons;
* the min/max statistics values should remain correct, that is: min should
  be a lower bound, max an upper bound;
* adding niche encodings does not seem particularly attractive for the
  Parquet ecosystem and the maintainers of the various Parquet
  implementations.

I would add that the encoding should either be very easy to understand and
implement (such as BYTE_STREAM_SPLIT), or already well-established in the
software ecosystem.

Given the above, I also think there should be clear proof that this encoding
brings very significant benefits over the status quo. I would suggest a
comparison between the following combinations:

* PLAIN encoding
* PLAIN encoding + lz4 (or snappy)
* PLAIN encoding + zstd
* BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
* BYTE_STREAM_SPLIT encoding + zstd
* SZ encoding
* SZ encoding + lz4 (or snappy)
* SZ encoding + zstd

The comparison should show the compression ratio, encoding+compression speed,
and decompression+decoding speed (a rough sketch of such a benchmark for the
lossless combinations is appended below the quoted message).

Regards

Antoine.


On Fri, 3 Nov 2023 15:04:29 +0000
Michael Bernardi <[email protected]> wrote:

> Dear all,
>
> Myself and others at the Technical University of Munich are interested in
> adding a new lossy compression algorithm to the Parquet format to support
> the compression of floating-point data. This is a continuation of the work
> by Martin Radev. Here are some related links:
>
> Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> encoding, which allows lossless compression algorithms to better compress
> floating-point data.
>
> Martin's report also investigated lossy compressors which can be supplied
> with an error bound and, depending on this bound, deliver much higher
> compression ratios for similar computing time. The SZ compression library
> was found to be quite promising, but it was discounted at the time due to
> issues with thread safety and the API being immature. In the meantime these
> issues have largely been resolved and it's now possible to use SZ with HDF5
> (see the link below). Therefore I'd like to reconsider adding it (or
> another similar algorithm) to Parquet.
>
> https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
>
> Whatever lossy compression method we choose, it would probably have to be
> implemented as a Parquet encoding rather than a compression, for a couple
> of reasons:
>
> 1) The algorithm can only compress a flat buffer of floating-point data. It
> is therefore not possible to use it for whole-file compression; it must be
> applied to individual columns only.
> 2) If it were implemented as a compression, it would conflict with the
> underlying encodings, which would make the floating-point values unreadable
> to the algorithm.
>
> Note that introducing lossy compression could create a situation where
> values such as min and max in the page statistics might not be found in the
> decompressed data. There are probably other considerations here that I've
> missed.
>
> I look forward to reading your response.
>
> Best regards,
> Michael Bernardi
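P.S.: for the lossless combinations above, a quick-and-dirty benchmark along
the following lines could serve as a starting point. This is only a sketch
using pyarrow with random float32 values as a stand-in; a real comparison
should use representative scientific datasets, and the SZ rows would have to
be measured separately with the SZ3 library since it is not part of any
Parquet implementation today.

import io
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in data; substitute representative floating-point datasets here.
table = pa.table({"x": np.random.rand(10_000_000).astype(np.float32)})

def bench(compression, byte_stream_split):
    buf = io.BytesIO()
    t0 = time.perf_counter()
    # use_dictionary=False yields PLAIN (or BYTE_STREAM_SPLIT) for the column.
    pq.write_table(table, buf, compression=compression,
                   use_dictionary=False,
                   use_byte_stream_split=byte_stream_split)
    write_s = time.perf_counter() - t0
    size = buf.tell()
    buf.seek(0)
    t0 = time.perf_counter()
    pq.read_table(buf)
    read_s = time.perf_counter() - t0
    print(f"{compression:>6}  byte_stream_split={byte_stream_split!s:<5} "
          f"ratio={table.nbytes / size:5.2f}  "
          f"encode+compress={write_s:.2f}s  decompress+decode={read_s:.2f}s")

for compression in ("none", "snappy", "zstd"):
    for bss in (False, True):
        bench(compression, bss)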
