Hi Michael,
Taking a quick scan of the repo, it seems there is only a C++
implementation of SZ3?  If so, I don't think it is a good candidate
unless the algorithm is easy to port to other languages.

As a secondary concern, I don't think BYTE_STREAM_SPLIT has been widely
adopted yet, because it requires user action to enable.  Before adding even
more encoding options, I think we as a community need to figure out a
policy for enabling encodings that we believe are broadly useful.  I worry
about the utility vs. maintainability of adding a lossy compression
encoding: this is something we could NEVER reasonably enable by default, so
it isn't clear how much adoption it would gain.

Also, it is a little hard to compare the lossy encodings to the lossless
ones since they are in different tables, but if I am reading the report
correctly, one of the main advantages is substantially better compression
(I might have missed it, but I didn't see speed numbers reported for the
lossy compressors)?  If so, I wonder whether other lossless techniques
could help close the gap for the target use case: e.g. compression
techniques better suited to bit streams, or perhaps the lossy values would
compress better under BYTE_STREAM_SPLIT, so the lossy transform could be a
preprocessing step instead of a new encoding (see the sketch below).
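
To make the preprocessing idea concrete, here is a rough sketch of the
BYTE_STREAM_SPLIT transform for 32-bit floats (illustrative names, not the
actual parquet-cpp API); error-bounded quantized values could be passed
through something like this before the usual lossless codec:

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <vector>

  // Illustrative sketch, not the parquet-cpp implementation: gather byte 0
  // of every float, then byte 1, and so on, so a lossless codec sees four
  // homogeneous byte streams instead of interleaved sign/exponent/mantissa
  // bytes.
  std::vector<uint8_t> byte_stream_split(const std::vector<float>& values) {
    const size_t n = values.size();
    std::vector<uint8_t> out(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
      uint8_t bytes[sizeof(float)];
      std::memcpy(bytes, &values[i], sizeof(float));
      for (size_t b = 0; b < sizeof(float); ++b) {
        out[b * n + i] = bytes[b];  // stream b occupies out[b*n, b*n + n)
      }
    }
    return out;
  }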

> Note that introducing lossy compression could introduce a situation where
> values like min and max in the statistics page might not be found in the
> decompressed data. There are probably other considerations here that I've
> missed.

This is already the case today for other types (mainly byte_array) because
of potential truncation.
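
For concreteness, a rough sketch of how a writer can truncate a byte_array
max statistic (hypothetical helper, not the actual parquet-cpp API); the
stored bound only has to compare >= the true max, so it may be a value that
never occurs in the data:

  #include <cstddef>
  #include <cstdint>
  #include <string>

  // Hypothetical helper, not the parquet-cpp API: cap a max statistic at
  // `len` bytes. The last kept byte is incremented so the bound still
  // compares >= the true max; the result need not appear in the column.
  std::string truncate_max(const std::string& true_max, size_t len) {
    if (true_max.size() <= len) return true_max;
    std::string bound = true_max.substr(0, len);
    for (size_t i = bound.size(); i-- > 0;) {
      if (static_cast<uint8_t>(bound[i]) != 0xFF) {
        bound[i] = static_cast<char>(static_cast<uint8_t>(bound[i]) + 1);
        bound.resize(i + 1);
        return bound;
      }
    }
    return true_max;  // all bytes 0xFF: keep the full value
  }

For example, truncate_max("banana", 3) stores "bao", which bounds "banana"
but appears nowhere in the data.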

Thanks,
Micah

On Fri, Nov 3, 2023 at 8:06 AM Michael Bernardi <[email protected]>
wrote:

> Dear all,
>
> Colleagues at the Technical University of Munich and I are interested in
> adding a new lossy compression algorithm to the Parquet format to support
> the compression of floating point data. This is a continuation of the work
> by Martin Radev. Here are some related links:
>
> Email thread:
> https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
>
> This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> encoding, which allows the lossless compression algorithms to better
> compress floating point data.
>
> Martin's report also investigated lossy compressors, which can be supplied
> with an error bound and, depending on this bound, deliver much higher
> compression ratios for similar computing time. The SZ compression library
> was found to be quite promising, but it was discounted at the time due to
> issues with thread safety and the API being immature. In the meantime these
> issues have largely been resolved and it's now possible to use SZ with HDF5
> (see the link below). Therefore I'd like to reconsider adding it (or
> another similar algorithm) to Parquet.
>
>
> https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
>
> Whatever lossy compression method we choose, it would probably have to be
> implemented as a Parquet encoding rather than a compression, for a couple
> of reasons:
>
> 1) The algorithm can only compress a flat buffer of floating point data.
> It is therefore not possible to use it for whole-file compression; it must
> be applied to individual columns.
> 2) If it were implemented as a compression, it would operate on pages that
> have already been encoded, and the underlying encodings would make the
> floating point values unreadable to the algorithm.
>
> Note that introducing lossy compression could introduce a situation where
> values like min and max in the statistics page might not be found in the
> decompressed data. There are probably other considerations here that I've
> missed.
>
> I look forward to reading your response.
>
> Best regards,
> Michael Bernardi