Hello all,

I agree with the sentiments expressed by Micah.

* a lossy algorithm is more difficult to reason about for users;
* it cannot be enabled by default, for obvious reasons;
* the min/max statistics values should remain correct, that is: min
  should be a lower bound and max an upper bound for the decoded values
  (a sketch of one way to preserve this follows the list);
* adding niche encodings does not seem particularly attractive for the
  Parquet ecosystem and the maintainers of the various Parquet
  implementations.
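
On the statistics point, here is a minimal sketch of one way a writer
could keep min/max valid, assuming the lossy encoder guarantees an
absolute error bound `eps` (the function name and setup are hypothetical,
not part of any Parquet implementation):

    import numpy as np

    def widened_statistics(original: np.ndarray, eps: float) -> tuple[float, float]:
        # Any reconstruction differs from the originals by at most eps,
        # so it is guaranteed to fall inside the widened interval:
        # min stays a lower bound and max an upper bound of the decoded values.
        return float(original.min()) - eps, float(original.max()) + eps

    values = np.array([0.1, 2.5, -3.7], dtype=np.float32)
    lo, hi = widened_statistics(values, eps=1e-3)
    assert lo <= values.min() and values.max() <= hi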

I would add that the encoding should either be very easy to understand
and implement (such as BYTE_STREAM_SPLIT), or already well-established
in the software ecosystem.
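
For illustration, here is a rough sketch of the BYTE_STREAM_SPLIT idea
for float32 columns (my own toy version, not taken from any Parquet
implementation): the k-th byte of every value is gathered into its own
stream, which general-purpose codecs typically compress better than the
interleaved PLAIN layout.

    import numpy as np

    def byte_stream_split(values: np.ndarray) -> bytes:
        # View the float32 array as an (n, 4) byte matrix and transpose it,
        # so all first bytes come first, then all second bytes, and so on.
        return values.astype(np.float32).view(np.uint8).reshape(-1, 4).T.tobytes()

    def byte_stream_join(data: bytes, count: int) -> np.ndarray:
        # Inverse transform: undo the transposition, reinterpret as float32.
        b = np.frombuffer(data, dtype=np.uint8).reshape(4, count)
        return b.T.copy().view(np.float32).reshape(count)

    values = np.array([1.0, 2.5, -3.25], dtype=np.float32)
    assert np.array_equal(byte_stream_join(byte_stream_split(values), len(values)), values)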

Given the above, I also think there should be clear evidence that this
encoding brings very significant benefits over the status quo. I would
suggest a comparison between the following combinations:

* PLAIN encoding
* PLAIN encoding + lz4 (or snappy)
* PLAIN encoding + zstd
* BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
* BYTE_STREAM_SPLIT encoding + zstd
* SZ encoding
* SZ encoding + lz4 (or snappy)
* SZ encoding + zstd

The comparison should show the compression ratio,
encoding+compression speed, and decompression+decoding speed.
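
For the combinations pyarrow already supports, something like the sketch
below could collect those numbers; the SZ rows would need a separate
harness, the `measure` helper is hypothetical, and the exact keyword
names may differ between pyarrow versions.

    import io
    import time
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    def measure(table: pa.Table, compression: str, byte_stream_split: bool):
        buf = io.BytesIO()
        t0 = time.perf_counter()
        pq.write_table(
            table, buf,
            compression=compression,
            use_dictionary=False,                     # dictionary would bypass the column encoding
            use_byte_stream_split=byte_stream_split,  # PLAIN when False
        )
        write_s = time.perf_counter() - t0
        t0 = time.perf_counter()
        pq.read_table(io.BytesIO(buf.getvalue()))
        read_s = time.perf_counter() - t0
        return len(buf.getvalue()), write_s, read_s

    table = pa.table({"x": np.random.default_rng(0).normal(size=1_000_000).astype(np.float32)})
    for codec in ("none", "snappy", "zstd"):
        for bss in (False, True):
            size, w, r = measure(table, codec, bss)
            enc = "BYTE_STREAM_SPLIT" if bss else "PLAIN"
            print(f"{enc:>18} + {codec:<6} {size:>10} B  encode {w:.3f}s  decode {r:.3f}s")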

Regards

Antoine.



On Fri, 3 Nov 2023 15:04:29 +0000
Michael Bernardi <[email protected]> wrote:
> Dear all,
> 
> Colleagues and I at the Technical University of Munich are interested in adding 
> a new lossy compression algorithm to the Parquet format to support the 
> compression of floating point data. This is a continuation of the work by 
> Martin Radev. Here are some related links:
> 
> Email thread: https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> Report: https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> 
> This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT 
> encoding, which allows the lossless compression algorithms to better compress 
> floating point data.
> 
> Martin's report also investigated lossy compressors that can be supplied 
> with an error bound and, depending on this bound, deliver much higher 
> compression ratios for similar computing time. The SZ compression library was 
> found to be quite promising, but it was discounted at the time due to issues 
> with thread safety and an immature API. In the meantime these issues have 
> largely been resolved and it is now possible to use SZ with HDF5 (see the 
> link below). I would therefore like to reconsider adding it (or a similar 
> algorithm) to Parquet.
> 
> https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
> 
> Whatever lossy compression method we choose, it would probably have to be 
> implemented as a Parquet encoding rather than a compression codec, for a 
> couple of reasons:
> 
> 1) The algorithm can only compress a flat buffer of floating point data, so 
> it cannot be used for whole-file compression and must be applied to 
> individual columns.
> 2) If it were implemented as a compression codec, it would operate on data 
> that has already passed through an encoding, which would make the floating 
> point values unreadable to the algorithm.
> 
> Note that lossy compression could lead to a situation where values such as 
> min and max in the statistics might no longer appear in the decompressed 
> data. There are probably other considerations here that I have missed.
> 
> I look forward to reading your response.
> 
> Best regards,
> Michael Bernardi
> 
> 


