Hi all,

Thank you for your thoughts.
In the meantime I've done further experiments and decided against using SZ in
its current state. The library has improved, but while experimenting with it
I've noticed some issues which make it a poor fit for inclusion in Parquet.

To Gang's questions: SZ has reasonable performance and supports OpenMP. It can
be used with HDF5, but that support is provided by means of an external
plugin. Supporting SZ in Parquet through such a plugin would be the better
approach, if it were possible; I no longer think it would be a good idea to
include SZ in Parquet itself.

All the lossy floating point compressors I've investigated work via some form
of mathematical transformation or prediction, followed by quantisation,
encoding, and finally lossless compression. In our use case of single-variable
time series data, quantisation and encoding alone (skipping the
transformation/prediction stages) often provide most of the benefit anyway.
Therefore I'd like to investigate adding a simple encoding scheme which
quantises and encodes the floating point values while respecting a
user-defined error bound (a rough sketch of the idea is appended after the
quoted thread below). Similar to the BYTE_STREAM_SPLIT encoding (also sketched
at the end of this message), this would leverage the lossless compression
algorithms already present in Parquet.

If and when I have something that is easy to integrate and useful, I'll make a
new proposal.

Regards,
Michael

________________________________
From: Gang Wu <[email protected]>
Sent: Monday, 13 November 2023 7:03:41 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Lossy compression of floating point data

Hi Michael,

It seems that the datasets used in the experiment are very large. The
compression unit in Parquet is a page, which usually contains 20,000 values or
no more than 1 MB in terms of raw size. So I agree with Antoine that we need
more real-case experiments to compare against what Parquet can currently
offer.

Other than that, the report says some floating point compressors do not
provide optimal implementations. I'm not sure whether SZ is fully optimized or
whether it can benefit from any hardware optimization. Besides that, is there
any adoption of SZ in other systems or libraries?

Best,
Gang

On Thu, Nov 9, 2023 at 10:48 PM Antoine Pitrou <[email protected]> wrote:
>
> Hello all,
>
> I agree with the sentiments expressed by Micah.
>
> * a lossy algorithm is more difficult to reason about for users;
> * it cannot be enabled by default, for obvious reasons;
> * the min/max statistics values should remain correct, that is: min
> should be a lower bound, max an upper bound;
> * adding niche encodings does not seem particularly attractive for the
> Parquet ecosystem and the maintainers of the various Parquet
> implementations.
>
> I would add that the encoding should either be very easy to understand
> and implement (such as BYTE_STREAM_SPLIT), or already well-established
> in the software ecosystem.
>
> Given the above, I also think there should be clear proof that this
> encoding brings very significant benefits over the status quo. I would
> suggest a comparison between the following combinations:
>
> * PLAIN encoding
> * PLAIN encoding + lz4 (or snappy)
> * PLAIN encoding + zstd
> * BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
> * BYTE_STREAM_SPLIT encoding + zstd
> * SZ encoding
> * SZ encoding + lz4 (or snappy)
> * SZ encoding + zstd
>
> The comparison should show the compression ratio,
> encoding+compression speed, and decompression+decoding speed.
>
> Regards
>
> Antoine.
>
>
> On Fri, 3 Nov 2023 15:04:29 +0000
> Michael Bernardi <[email protected]> wrote:
> > Dear all,
> >
> > Others at the Technical University of Munich and I are interested in
> > adding a new lossy compression algorithm to the Parquet format to
> > support the compression of floating point data. This is a continuation
> > of the work by Martin Radev. Here are some related links:
> >
> > Email thread:
> > https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> > Report:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> >
> > This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> > encoding, which allows the lossless compression algorithms to better
> > compress floating point data.
> >
> > Martin's report also investigated lossy compressors which can be
> > supplied with an error bound and, depending on this bound, deliver much
> > higher compression ratios for similar computing time. The SZ compression
> > library was found to be quite promising, but it was discounted at the
> > time due to issues with thread safety and the API being immature. In the
> > meantime these issues have largely been resolved and it's now possible
> > to use SZ with HDF5 (see the link below). Therefore I'd like to
> > reconsider adding it (or another similar algorithm) to Parquet.
> >
> > https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
> >
> > Whatever lossy compression method we choose, it would probably have to
> > be implemented as a Parquet encoding rather than a compression, for a
> > couple of reasons:
> >
> > 1) The algorithm can only compress a flat buffer of floating point
> > data. It therefore cannot be used for whole-file compression and must
> > be applied only to individual columns.
> > 2) If it were implemented as a compression, it would conflict with the
> > underlying encodings, which would make the floating point values
> > unreadable to the algorithm.
> >
> > Note that introducing lossy compression could lead to a situation where
> > values like min and max in the statistics might not be found in the
> > decompressed data. There are probably other considerations here that
> > I've missed.
> >
> > I look forward to reading your response.
> >
> > Best regards,
> > Michael Bernardi
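For concreteness, below is a rough sketch of the error-bounded quantisation
idea described at the top of this message. It is untested and purely
illustrative; the function names, the use of 64-bit integer codes, and the
handling of the error bound are assumptions on my part, not a concrete API
proposal. Each value is snapped to the nearest multiple of twice the
user-supplied absolute error bound, so every reconstructed value stays within
that bound of the original, and the resulting integer codes would then be
handed to the lossless codecs already present in Parquet.

    // Illustrative sketch only: error-bounded linear quantisation.
    // NaNs, infinities and values large enough to overflow the integer
    // codes are ignored here; a real encoding would have to handle them
    // and store the error bound alongside the encoded page.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    std::vector<int64_t> QuantizeWithBound(const std::vector<double>& values,
                                           double error_bound) {
      // A step of 2*error_bound keeps |value - reconstructed| <= error_bound.
      const double step = 2.0 * error_bound;
      std::vector<int64_t> codes;
      codes.reserve(values.size());
      for (double v : values) {
        codes.push_back(static_cast<int64_t>(std::llround(v / step)));
      }
      return codes;  // these integer codes would go to the lossless codec
    }

    std::vector<double> DequantizeWithBound(const std::vector<int64_t>& codes,
                                            double error_bound) {
      const double step = 2.0 * error_bound;
      std::vector<double> values;
      values.reserve(codes.size());
      for (int64_t c : codes) {
        values.push_back(static_cast<double>(c) * step);
      }
      return values;
    }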

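For readers unfamiliar with it, the BYTE_STREAM_SPLIT encoding referred to
throughout this thread works roughly as sketched below: byte k of every value
is gathered into stream k, which places the sign/exponent bytes next to each
other and lets the lossless codecs compress them much better. This is only an
illustration of the idea, not the reference implementation.

    // Illustrative sketch of BYTE_STREAM_SPLIT for FLOAT values.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<uint8_t> ByteStreamSplitEncode(const std::vector<float>& values) {
      const size_t n = values.size();
      std::vector<uint8_t> out(n * sizeof(float));
      for (size_t i = 0; i < n; ++i) {
        uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));
        for (size_t k = 0; k < sizeof(float); ++k) {
          out[k * n + i] = bytes[k];  // stream k holds byte k of every value
        }
      }
      return out;  // same size as the input; it only compresses better afterwards
    }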