Hi all,

Thank you for your thoughts.
In the meantime I've done further experiments and decided against using SZ in
its current state. The library has improved, but while experimenting with it
I've noticed some issues which make it a poor fit for inclusion in Parquet.

To Gang's questions: SZ has reasonable performance and supports OpenMP. It can
be used with HDF5, but that support is provided by means of an external
plugin. Supporting SZ in Parquet through such a plugin would be the better
approach, if it were possible; I no longer think it would be a good idea to
include SZ in Parquet itself.

All the lossy floating point compressors I've investigated work via some form
of mathematical transformation or prediction, followed by quantisation,
encoding, and finally lossless compression. In our use case of single-variable
time series data, quantisation and encoding alone (skipping the
transformation/prediction stages) often provide most of the benefit anyway.
Therefore I'd like to investigate adding a simple encoding scheme which
quantises and encodes the floating point values while respecting a
user-defined error bound (a rough sketch of the idea is appended after the
quoted thread below). Similar to the BYTE_STREAM_SPLIT encoding (also sketched
at the end of this message), this would leverage the lossless compression
algorithms already present in Parquet.

If and when I have something that is easy to integrate and useful, I'll make a
new proposal.

Regards,
Michael

________________________________
From: Gang Wu <[email protected]>
Sent: Monday, 13 November 2023 7:03:41 AM
To: [email protected]
Cc: [email protected]
Subject: Re: Lossy compression of floating point data

Hi Michael,

It seems that the datasets used in the experiment are very large. The
compression unit in Parquet is a page, which usually contains 20,000 values or
no more than 1 MB in terms of raw size. So I agree with Antoine that we need
more real-case experiments to compare against what Parquet can currently
offer.

Other than that, the report says some floating point compressors do not
provide optimal implementations. I'm not sure whether SZ is fully optimized or
whether it can benefit from any hardware optimization. Besides that, is there
any adoption of SZ in other systems or libraries?

Best,
Gang

On Thu, Nov 9, 2023 at 10:48 PM Antoine Pitrou <[email protected]> wrote:
>
> Hello all,
>
> I agree with the sentiments expressed by Micah.
>
> * a lossy algorithm is more difficult to reason about for users;
> * it cannot be enabled by default, for obvious reasons;
> * the min/max statistics values should remain correct, that is: min
> should be a lower bound, max an upper bound;
> * adding niche encodings does not seem particularly attractive for the
> Parquet ecosystem and the maintainers of the various Parquet
> implementations.
>
> I would add that the encoding should either be very easy to understand
> and implement (such as BYTE_STREAM_SPLIT), or already well-established
> in the software ecosystem.
>
> Given the above, I also think there should be clear proof that this
> encoding brings very significant benefits over the status quo. I would
> suggest a comparison between the following combinations:
>
> * PLAIN encoding
> * PLAIN encoding + lz4 (or snappy)
> * PLAIN encoding + zstd
> * BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
> * BYTE_STREAM_SPLIT encoding + zstd
> * SZ encoding
> * SZ encoding + lz4 (or snappy)
> * SZ encoding + zstd
>
> The comparison should show the compression ratio,
> encoding+compression speed, and decompression+decoding speed.
>
> Regards
>
> Antoine.
>
>
> On Fri, 3 Nov 2023 15:04:29 +0000
> Michael Bernardi <[email protected]> wrote:
> > Dear all,
> >
> > Others at the Technical University of Munich and I are interested in
> > adding a new lossy compression algorithm to the Parquet format to
> > support the compression of floating point data. This is a continuation
> > of the work by Martin Radev. Here are some related links:
> >
> > Email thread:
> > https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> > Report:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> >
> > This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> > encoding, which allows the lossless compression algorithms to better
> > compress floating point data.
> >
> > Martin's report also investigated lossy compressors which can be
> > supplied with an error bound and, depending on this bound, deliver much
> > higher compression ratios for similar computing time. The SZ compression
> > library was found to be quite promising, but it was discounted at the
> > time due to issues with thread safety and the API being immature. In the
> > meantime these issues have largely been resolved and it's now possible
> > to use SZ with HDF5 (see the link below). Therefore I'd like to
> > reconsider adding it (or another similar algorithm) to Parquet.
> >
> > https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
> >
> > Whatever lossy compression method we choose, it would probably have to
> > be implemented as a Parquet encoding rather than a compression, for a
> > couple of reasons:
> >
> > 1) The algorithm can only compress a flat buffer of floating point
> > data. It therefore cannot be used for whole-file compression and must
> > be applied only to individual columns.
> > 2) If it were implemented as a compression, it would conflict with the
> > underlying encodings, which would make the floating point values
> > unreadable to the algorithm.
> >
> > Note that introducing lossy compression could lead to a situation where
> > values like min and max in the statistics might not be found in the
> > decompressed data. There are probably other considerations here that
> > I've missed.
> >
> > I look forward to reading your response.
> >
> > Best regards,
> > Michael Bernardi
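For concreteness, below is a rough sketch of the error-bounded quantisation
idea described at the top of this message. It is untested and purely
illustrative; the function names, the use of 64-bit integer codes, and the
handling of the error bound are assumptions on my part, not a concrete API
proposal. Each value is snapped to the nearest multiple of twice the
user-supplied absolute error bound, so every reconstructed value stays within
that bound of the original, and the resulting integer codes would then be
handed to the lossless codecs already present in Parquet.

    // Illustrative sketch only: error-bounded linear quantisation.
    // NaNs, infinities and values large enough to overflow the integer
    // codes are ignored here; a real encoding would have to handle them
    // and store the error bound alongside the encoded page.
    #include <cmath>
    #include <cstdint>
    #include <vector>

    std::vector<int64_t> QuantizeWithBound(const std::vector<double>& values,
                                           double error_bound) {
      // A step of 2*error_bound keeps |value - reconstructed| <= error_bound.
      const double step = 2.0 * error_bound;
      std::vector<int64_t> codes;
      codes.reserve(values.size());
      for (double v : values) {
        codes.push_back(static_cast<int64_t>(std::llround(v / step)));
      }
      return codes;  // these integer codes would go to the lossless codec
    }

    std::vector<double> DequantizeWithBound(const std::vector<int64_t>& codes,
                                            double error_bound) {
      const double step = 2.0 * error_bound;
      std::vector<double> values;
      values.reserve(codes.size());
      for (int64_t c : codes) {
        values.push_back(static_cast<double>(c) * step);
      }
      return values;
    }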

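For readers unfamiliar with it, the BYTE_STREAM_SPLIT encoding referred to
throughout this thread works roughly as sketched below: byte k of every value
is gathered into stream k, which places the sign/exponent bytes next to each
other and lets the lossless codecs compress them much better. This is only an
illustration of the idea, not the reference implementation.

    // Illustrative sketch of BYTE_STREAM_SPLIT for FLOAT values.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    std::vector<uint8_t> ByteStreamSplitEncode(const std::vector<float>& values) {
      const size_t n = values.size();
      std::vector<uint8_t> out(n * sizeof(float));
      for (size_t i = 0; i < n; ++i) {
        uint8_t bytes[sizeof(float)];
        std::memcpy(bytes, &values[i], sizeof(float));
        for (size_t k = 0; k < sizeof(float); ++k) {
          out[k * n + i] = bytes[k];  // stream k holds byte k of every value
        }
      }
      return out;  // same size as the input; it only compresses better afterwards
    }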