Hi Michael,

It seems that the datasets used in the experiment are very large.
The compression unit in Parquet is a page, which typically holds
around 20,000 values or at most about 1 MB of raw data. So I agree
with Antoine that we need more real-world experiments to compare
against what Parquet can currently offer. Beyond that, the report
says some floating point compressors do not provide an optimal
implementation. I'm not sure whether SZ is fully optimized, or
whether it can benefit from any hardware optimization. Also, has
SZ seen any adoption in other systems or libraries?
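
For reference, page size itself is a writer knob. A minimal sketch with
pyarrow (data_page_size is the option current pyarrow exposes, but worth
double-checking against your version):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": np.random.rand(1_000_000).astype(np.float32)})
    # Pages are the unit of compression, so capping their raw size
    # directly limits how much context the codec sees.
    pq.write_table(table, "floats.parquet", data_page_size=1024 * 1024)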

Best,
Gang

On Thu, Nov 9, 2023 at 10:48 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello all,
>
> I agree with the sentiments expressed by Micah.
>
> * a lossy algorithm is more difficult to reason about for users;
> * it cannot be enabled by default, for obvious reasons;
> * the min/max statistics values should remain correct, that is: min
>   should be a lower bound, max an upper bound;
> * adding niche encodings does not seem particularly attractive for the
>   Parquet ecosystem and the maintainers of the various Parquet
>   implementations.
>
> I would add that the encoding should either be very easy to understand
> and implement (such as BYTE_STREAM_SPLIT), or already well-established
> in the software ecosystem.
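>
> (For illustration, the encode side of BYTE_STREAM_SPLIT fits in a few
> lines of numpy; a sketch, not the reference implementation:)
>
>     import numpy as np
>
>     def byte_stream_split(values):
>         # View the floats as raw bytes, one row per value, then
>         # transpose so stream k holds the k-th byte of every value;
>         # grouping like bytes is what helps the compression stage.
>         b = values.view(np.uint8).reshape(len(values), values.itemsize)
>         return np.ascontiguousarray(b.T)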
>
> Given the above, I also think there should be clear proof that this
> encoding brings very significant benefits over the status quo. I would
> suggest a comparison between the following combinations:
>
> * PLAIN encoding
> * PLAIN encoding + lz4 (or snappy)
> * PLAIN encoding + zstd
> * BYTE_STREAM_SPLIT encoding + lz4 (or snappy)
> * BYTE_STREAM_SPLIT encoding + zstd
> * SZ encoding
> * SZ encoding + lz4 (or snappy)
> * SZ encoding + zstd
>
> The comparison should show the compression ratio,
> encoding+compression speed, and decompression+decoding speed.
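>
> A rough sketch of such a benchmark with pyarrow (SZ is left out, since
> it has no Parquet binding yet; the column_encoding option and the
> encoding names should be double-checked against your pyarrow version):
>
>     import os
>     import time
>
>     import numpy as np
>     import pyarrow as pa
>     import pyarrow.parquet as pq
>
>     table = pa.table({"x": np.random.rand(10_000_000).astype(np.float32)})
>
>     for encoding in ("PLAIN", "BYTE_STREAM_SPLIT"):
>         for codec in ("NONE", "LZ4", "ZSTD"):
>             t0 = time.time()
>             pq.write_table(table, "out.parquet",
>                            use_dictionary=False,  # required for column_encoding
>                            column_encoding={"x": encoding},
>                            compression=codec)
>             t1 = time.time()
>             pq.read_table("out.parquet")
>             t2 = time.time()
>             print(encoding, codec, os.path.getsize("out.parquet"),
>                   f"write={t1 - t0:.2f}s read={t2 - t1:.2f}s")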
>
> Regards
>
> Antoine.
>
>
>
> On Fri, 3 Nov 2023 15:04:29 +0000
> Michael Bernardi <michael.berna...@tum.de> wrote:
> > Dear all,
> >
> > Others at the Technical University of Munich and I are interested in
> > adding a new lossy compression algorithm to the Parquet format to
> > support the compression of floating point data. This is a continuation
> > of the work by Martin Radev. Here are some related links:
> >
> > Email thread:
> > https://lists.apache.org/thread/5hz040d4dd4ctk51qy11wojp2v5k2kxn
> > Report:
> > https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv
> >
> > This work ultimately resulted in the addition of the BYTE_STREAM_SPLIT
> > encoding, which allows the lossless compression algorithms to better
> > compress floating point data.
> >
> > Martin's report also investigated lossy compressors that can be
> > supplied with an error bound and, depending on this bound, deliver
> > much higher compression ratios for similar computing time. The SZ
> > compression library was found to be quite promising, but it was ruled
> > out at the time due to thread-safety issues and an immature API. In
> > the meantime these issues have largely been resolved, and it is now
> > possible to use SZ with HDF5 (see the link below). I would therefore
> > like to reconsider adding it (or another similar algorithm) to Parquet.
> >
> > https://github.com/szcompressor/SZ3/tree/d2a03eae45730997be64126961d7abda0f950791/tools/H5Z-SZ3
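> >
> > For instance, with h5py plus the hdf5plugin package (which, as far as
> > I know, now ships an SZ3 filter; the exact name of the error-bound
> > parameter is an assumption to verify against the hdf5plugin docs):
> >
> >     import numpy as np
> >     import h5py
> >     import hdf5plugin
> >
> >     data = np.random.rand(1_000_000).astype(np.float32)
> >     with h5py.File("sz3_test.h5", "w") as f:
> >         # absolute=1e-3 requests an absolute error bound of 1e-3
> >         # (parameter name assumed; check your hdf5plugin version)
> >         f.create_dataset("x", data=data,
> >                          **hdf5plugin.SZ3(absolute=1e-3))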
> >
> > Whatever lossy compression method we choose, it would probably have to
> > be implemented as a Parquet encoding rather than a compression, for a
> > couple of reasons:
> >
> > 1) The algorithm can only compress a flat buffer of floating point
> > data. It therefore cannot be used for whole-file compression and must
> > be applied to individual columns (see the sketch after this list).
> > 2) If it were implemented as a compression, it would conflict with the
> > underlying encodings, which would make the floating point values
> > unreadable to the algorithm.
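> >
> > To illustrate point 1: the heart of an SZ-style compressor is
> > error-bounded quantization over a flat buffer of floats. A toy sketch
> > of the idea (not SZ's actual algorithm, which also predicts each value
> > from its neighbours before quantizing the residual):
> >
> >     import numpy as np
> >
> >     def lossy_encode(values, eps):
> >         # Snap each value to a grid of spacing 2*eps, which guarantees
> >         # abs(decoded - original) <= eps. The resulting small integers
> >         # compress very well with a general-purpose codec.
> >         return np.round(values / (2 * eps)).astype(np.int64)
> >
> >     def lossy_decode(quantized, eps):
> >         return quantized.astype(np.float64) * (2 * eps)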
> >
> > Note that lossy compression could create a situation where values such
> > as min and max in the page statistics are not actually present in the
> > decompressed data; one mitigation would be to widen the written bounds
> > by the error bound, as sketched below. There are probably other
> > considerations here that I've missed.
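> >
> > A minimal sketch of that widening:
> >
> >     def widened_stats(values, eps):
> >         # eps = absolute error bound given to the lossy encoder.
> >         # Widening keeps min a lower bound and max an upper bound
> >         # for anything a reader can decode from the lossy stream.
> >         return min(values) - eps, max(values) + eps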
> >
> > I look forward to reading your response.
> >
> > Best regards,
> > Michael Bernardi