Yep. And doing some compression during encoding isn't really a new thing.
For instance, on the air quality dataset, "uncompressed" Parquet gets a
compression ratio of about 3.6. Existing encodings sometimes apply deltas or
varints to reduce integer data size.
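
To illustrate (just a rough sketch of the delta + varint idea in Rust, not
Parquet's actual DELTA_BINARY_PACKED encoding, which is block-based and
more involved):

    // Delta-encode an integer column, then write each delta as a
    // zigzag-mapped LEB128 varint so small deltas take few bytes.
    fn zigzag(v: i64) -> u64 {
        ((v << 1) ^ (v >> 63)) as u64
    }

    fn write_varint(mut v: u64, out: &mut Vec<u8>) {
        while v >= 0x80 {
            out.push((v as u8) | 0x80);
            v >>= 7;
        }
        out.push(v as u8);
    }

    fn delta_varint_encode(values: &[i64]) -> Vec<u8> {
        let mut out = Vec::new();
        let mut prev = 0i64;
        for &v in values {
            write_varint(zigzag(v.wrapping_sub(prev)), &mut out);
            prev = v;
        }
        out
    }

A slowly-varying column (e.g. timestamps) shrinks a lot under a scheme like
this, which is how "uncompressed" Parquet still beats the raw data size.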

On Wed, Jan 3, 2024 at 12:14 AM wish maple <[email protected]> wrote:

> Hi Martin,
>
> Parquet has separate "Compression" and "Encoding" steps. So, is this new
> method part of the integer/floating-point encoding, but also doing some
> of the compression work?
>
> Best,
> Xuwei Fu
>
Martin Loncaric <[email protected]> wrote on Wed, Jan 3, 2024 at 13:10:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would perform
> > on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR:
> > pco losslessly achieves a much better compression ratio (44-158% higher)
> > and slightly faster decompression speed than zstd-compressed Parquet. On
> > the other hand, it compresses somewhat slower at the default compression
> > level, but I think this difference may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win, but
> > I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data page)
> > at a time, with no way to estimate the encoded size until it has already
> > done >50% of the compression work. I suspect the best solution is to
> > split pco data pages based on unencoded size, which is different from
> > existing encodings. I think this makes sense since pco fulfills the role
> > usually played by compression in Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin
> >
>
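
For what it's worth, the "split pco data pages based on unencoded size"
idea above could look roughly like this on the write path (a hypothetical
sketch in Rust; encode_pco_chunk and write_data_page are placeholders, not
real pco or Parquet APIs):

    // Since a pco chunk's encoded size can't be cheaply estimated up
    // front, cut pages by a fixed budget of *unencoded* bytes instead.
    const PAGE_UNENCODED_BYTES: usize = 1 << 20; // e.g. 1 MiB of raw values

    fn write_column(values: &[f64]) {
        let values_per_page = PAGE_UNENCODED_BYTES / std::mem::size_of::<f64>();
        for page_values in values.chunks(values_per_page) {
            // One pco chunk per Parquet data page.
            let encoded = encode_pco_chunk(page_values);
            write_data_page(&encoded);
        }
    }

    // Placeholders standing in for the real pco encoder and page writer.
    fn encode_pco_chunk(_values: &[f64]) -> Vec<u8> { unimplemented!() }
    fn write_data_page(_bytes: &[u8]) { unimplemented!() }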
