Yep. And doing some compression during encoding isn't really a new thing. For instance, on the air quality dataset, "uncompressed" Parquet already gets a compression ratio of about 3.6: existing encodings apply deltas or varints to shrink integer data.
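As a rough illustration of that style of encoding, here's a minimal sketch in Java (illustrative only; it is not Parquet's actual DELTA_BINARY_PACKED encoding, which bit-packs deltas into miniblocks rather than varint-coding each one):

    import java.io.ByteArrayOutputStream;

    // Illustrative delta + zigzag + varint encoder: consecutive values that
    // are close together encode to 1-2 bytes each, whatever their magnitude.
    public class DeltaVarintSketch {
        static byte[] encode(long[] values) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long prev = 0;
            for (long v : values) {
                long delta = v - prev;
                prev = v;
                // zigzag-map so negative deltas also get small unsigned codes
                long zz = (delta << 1) ^ (delta >> 63);
                // LEB128-style varint: 7 payload bits per byte, high bit = "more"
                while ((zz & ~0x7FL) != 0) {
                    out.write((int) ((zz & 0x7FL) | 0x80L));
                    zz >>>= 7;
                }
                out.write((int) zz);
            }
            return out.toByteArray();
        }
    }

On sorted data like timestamps, this turns 8-byte values into mostly 1-2 byte codes before any general-purpose compressor even runs.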
On Wed, Jan 3, 2024 at 12:14 AM wish maple <[email protected]> wrote:

> Hi Martin,
>
> Parquet has separate "Compression" and "Encoding" parts. So, is this new
> method a part of integer/floating-point encoding, but one that also does
> some of the compression work?
>
> Best,
> Xuwei Fu
>
> Martin Loncaric <[email protected]> wrote on Wed, Jan 3, 2024 at 13:10:
>
> > I'd like to propose and get feedback on a new encoding for numerical
> > columns: pco. I just did a blog post demonstrating how this would
> > perform on various real-world datasets
> > <https://graphallthethings.com/posts/the-parquet-we-could-have>.
> > TL;DR: pco losslessly achieves a much better compression ratio
> > (44-158% higher) and slightly faster decompression speed than
> > zstd-compressed Parquet. On the other hand, it compresses somewhat
> > slower at the default compression level, but I think this difference
> > may disappear in future updates.
> >
> > I think supporting this optional encoding would be an enormous win,
> > but I'm not blind to the difficulties of implementing it:
> > * Writing a good JVM implementation would be very difficult, so we'd
> > probably have to make a JNI library.
> > * Pco must be compressed one "chunk" (probably one per Parquet data
> > page) at a time, with no way to estimate the encoded size until it has
> > already done >50% of the compression work. I suspect the best solution
> > is to split pco data pages based on unencoded size, which is different
> > from existing encodings. I think this makes sense since pco fulfills
> > the role usually played by compression in Parquet.
> >
> > Please let me know what you think of this idea.
> >
> > Thanks,
> > Martin
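To make the page-splitting point above concrete, here is a minimal sketch of what a writer could do. PcoJni.compressChunk is a hypothetical stand-in for the JNI binding, not an existing API:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: since pco's encoded size can't be estimated cheaply, split
    // data pages by *unencoded* size. Buffer raw values, and once the buffer
    // hits a target byte count, compress the whole chunk in one shot.
    class PcoPageWriterSketch {
        static final int TARGET_UNENCODED_BYTES = 1 << 20; // arbitrary 1 MiB target
        private final List<Long> buffer = new ArrayList<>();
        private final List<byte[]> pages = new ArrayList<>();

        void write(long value) {
            buffer.add(value);
            if ((long) buffer.size() * Long.BYTES >= TARGET_UNENCODED_BYTES) {
                flushPage();
            }
        }

        void flushPage() {
            if (buffer.isEmpty()) return;
            long[] chunk = buffer.stream().mapToLong(Long::longValue).toArray();
            pages.add(PcoJni.compressChunk(chunk)); // one pco chunk per data page
            buffer.clear();
        }
    }

    // Hypothetical placeholder for a JNI binding to the Rust pco library.
    class PcoJni {
        static byte[] compressChunk(long[] chunk) {
            throw new UnsupportedOperationException("native pco call would go here");
        }
    }

The difference from existing encodings is that they can track their encoded size as they go and cut a page when it reaches the target, whereas here only the unencoded size is known up front.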
