Hi Martin,
The results are impressive. However, I'll point you to a recent prior
discussion on a proposed new encoding/compression technique
<https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv> [1].
While pco seems to avoid the lossiness concerns raised there, that thread
also suggests benchmarks to use for comparison.

I think there are still two issues that apply here:

1.  Requiring a Rust toolchain (apologies, but this seems to be Rust-only
at the moment) plus FFI bindings for Java and other non-Rust
implementations makes it much harder, I think, for those implementations
to adopt this encoding (see the sketch after this list). For instance, my
organization does not currently allow Rust code in production.
2.  Pco seems relatively new and not yet well established in the
ecosystem, which makes ongoing support a higher risk.
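
To make the packaging concern in (1) concrete, here is a rough sketch of
the kind of JNI shim a Java Parquet implementation would need in order to
call into a Rust pco library. The class and method names below are purely
illustrative, not the actual pco API:

    // Hypothetical JNI wrapper around a Rust pco cdylib; names are
    // illustrative only, not the real pco interface.
    public final class PcoNative {
        static {
            // Loads libpco_jni.{so,dylib,dll}; every consumer needs a
            // prebuilt artifact for their platform, or a Rust toolchain
            // to build one.
            System.loadLibrary("pco_jni");
        }

        // Decodes one pco chunk into doubles; implemented in Rust behind FFI.
        public static native double[] decodeChunk(byte[] encoded);

        private PcoNative() {}
    }

Building, shipping, and maintaining that native artifact across platforms
is the adoption burden I have in mind, independent of how good the
encoding itself is.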

Thanks,
Micah


[1] https://lists.apache.org/thread/z8fnoq3lm5t67rfz74fwzj5qytzyy4gv

On Tue, Jan 2, 2024 at 9:10 PM Martin Loncaric <[email protected]>
wrote:

> I'd like to propose and get feedback on a new encoding for numerical
> columns: pco. I just did a blog post demonstrating how this would perform
> on various real-world datasets
> <https://graphallthethings.com/posts/the-parquet-we-could-have>. TL;DR:
> pco
> losslessly achieves much better compression ratio (44-158% higher) and
> slightly faster decompression speed than zstd-compressed Parquet. On the
> other hand, it compresses somewhat slower at default compression level, but
> I think this difference may disappear in future updates.
>
> I think supporting this optional encoding would be an enormous win, but I'm
> not blind to the difficulties of implementing it:
> * Writing a good JVM implementation would be very difficult, so we'd
> probably have to make a JNI library.
> * Pco must be compressed one "chunk" (probably one per Parquet data page)
> at a time, with no way to estimate the encoded size until it has already
> done >50% of the compression work. I suspect the best solution is to split
> pco data pages based on unencoded size, which is different from existing
> encodings. I think this makes sense since pco fulfills the role usually
> played by compression in Parquet.
>
> Please let me know what you think of this idea.
>
> Thanks,
> Martin
>
