Added to my to-do list. I'm debugging our Parquet v2 page reader code
at the moment, then I'll do a combined post about "Parquet improvements".
On 2021/09/29 16:46, Ted Dunning wrote:
A blog is a great idea.
I am curious about how much compression costs.
On Wed, Sep 29, 2021 at 5:37 AM luoc <[email protected]> wrote:
James, you are doing fine.
Is it possible to post a new blog on the website about this?
On 29 Sep 2021, at 20:27, James Turton <[email protected]> wrote:
Hi all
We've got support for reading and writing Parquet with additional
compression codecs in master now. Here are the footprints of a 25M-record
dataset compressed by Drill with different codecs.
| Codec | Size on disk (MB) |
| ------ | ----------------- |
| brotli | 87 |
| gzip | 80 |
| lz4 | 100.6 |
| lzo | 100.8 |
| snappy | 192 |
| zstd | 85 |
| none | 2152 |
I haven't measured the (de)compression speed differences myself, but there
are many such benchmarks around on the web, and the differences can be
big *if* you've got a workload that is CPU-bound on (de)compression.
Beyond that there are the usual considerations: better utilisation of the
OS page cache by the higher-compression-ratio codecs, less I/O when data
must come from disk, etc. Zstd is probably the one I'll be putting into
`store.parquet.compression` myself at this point.
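If you want to try it, something like the following should do the trick. This is just a minimal sketch: the `dfs.tmp` workspace, table name and source path are placeholders, and `'zstd'` is one of the newly supported codec values.

```sql
-- Select the Parquet codec for the current session (a sketch; swap in
-- 'gzip', 'brotli', 'lz4', 'lzo', 'snappy' or 'none' to compare codecs).
ALTER SESSION SET `store.parquet.compression` = 'zstd';

-- Rewrite a dataset as zstd-compressed Parquet. The workspace, table name
-- and source path below are placeholders for illustration only.
CREATE TABLE dfs.tmp.`my_dataset_zstd` AS
SELECT * FROM dfs.`/data/my_dataset`;
```

Switching back afterwards is just another `ALTER SESSION SET` with the old value.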
Happy Drilling!
James