Hi,

We are using Parquet with Drill and we are quite happy, thank you all very
much.

We use Drill to query it and I wonder if there are some sort of best
practices, recommended setup or any tips you could share.

I also wanted to ask about some of the things we think/hope are in scope and
what effect they will have on performance.

*Timestamp support (bigint + delta encoding)*
We are using Avro for inbound/fresh data, and I believe Avro 1.8 finally has
date/timestamp support. I wonder when Parquet will support timestamps with
millisecond precision in a more efficient (encoded) way.
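To be concrete about what I mean: since millisecond timestamps are just
monotone-ish int64 values, the deltas between consecutive values are tiny
compared to the absolute values and should compress very well. A toy sketch of
the idea (not Parquet's actual DELTA_BINARY_PACKED implementation, just the
principle):

```python
def delta_encode(values):
    """Store the first value, then successive differences."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original values by a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

# Millisecond timestamps: large absolute values, small deltas.
ts = [1461110400000, 1461110400250, 1461110400500, 1461110401000]
encoded = delta_encode(ts)  # [1461110400000, 250, 250, 500]
assert delta_decode(encoded) == ts
```

The deltas (250, 250, 500) fit in far fewer bits than the raw 41-bit
millisecond values, which is where the encoding win would come from.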

*Predicate-Pushdown for dictionary values*
I hope I'm using the right terms, but I'm basically referring to the ability
to skip segments if the value being searched for is not in the dictionary for
that segment (when/if dictionary encoding is used). I may be wrong in
thinking that this will speed up our queries quite a bit, but I believe our
data and some of our queries would benefit.
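In pseudo-code terms, what I'm hoping for is something like the following
(a toy model, not Parquet reader internals; the names are mine):

```python
def can_skip_segment(dictionary, predicate_value):
    # If a column chunk is fully dictionary-encoded, a value that is
    # absent from the dictionary cannot occur anywhere in the chunk,
    # so the reader may skip the whole segment without decoding it.
    return predicate_value not in dictionary

# Hypothetical per-segment dictionaries for a low-cardinality column.
segments = [
    {"country_dict": {"IS", "NO", "DK"}, "rows": 100_000},
    {"country_dict": {"US", "CA"},       "rows": 100_000},
]

# Query: WHERE country = 'IS' -- only the first segment needs scanning.
to_scan = [s for s in segments
           if not can_skip_segment(s["country_dict"], "IS")]
```

For selective predicates on low-cardinality columns this would eliminate
whole segments of I/O and decoding, which is why I suspect it would help us.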

*Bloom Filters*
I followed some discussion here on implementing Bloom filters and some
initial tests that were done to assess the possible benefits. How did that
go? (Meaning: will it be done, and are there any initial numbers regarding
the potential gain?)
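My understanding of the appeal is that a Bloom filter gives the same kind of
segment skipping as the dictionary case, but for high-cardinality columns
where storing a dictionary is impractical: "definitely not present" is exact,
"maybe present" has a tunable false-positive rate. A minimal sketch of the
data structure (my own toy version, not the proposed Parquet implementation):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions set in a fixed-size bitset."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0  # Python int used as a bitset.

    def _positions(self, item):
        # Derive k independent positions by salting a single hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("user_42")
assert bf.might_contain("user_42")  # added items always test positive
# An unseen value will usually test negative, but false positives
# are possible by design.
```

A reader could consult such a per-segment filter before touching the data
pages, which is where I'd expect the gains to show up.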

*Multi column overhead*
We are seeing that queries that fetch values from many columns are a lot
slower than the "same" queries run with only a few columns. This is to be
expected, but I wonder if there are any tricks/tips available here. We are,
for example, using nested structures that could be flattened, but that seems
irrelevant.
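Our mental model of why this happens is simply that in a columnar layout the
scan cost scales with the columns actually projected, so every extra column in
the SELECT list is extra I/O and decoding. A toy illustration of that model
(not Drill or Parquet code):

```python
# Toy columnar "file": each column is stored contiguously, so a reader
# only pays for the columns it projects.
table = {
    "user_id": list(range(1_000)),
    "event":   ["click"] * 1_000,
    "payload": ["x" * 100] * 1_000,  # a wide column we'd rather not touch
}

def scan(table, columns):
    """Materialize only the projected columns; work scales with them."""
    return {c: list(table[c]) for c in columns}

narrow = scan(table, ["user_id"])   # touches 1 column
wide = scan(table, list(table))     # touches all 3, including the wide one
```

If that model is right, the main "trick" would just be projecting as few
columns as possible, which is what prompts the question about whether
anything more can be done on the file-layout side.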

Best regards,
 -Stefán
