Parquet and filtering

Adam Gilmore Mon, 05 Jan 2015 06:16:55 -0800

Hi guys,

I have a question about Parquet.  I'm not sure whether this is a Drill
question or a Parquet one, but I thought I'd start here.


I have a sample dataset of ~100M rows in a Parquet file.  It's quick to sum
a single column across the whole dataset.

One column has approximately 100 unique values (e.g. a customer ID).
When I filter on that column by one of those values (reducing the set to
~1M rows), the query takes longer than the unfiltered sum.

This doesn't make a lot of sense to me - I would have expected the Parquet
format to read back only the segments that can contain the filter value and
sum only those values.  I would expect this to make the query orders of
magnitude faster, not slower.
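
To make that expectation concrete, here is a rough Python sketch of the
behaviour I had in mind.  It is a toy model only, not how Drill or Parquet
actually implement anything: the RowGroup class, the filtered_sum function
and the sample data are all made up for illustration.  The assumption is
that each segment keeps min/max statistics for the filter column, so a
reader can skip segments whose range cannot contain the value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one segment of a columnar file."""
    customer_ids: List[int]  # filter column
    amounts: List[int]       # column being summed

    @property
    def min_id(self) -> int:
        return min(self.customer_ids)

    @property
    def max_id(self) -> int:
        return max(self.customer_ids)


def filtered_sum(row_groups: List[RowGroup], customer_id: int) -> Tuple[int, int]:
    """Sum amounts for one customer, skipping segments via min/max stats.

    Returns (total, number_of_segments_actually_scanned).
    """
    total = 0
    scanned = 0
    for rg in row_groups:
        # Skip the whole segment if the value lies outside its [min, max] range.
        if customer_id < rg.min_id or customer_id > rg.max_id:
            continue
        scanned += 1
        total += sum(
            amount
            for cid, amount in zip(rg.customer_ids, rg.amounts)
            if cid == customer_id
        )
    return total, scanned


# Made-up sample data, clustered by customer ID.
groups = [
    RowGroup([1, 1, 2], [10, 20, 30]),
    RowGroup([3, 3, 4], [40, 50, 60]),
    RowGroup([5, 6, 6], [70, 80, 90]),
]

total, scanned = filtered_sum(groups, 3)
print(total, scanned)  # prints: 90 1
```

Note that in this toy model the skipping only helps because the sample data
is clustered by customer ID.  If each customer's rows were spread across
every segment, every min/max range would overlap the filter value and
nothing would be skipped.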

Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore)
have acted this way, so I can't quite understand why Parquet doesn't act
the same.

Can anyone suggest what I'm doing wrong?
