Hi guys, I have a question about Parquet. I'm not sure whether this is a Drill question or a Parquet one, but I thought I'd start here.
I have a sample dataset of ~100M rows in a Parquet file. It's quick to sum a single column across the whole dataset. I also have a column with approx 100 unique values (e.g. a customer ID). When I filter on that column by one of those values (reducing the set to ~1M rows), the query takes longer than the unfiltered sum. This doesn't make a lot of sense to me: I would have expected Parquet to read only the segments that match the filter and sum only those values, which should make the query orders of magnitude faster, not slower. Other columnar formats I've used (e.g. ORCFile, SQL Server Columnstore) have behaved this way, so I can't quite understand why Parquet doesn't do the same. Can anyone suggest what I'm doing wrong?
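
To make the expectation concrete, here is a rough sketch of the kind of segment skipping I had in mind. It is plain Python, not Drill or Parquet code: the `Segment` class, `build_segments` and `filtered_sum` are hypothetical names I made up for illustration, and the data is a small made-up stand-in (100 customer IDs, 100,000 rows). It assumes each segment keeps a min/max for the customer ID column, and it also shows that such skipping can only help when the rows for a customer are clustered together.

```python
# Hypothetical illustration only: this is NOT Drill or Parquet code.
from dataclasses import dataclass


@dataclass
class Segment:
    min_id: int   # smallest customer ID stored in this segment
    max_id: int   # largest customer ID stored in this segment
    rows: list    # list of (customer_id, amount) tuples


def build_segments(rows, segment_size):
    """Split rows into fixed-size segments and record min/max customer ID."""
    segments = []
    for start in range(0, len(rows), segment_size):
        chunk = rows[start:start + segment_size]
        ids = [cid for cid, _ in chunk]
        segments.append(Segment(min(ids), max(ids), chunk))
    return segments


def filtered_sum(segments, customer_id):
    """Sum amounts for one customer, skipping segments whose min/max rule it out."""
    total = 0
    scanned = 0
    for seg in segments:
        if customer_id < seg.min_id or customer_id > seg.max_id:
            continue  # the statistics prove this segment has no matching rows
        scanned += 1
        total += sum(amount for cid, amount in seg.rows if cid == customer_id)
    return total, scanned


# 100 customer IDs, 1,000 rows each, every row with amount 1.
rows = [(i % 100, 1) for i in range(100_000)]

interleaved = build_segments(rows, 10_000)        # IDs spread across all segments
clustered = build_segments(sorted(rows), 10_000)  # rows sorted by customer ID

print(filtered_sum(interleaved, 42))  # all 10 segments must be scanned
print(filtered_sum(clustered, 42))    # only 1 segment needs scanning
```

Both calls return the same sum, but the clustered layout scans 1 segment instead of 10. That is the speed-up I was expecting from the filter, assuming the engine uses per-segment statistics in this way.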