I have seen some discussions of the Parquet storage format suggesting that sorting time series data on the time key before converting to Parquet improves range query efficiency via the min/max statistics kept for each column chunk (perhaps analogous to skip indexes?).
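To make my understanding concrete, here is a minimal sketch of what I mean, using pyarrow (the column names, file name, and row group size are placeholders for illustration, not anything Drill-specific):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical time series; "ts" and "value" are placeholder names.
df = pd.DataFrame({
    "ts": pd.date_range("2016-01-01", periods=1_000_000, freq="s"),
    "value": range(1_000_000),
})

# Sort on the time key first, so each row group (and therefore each
# column chunk) covers a narrow, non-overlapping time interval. The
# writer records min/max statistics per column chunk automatically;
# my assumption is that a reader can then skip chunks whose min/max
# range falls entirely outside a query's time predicate.
df = df.sort_values("ts")

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "events.parquet", row_group_size=128_000)
```

Whether Drill's Parquet reader actually uses those statistics for pruning is the part I am unsure about.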

Is this a recommended approach for data accessed via Drill?

In addition, for data stored in HDFS for Drill that grows at a regular rate and is mainly subject to time range queries, is it worthwhile to partition it by date into subdirectories?
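Concretely, the directory layout I have in mind is something like what this pyarrow sketch produces (the dataset root "events" and the derived "dt" column are hypothetical):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "ts": pd.date_range("2016-01-01", periods=100_000, freq="min"),
    "value": range(100_000),
})
# Derive a date string column to partition on (hypothetical scheme).
df["dt"] = df["ts"].dt.strftime("%Y-%m-%d")

table = pa.Table.from_pandas(df, preserve_index=False)

# One subdirectory is created per distinct partition value, e.g.
# events/dt=2016-01-01/..., events/dt=2016-01-02/..., which is the
# directory-per-date structure I am asking about.
pq.write_to_dataset(table, root_path="events", partition_cols=["dt"])
```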

For example, in PostgreSQL I might partition tables by month so that queries including the partition date column hit only the relevant partitions directly (with the extra benefit that space management does not touch all date ranges).

Segmenting the data into directories in HDFS would require clients to structure their queries accordingly, but would it reduce query time by limiting the scan range?
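As an example of the kind of pruning I mean, a reader that understands the directory scheme above could do something like this (again a pyarrow sketch; whether Drill performs the equivalent pruning on the directory structure is exactly what I am asking):

```python
import pyarrow.parquet as pq

# With the dt=... layout above, a filter on the partition column
# should cause only the matching subdirectories to be read instead
# of scanning the whole dataset. ISO date strings compare
# lexicographically in chronological order, so a string range
# predicate selects the right partitions.
january = pq.read_table(
    "events",
    filters=[("dt", ">=", "2016-01-01"), ("dt", "<", "2016-02-01")],
)
print(january.num_rows)
```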
