I have seen some discussions of the Parquet storage format suggesting that sorting time series data on the time key before converting to Parquet improves range query efficiency via the min/max statistics kept for each column chunk (perhaps analogous to skip indexes?).
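To make my understanding concrete, here is a minimal sketch of what I mean, using pyarrow (the column names, file name, and row group size are placeholders for illustration, not anything Drill-specific):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical time series; "ts" and "value" are placeholder names.
df = pd.DataFrame({
    "ts": pd.date_range("2016-01-01", periods=1_000_000, freq="s"),
    "value": range(1_000_000),
})

# Sort on the time key first, so each row group (and therefore each
# column chunk) covers a narrow, non-overlapping time interval. The
# writer records min/max statistics per column chunk automatically;
# my assumption is that a reader can then skip chunks whose min/max
# range falls entirely outside a query's time predicate.
df = df.sort_values("ts")

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "events.parquet", row_group_size=128_000)
```

Whether Drill's Parquet reader actually uses those statistics for pruning is the part I am unsure about.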

Is this a recommended approach for data accessed via Drill?

In addition, for data stored in HDFS for Drill that grows at a regular rate and is mainly subject to time range queries, is it worthwhile to partition it by date into subdirectories?
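Concretely, the directory layout I have in mind is something like what this pyarrow sketch produces (the dataset root "events" and the derived "dt" column are hypothetical):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "ts": pd.date_range("2016-01-01", periods=100_000, freq="min"),
    "value": range(100_000),
})
# Derive a date string column to partition on (hypothetical scheme).
df["dt"] = df["ts"].dt.strftime("%Y-%m-%d")

table = pa.Table.from_pandas(df, preserve_index=False)

# One subdirectory is created per distinct partition value, e.g.
# events/dt=2016-01-01/..., events/dt=2016-01-02/..., which is the
# directory-per-date structure I am asking about.
pq.write_to_dataset(table, root_path="events", partition_cols=["dt"])
```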

For example, in PostgreSQL I might partition tables by month so that queries including the partition date column hit only the relevant partitions directly (with the extra benefit that space management does not touch all date ranges).

Segmenting the data into directories in HDFS would require clients to structure their queries accordingly, but would it reduce query time by limiting the scan range?
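As an example of the kind of pruning I mean, a reader that understands the directory scheme above could do something like this (again a pyarrow sketch; whether Drill performs the equivalent pruning on the directory structure is exactly what I am asking):

```python
import pyarrow.parquet as pq

# With the dt=... layout above, a filter on the partition column
# should cause only the matching subdirectories to be read instead
# of scanning the whole dataset. ISO date strings compare
# lexicographically in chronological order, so a string range
# predicate selects the right partitions.
january = pq.read_table(
    "events",
    filters=[("dt", ">=", "2016-01-01"), ("dt", "<", "2016-02-01")],
)
print(january.num_rows)
```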
