Sorting and partitioning for range scans?

2015-06-01 Thread Matt
I have seen some discussions of the Parquet storage format suggesting 
that sorting time series data on the time key prior to converting to 
Parquet will improve range query efficiency via the min/max statistics 
on column chunks - perhaps analogous to skip indexes?


Is this a recommended approach for data accessed via Drill?
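
For concreteness, something like this Drill CTAS is what I have in 
mind (the workspace, table, and column names below are made up):

    ALTER SESSION SET `store.format` = 'parquet';
    CREATE TABLE dfs.tmp.events_sorted AS
      SELECT *
      FROM dfs.raw.`events`
      ORDER BY event_ts;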

In addition, for data stored in HDFS for Drill that grows at a regular 
rate and is mainly subject to time range queries, is it worthwhile to 
partition it by date into subdirectories?
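
For example, a layout along these lines (paths are hypothetical):

    /data/events/2015/05/0_0_0.parquet
    /data/events/2015/06/0_0_0.parquet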


For example, in PostgreSQL I might partition tables by month so that 
queries including the partition date column hit only the relevant 
partitions (with the added benefit that space management does not have 
to touch all date ranges).
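
Under PostgreSQL 9.x that would be inheritance plus CHECK constraints, 
roughly (table and column names hypothetical):

    CREATE TABLE events (event_ts timestamptz, payload text);
    CREATE TABLE events_2015_06 (
      CHECK (event_ts >= DATE '2015-06-01'
         AND event_ts <  DATE '2015-07-01')
    ) INHERITS (events);
    -- With constraint_exclusion = partition (the default), a query like
    --   SELECT * FROM events WHERE event_ts >= DATE '2015-06-01';
    -- skips child tables whose CHECK range cannot match.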


Segmenting data into directories in HDFS would require clients to 
structure queries accordingly, but would there be benefit in reduced 
query time by limiting scan ranges?
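
That is, clients would point queries at the relevant subtree rather 
than the root, e.g. (paths hypothetical):

    -- scans only the June 2015 files:
    SELECT COUNT(*) FROM dfs.`/data/events/2015/06`;
    -- scans the entire dataset:
    SELECT COUNT(*) FROM dfs.`/data/events`;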


Re: Sorting and partitioning for range scans?

2015-06-01 Thread Paul Mogren
On 6/1/15, 12:14 PM, Matt bsg...@gmail.com wrote:


Segmenting data into directories in HDFS would require clients to
structure queries accordingly, but would there be benefit in reduced
query time by limiting scan ranges?

Yes. I am just a newbie user, but I have already seen that work with
localFS and S3, and I fully expect it will work for HDFS as well; I have
seen mention of such a strategy for HDFS outside the context of Drill.
Clients unaware of the layout can still query the root directory and
simply not get the benefit. I believe you could even define a view that
allows clients to apply WHERE clause filters against artificial date
columns that you map to the directory structure, thereby hiding the
structure from the client.
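
A sketch of the view idea, using Drill's implicit dir0/dir1 directory
columns (all names below are hypothetical):

    CREATE VIEW dfs.tmp.events_v AS
      SELECT CAST(dir0 AS INT) AS yr,
             CAST(dir1 AS INT) AS mo,
             t.*
      FROM dfs.`/data/events` t;

    -- Clients then filter on the artificial columns:
    --   SELECT * FROM dfs.tmp.events_v WHERE yr = 2015 AND mo = 6;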

HTH,
Paul