Re: Externally created Parquet files and partition pruning

2015-11-02 Thread Chris Mathews
@Jacques:
> Jinfeng hit the nail on the head. If you have Parquet files with single value columns (and have Parquet footer metadata stats), Drill will automatically leverage the partitioning with zero additional setup required.
So as it stands at the moment, does this mean Drill has to

Re: Externally created Parquet files and partition pruning

2015-10-27 Thread Jacques Nadeau
Sounds like a future optimization opportunity once someone has the hybrid issue and need.
--
Jacques Nadeau
CTO and Co-Founder, Dremio
On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni wrote:
> @Jacques,
> Steven probably could confirm whether my understanding of the code is correct or not. From

Re: Externally created Parquet files and partition pruning

2015-10-25 Thread Jinfeng Ni
@Jacques, Steven could probably confirm whether my understanding of the code is correct or not. From the code, it seems we enforce the check that only a column with a unique value across all the files is considered for pruning. I just tried two simple cases with TPC-H sample data. It seems

Re: Externally created Parquet files and partition pruning

2015-10-25 Thread Jacques Nadeau
Jinfeng hit the nail on the head. If you have Parquet files with single value columns (and have Parquet footer metadata stats), Drill will automatically leverage the partitioning with zero additional setup required. Jinfeng, based on what you said, it sounds as if we don't apply partitioning unles

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Chris Mathews
Thanks, guys, this is very helpful. I now need to go away and do some more research into this.
Cheers
-- Chris
Sent from my iPhone
> On 21 Oct 2015, at 21:32, Jinfeng Ni wrote:
> For each column in the parquet files, Drill will check column metadata and see if min == ma

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Jinfeng Ni
For each column in the Parquet files, Drill will check the column metadata and see if min == max across all Parquet files. If yes, that indicates this column has a unique value for all the files, and Drill will use that column as a partitioning column. The partitioning column could be a column specifie
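The min == max check described above can be illustrated with a small sketch. This is not Drill's code: it assumes one reading of the description, namely that a column qualifies when every file reports min == max for it. The footer statistics are modelled as plain dictionaries, and `single_value_columns`, `prune`, and the file names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for Parquet footer statistics:
# column name -> (min, max) as recorded for one file.
FileStats = Dict[str, Tuple[object, object]]


def single_value_columns(files: Iterable[FileStats]) -> List[str]:
    """Return the columns whose min equals max in every file."""
    candidates = None
    for stats in files:
        single = {col for col, (lo, hi) in stats.items() if lo == hi}
        candidates = single if candidates is None else candidates & single
    return sorted(candidates or [])


def prune(footers: Dict[str, FileStats], column: str, value: object) -> List[str]:
    """Keep only the files whose single value for `column` equals `value`."""
    return sorted(
        name for name, stats in footers.items() if stats[column] == (value, value)
    )


# Made-up statistics for two files (illustrative values only).
footers = {
    "part-0.parquet": {"region": ("EU", "EU"), "amount": (1, 90)},
    "part-1.parquet": {"region": ("US", "US"), "amount": (5, 5)},
}

partition_columns = single_value_columns(footers.values())
files_to_scan = prune(footers, "region", "EU")
```

Here `amount` is single-valued in one file but not the other, so only `region` qualifies, and a filter on `region = 'EU'` leaves a single file to scan.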

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread rahul challapalli
Chris, it's not sufficient just to specify which column is the partition column. The data should also be organized accordingly. Below is a high-level description of how partition pruning works with Parquet files. 1. Use CTAS with a PARTITION BY clause: here Drill creates a single (or more) file for

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Chris Mathews
We create a JSON format schema for the Parquet file using the Avro specification and use this schema when loading data. Is there anything special we have to do to flag a column as a partitioning column? Sorry, I don’t understand your answer. What do you mean by ‘discover the columns with a sing

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Mehant Baid
The information is stored in the footer of the Parquet files. Drill reads the metadata stored in the Parquet footer to discover the columns with a single value and treats them as partitioning columns.
Thanks
Mehant
On 10/21/15 11:52 AM, Chris Mathews wrote:
> Thanks Mehant; yes we di

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Chris Mathews
Thanks Mehant; yes we did look at doing this, but the advantage of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries. This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as t

Re: Externally created Parquet files and partition pruning

2015-10-21 Thread Mehant Baid
In addition to the auto partitioning done by CTAS, Drill also supports directory-based pruning. You could load data into different (nested) directories underneath the top-level table location and use the WHERE clause to get the pruning performance benefits. Following is a typical example Tab
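As a rough illustration of the directory-based pruning idea (not the example from the message, which is cut off in this archive, and not Drill's implementation), the sketch below maps each directory level under the table location to a `dir0`, `dir1`, ... column, the names Drill uses for directory levels, and keeps only the files whose directories match a filter. The paths and the helpers `dir_columns` and `prune_by_directory` are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def dir_columns(table_root: str, file_path: str) -> Dict[str, str]:
    """Map each directory level below table_root to dir0, dir1, ..."""
    rel = PurePosixPath(file_path).relative_to(table_root)
    # The last path component is the file name, not a directory level.
    return {f"dir{i}": part for i, part in enumerate(rel.parts[:-1])}


def prune_by_directory(
    table_root: str, file_paths: List[str], predicate: Dict[str, str]
) -> List[str]:
    """Keep files whose directory columns satisfy every equality filter."""
    kept = []
    for path in file_paths:
        cols = dir_columns(table_root, path)
        if all(cols.get(col) == value for col, value in predicate.items()):
            kept.append(path)
    return kept


# Hypothetical layout: <table root>/<year>/<month>/<file>
root = "/data/logs"
paths = [
    "/data/logs/2015/09/a.parquet",
    "/data/logs/2015/10/b.parquet",
    "/data/logs/2014/10/c.parquet",
]

october_2015 = prune_by_directory(root, paths, {"dir0": "2015", "dir1": "10"})
all_2015 = prune_by_directory(root, paths, {"dir0": "2015"})
```

A filter on `dir0` and `dir1` narrows the scan to one file without opening any Parquet footers, which is the benefit of organizing the data into nested directories.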

Externally created Parquet files and partition pruning

2015-10-21 Thread Chris Mathews
We have an existing ETL framework processing machine-generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill. Some questions: How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not