@Jacques:
>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup required.
So as it stands at the moment, does this mean Drill has to …
Sounds like a future optimization opportunity once someone hits the hybrid
case and needs it.
--
Jacques Nadeau
CTO and Co-Founder, Dremio
On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni wrote:
@Jacques,
Steven probably could confirm whether my understanding of the code is
correct. From the code, it seems we enforce a check that only a column
with a unique value across all the files is considered for pruning.
I just tried two simple cases with TPC-H sample data. It seems …
Jinfeng hit the nail on the head. If you have Parquet files with single
value columns (and have Parquet footer metadata stats), Drill will
automatically leverage the partitioning with zero additional setup
required.
Jinfeng, based on what you said, it sounds as if we don't apply
partitioning unless a column has a single value across all the files.
Thanks guys, this is very helpful.
I now need to go away and do some more research into this.
Cheers -- Chris
Sent from my iPhone
> On 21 Oct 2015, at 21:32, Jinfeng Ni wrote:
For each column in the parquet files, Drill will check the column metadata
and see if min == max across all parquet files. If yes, that indicates
the column has a single value in every file, and Drill will use it as a
partitioning column.
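The check Jinfeng describes can be sketched in a few lines. This is a hypothetical illustration, not Drill's actual code: given per-file column statistics of the kind a Parquet footer carries, a column qualifies as an auto-detected partitioning column only if min == max within every file.

```python
# Hypothetical sketch of the min == max footer check (not Drill's real code).
def auto_partition_columns(file_stats):
    """file_stats: list of {column: (min, max)} dicts, one per Parquet file.

    Returns the columns that are single-valued in every file, i.e. the
    columns Drill could treat as partitioning columns.
    """
    candidates = None
    for stats in file_stats:
        single_valued = {col for col, (lo, hi) in stats.items() if lo == hi}
        candidates = single_valued if candidates is None else candidates & single_valued
    return candidates or set()

# Two files, each holding exactly one value of `year`; `amount` varies.
files = [
    {"year": (2014, 2014), "amount": (1, 99)},
    {"year": (2015, 2015), "amount": (5, 42)},
]
print(auto_partition_columns(files))  # {'year'}
```

Note that the values of `year` differ between the files; what matters for the check is only that each individual file is single-valued in that column.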
The partitioning column could be a column specified in a CTAS PARTITION BY
clause, or any column that happens to have a single value in every file.
Chris,
It's not sufficient just to specify which column is the partition column;
the data must also be organized accordingly. Below is a high-level
description of how partition pruning works with parquet files:
1. Use CTAS with a PARTITION BY clause: here Drill creates one (or more)
files for each distinct value of the partition column.
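Step 1 can be emulated to show what ends up on disk. This is a hedged sketch: it writes JSON files as a stand-in for Parquet, but the layout idea is the same one described above; the rows for each distinct partition-key value go into their own file, so every file is single-valued in that column (which is exactly what the footer check detects).

```python
# Hypothetical emulation of CTAS ... PARTITION BY: one output file per
# distinct value of the partition column (JSON here as a Parquet stand-in).
import json
import os
import tempfile
from collections import defaultdict

def write_partitioned(rows, partition_col, out_dir):
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_col]].append(row)
    paths = []
    for value, group in groups.items():
        # Each file holds rows for exactly one partition-key value.
        path = os.path.join(out_dir, f"{partition_col}={value}.json")
        with open(path, "w") as f:
            json.dump(group, f)
        paths.append(path)
    return sorted(paths)

rows = [
    {"year": 2014, "amount": 10},
    {"year": 2015, "amount": 20},
    {"year": 2014, "amount": 30},
]
out = tempfile.mkdtemp()
print(write_partitioned(rows, "year", out))  # one file per distinct year
```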
We create a JSON format schema for the Parquet file using the Avro
specification and use this schema when loading data.
Is there anything special we have to do to flag a column as a partitioning
column?
Sorry, I don’t understand your answer. What do you mean by ‘discover the
columns with a single value’?
The information is stored in the footer of the parquet files. Drill
reads the metadata information stored in the parquet footer to discover
the columns with a single value and treats them as partitioning columns.
Thanks
Mehant
On 10/21/15 11:52 AM, Chris Mathews wrote:
Thanks Mehant; yes we did look at doing this, but the advantage of using the
new PARTITION BY feature is that the partitioned columns are automatically
detected during any subsequent queries. This is a major advantage as our
customers are using the Tableau BI tool, and knowing details such as t…
In addition to the auto-partitioning done by CTAS, Drill also supports
directory-based pruning. You could load data into different (nested)
directories underneath the top-level table location and use the 'where'
clause to get the pruning performance benefits. Following is a typical
example:
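The original example did not survive the archive, but the idea can be sketched as follows. This is a hypothetical illustration with assumed paths: data is laid out in nested directories (year/month, which Drill exposes as the implicit dir0/dir1 columns), and a filter on the directory level lets the scanner skip whole subtrees without opening any files in them.

```python
# Hypothetical sketch of directory-based pruning with an assumed layout:
#   <root>/<year>/<month>/data.parquet
import os
import tempfile

root = tempfile.mkdtemp()
for year in ("2014", "2015"):
    for month in ("01", "02"):
        d = os.path.join(root, year, month)
        os.makedirs(d)
        open(os.path.join(d, "data.parquet"), "w").close()  # empty placeholder

def scan(root, dir0=None):
    """Return files to read; a filter on the top directory level (Drill's
    implicit dir0 column) prunes whole subtrees before any file is opened."""
    files = []
    for year in sorted(os.listdir(root)):
        if dir0 is not None and year != dir0:
            continue  # pruned: this subtree is never visited
        for dirpath, _, names in os.walk(os.path.join(root, year)):
            files.extend(os.path.join(dirpath, n) for n in names)
    return files

print(len(scan(root)))               # 4 files without a filter
print(len(scan(root, dir0="2015")))  # 2 files after pruning on dir0
```

In Drill itself the equivalent would be a WHERE clause on dir0 (e.g. restricting to one year), which prunes the non-matching directories at planning time.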
We have an existing ETL framework processing machine generated data, which we
are updating to write Parquet files out directly to HDFS using
AvroParquetWriter for access by Drill.
Some questions:
How do we take advantage of Drill’s partition pruning capabilities with
PARTITION BY if we are not using CTAS to create the Parquet files?