Hi Paul,

Please find the details below for which I have tried Partition Pruning.
Structure of the directory is -> /<dir0>/<dir1>/<dir2>
The number of directories at dir1 level is 21. dir2 directories are
unevenly placed under dir1. Total number of directories of dir2 is 87.
There are 2 Parquet files in each leaf folder. There are 2 Parquet files
from one leaf folder in my query.

We are using Parquet files in HDFS storage.

The parquet files are generated from Python. Is it mandate to re-write the
parquet from other tools (here Python) to take the advantage of pruning ?

Thanks in advance,
Sreeparna Bhabani



On Tue, May 5, 2020 at 1:26 PM Paul Rogers <[email protected]> wrote:

> Hi Sreeparna,
>
>
> There are various reasons that planning might be slow. You mentioned you
> have a partitioned directory structure, which is a good approach. How many
> directories exist at each level? How many files in the leaf folders? How
> many of those folders are included in your query?
>
>
> If the number is large, then the delay may be due to the fact that Drill
> must walk the tree to identify which files to include in the query.
>
>
> Also, which file system are you using? HDFS? S3? Each has different
> characteristics when doing directory operations. (S3 has no actual
> directories, for example.)
>
>
> Please provide the additional information so we can identify the source of
> the issue.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Monday, May 4, 2020, 9:01:08 AM PDT, sreeparna bhabani <
> [email protected]> wrote:
>
>
> Hi Team,
>
> Kindly check the below query regarding the partition pruning. We are using
> the partition pruning for our current project in Apache Drill and have some
> questions. Please find the below details of the scenario-
>
> File Type-
> Parquet generated from Python
>
> Folder structure in hdfs-
> /<root_folder>/<dir0>/<dir1>/<dir2>
>
> Query used to select data under <dir2>-
> To take advantage of partition pruning
> select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 =
> <dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..;
>
> Observation-
> Although the execution is fast, the time taken for planning is quite high.
> I didn't see VALUES operator in the physical plan of the query, rather
> there was SCAN operator.
> How can we ensure that the selected data is partition pruned here ?
> As an alternative, I modified the query to bring down the planning time of
> it and included the sub-directories in the root directory. The modified
> query is-
> select column1, column2, ... from
> dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>`  where <filter> = ..;
>
> Can you please tell me why the planning time is so high for the first
> query? How can we take advantage of partition pruning from it ? Or should
> we include sub-directories in the root directory ?
>
> Thanks in advance.
>
> *Sreeparna Bhabani*
>


-- 

Thanks n Regards,
*Sreeparna Bhabani*

Reply via email to