Hi Paul, Please find the details below for which I have tried Partition Pruning. Structure of the directory is -> /<dir0>/<dir1>/<dir2> The number of directories at dir1 level is 21. dir2 directories are unevenly placed under dir1. Total number of directories of dir2 is 87. There are 2 Parquet files in each leaf folder. There are 2 Parquet files from one leaf folder in my query.
We are using Parquet files in HDFS storage. The parquet files are generated from Python. Is it mandate to re-write the parquet from other tools (here Python) to take the advantage of pruning ? Thanks in advance, Sreeparna Bhabani On Tue, May 5, 2020 at 1:26 PM Paul Rogers <[email protected]> wrote: > Hi Sreeparna, > > > There are various reasons that planning might be slow. You mentioned you > have a partitioned directory structure, which is a good approach. How many > directories exist at each level? How many files in the leaf folders? How > many of those folders are included in your query? > > > If the number is large, then the delay may be due to the fact that Drill > must walk the tree to identify which files to include in the query. > > > Also, which file system are you using? HDFS? S3? Each has different > characteristics when doing directory operations. (S3 has no actual > directories, for example.) > > > Please provide the additional information so we can identify the source of > the issue. > > > Thanks, > > - Paul > > > > On Monday, May 4, 2020, 9:01:08 AM PDT, sreeparna bhabani < > [email protected]> wrote: > > > Hi Team, > > Kindly check the below query regarding the partition pruning. We are using > the partition pruning for our current project in Apache Drill and have some > questions. Please find the below details of the scenario- > > File Type- > Parquet generated from Python > > Folder structure in hdfs- > /<root_folder>/<dir0>/<dir1>/<dir2> > > Query used to select data under <dir2>- > To take advantage of partition pruning > select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 = > <dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..; > > Observation- > Although the execution is fast, the time taken for planning is quite high. > I didn't see VALUES operator in the physical plan of the query, rather > there was SCAN operator. > How can we ensure that the selected data is partition pruned here ? > As an alternative, I modified the query to bring down the planning time of > it and included the sub-directories in the root directory. The modified > query is- > select column1, column2, ... from > dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>` where <filter> = ..; > > Can you please tell me why the planning time is so high for the first > query? How can we take advantage of partition pruning from it ? Or should > we include sub-directories in the root directory ? > > Thanks in advance. > > *Sreeparna Bhabani* > -- Thanks n Regards, *Sreeparna Bhabani*
