[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241661#comment-15241661 ]
Khurram Faraaz commented on DRILL-4589:
---------------------------------------

The following tests will be executed to verify this change.

{noformat}
There are 25 directories (1990 THROUGH 2015), each directory has 4 subdirectories (Q1, Q2, Q3 and Q4), and each of those subdirectories has 2000 parquet files (each being ~2KB in size).

REFRESH TABLE METADATA `DRILL_4589` will be executed over the root directory, and tests similar to those listed below (and more) will be executed.

explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NOT NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 53;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 <= 97;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%' AND c7 = '1958-04-24';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 IN (...)
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND LENGTH(c5) >= 1 AND LENGTH(c5) <= 172;
{noformat}

> Reduce planning time for file system partition pruning by reducing filter evaluation overhead
> ---------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4589
>                 URL: https://issues.apache.org/jira/browse/DRILL-4589
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files
> organized into multi-level directories, users typically provide a
> partition filter like: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that the query planning time could be unacceptably long,
> due to three main overheads: 1) expanding and getting the list of files, 2)
> evaluating the partition filter, 3) getting the metadata, in the case of parquet
> files for which a metadata cache file is not available.
> DRILL-2517 targets the third overhead. As follow-up work after
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, the
> partition filter evaluation is applied at the file level. In many cases, we saw
> that the number of leaf subdirectories is significantly lower than the number of
> files. Since all the files under the same leaf subdirectory share the same
> directory metadata, we should apply the filter evaluation at the leaf
> subdirectory level. By doing that, we could reduce the CPU overhead of evaluating the
> filter, and the memory overhead as well.
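
The idea in the issue description, evaluating the partition filter once per leaf subdirectory instead of once per file, can be illustrated with a small sketch. This is not Drill's implementation; the function names (`dir_values`, `prune_by_leaf_directory`), the predicate signature, and the example paths are all hypothetical and exist only to show why the number of filter evaluations drops from the file count to the leaf-directory count.

```python
import posixpath
from collections import defaultdict


def dir_values(root, directory):
    """Return the directory levels below root, i.e. [dir0, dir1, ...] (hypothetical helper)."""
    rel = posixpath.relpath(directory, root)
    return [] if rel == "." else rel.split("/")


def prune_by_leaf_directory(root, files, partition_filter):
    """Evaluate partition_filter once per leaf directory rather than once per file.

    Returns the selected files and the number of filter evaluations performed.
    """
    # Group files by their parent directory; every file in a group shares
    # the same dir0, dir1, ... values.
    by_dir = defaultdict(list)
    for path in files:
        by_dir[posixpath.dirname(path)].append(path)

    selected = []
    evaluations = 0
    for directory, members in by_dir.items():
        evaluations += 1
        if partition_filter(dir_values(root, directory)):
            selected.extend(members)
    return selected, evaluations


# Hypothetical layout: 2 years x 2 quarters x 3 files each.
root = "/data/DRILL_4589"
files = [
    f"{root}/{year}/{quarter}/{i}.parquet"
    for year in ("2000", "2001")
    for quarter in ("Q1", "Q2")
    for i in range(3)
]

# Stand-in for: dir0='2000' AND dir1='Q2'
selected, evaluations = prune_by_leaf_directory(
    root, files, lambda dirs: dirs == ["2000", "Q2"]
)
print(len(files), "files,", evaluations, "filter evaluations,", len(selected), "files selected")
```

In the test layout described in the comment above, this kind of grouping would mean one evaluation per leaf subdirectory instead of one per parquet file, so the saving grows with the number of files per directory.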