[ https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241661#comment-15241661 ]

Khurram Faraaz commented on DRILL-4589:
---------------------------------------

The following tests will be executed to verify this change.

{noformat}
There are 25 directories (1990 THROUGH 2015), each with 4 subdirectories
(Q1, Q2, Q3 and Q4), and each of those subdirectories holds 2000 parquet
files (each ~2KB in size).

REFRESH TABLE METADATA `DRILL_4589`
will be executed over the root directory, and tests similar to those listed
below (and more) will be executed.

explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NOT NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 IS NULL;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 53;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 <= 97;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 >= 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 <= 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c1 > 25 AND c1 < 135;
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 LIKE 'orb%' AND c7 = '1958-04-24';
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND c4 IN (...);
explain plan for select * from `DRILL_4589` WHERE dir0='2000' AND dir1='Q2' AND LENGTH(c5) >= 1 AND LENGTH(c5) <= 172;
{noformat}
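
To give a sense of the scale of this test layout, here is a rough Python sketch that enumerates the leaf subdirectories and counts the files. It is only an illustration: the function names are hypothetical, and the choice of 25 consecutive years starting at 1990 is an assumption (1990 through 2015 inclusive would be 26 values), so the year list is passed in as a parameter.

```python
from itertools import product


def leaf_directories(years, quarters=("Q1", "Q2", "Q3", "Q4")):
    """Return (dir0, dir1) pairs, one per leaf subdirectory."""
    return [(str(year), quarter) for year, quarter in product(years, quarters)]


def file_paths(root, leaves, files_per_leaf):
    """Yield hypothetical parquet file paths under each leaf subdirectory."""
    for dir0, dir1 in leaves:
        for i in range(files_per_leaf):
            yield f"{root}/{dir0}/{dir1}/{i}.parquet"


# Assumption: 25 consecutive years starting at 1990; adjust to the real data set.
leaves = leaf_directories(range(1990, 1990 + 25))
files_per_leaf = 2000
total_files = len(leaves) * files_per_leaf

print(len(leaves), total_files)  # 100 200000
```

Under those assumptions the table has 100 leaf subdirectories but 200,000 files, so a filter on dir0 and dir1 only has 100 distinct inputs.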

> Reduce planning time for file system partition pruning by reducing filter 
> evaluation overhead
> ---------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4589
>                 URL: https://issues.apache.org/jira/browse/DRILL-4589
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files 
> organized into multi-level directories, users typically provide a 
> partition filter like: dir0 = something and dir1 = something2 and ... 
> For such queries, we saw that query planning time could be unacceptably long 
> due to three main overheads: 1) expanding and getting the list of files, 2) 
> evaluating the partition filter, and 3) getting the metadata, in the case of 
> parquet files for which a metadata cache file is not available. 
> DRILL-2517 targets the third overhead. As follow-up work to 
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, 
> partition filter evaluation is applied at the file level. In many cases, we 
> saw that the number of leaf subdirectories is significantly lower than the 
> number of files. Since all the files under the same leaf subdirectory share 
> the same directory metadata, we should apply the filter evaluation at the 
> leaf subdirectory level. By doing that, we could reduce the CPU overhead of 
> evaluating the filter, and the memory overhead as well.
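
The idea in the description can be sketched in a few lines of Python. This is not Drill's planner code, just a minimal illustration under stated assumptions: the function names, the `/data/DRILL_4589` root, and the tiny file list are all hypothetical. It compares evaluating a dir0/dir1 predicate once per file with evaluating it once per leaf subdirectory.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_values(path, root):
    """Map a file path to its directory partition values (dir0, dir1, ...)."""
    rel = PurePosixPath(path).relative_to(root)
    return rel.parts[:-1]  # drop the file name, keep the directory levels


def prune_per_file(paths, root, predicate):
    """Baseline: evaluate the partition predicate once per file."""
    evaluations, kept = 0, []
    for p in paths:
        evaluations += 1
        if predicate(partition_values(p, root)):
            kept.append(p)
    return kept, evaluations


def prune_per_leaf_dir(paths, root, predicate):
    """Group files by leaf subdirectory; evaluate the predicate once per group."""
    groups = defaultdict(list)
    for p in paths:
        groups[partition_values(p, root)].append(p)
    evaluations, kept = 0, []
    for dirs, files in groups.items():
        evaluations += 1
        if predicate(dirs):
            kept.extend(files)
    return kept, evaluations


# Hypothetical miniature layout: 2 years x 2 quarters x 3 files.
root = "/data/DRILL_4589"
paths = [f"{root}/{y}/{q}/{i}.parquet"
         for y in (1999, 2000) for q in ("Q1", "Q2") for i in range(3)]


def wanted(dirs):
    """Stand-in for the filter dir0='2000' AND dir1='Q2'."""
    return dirs == ("2000", "Q2")


files_a, evals_a = prune_per_file(paths, root, wanted)
files_b, evals_b = prune_per_leaf_dir(paths, root, wanted)
print(evals_a, evals_b, files_a == files_b)  # 12 4 True
```

Both approaches keep the same files; only the number of predicate evaluations changes, from one per file to one per leaf subdirectory.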



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
