[
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jinfeng Ni resolved DRILL-4589.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.7.0
Fixed in commit: dbf4b15eda14f55462ff0872266bf61c13bdb1bc
> Reduce planning time for file system partition pruning by reducing filter
> evaluation overhead
> ---------------------------------------------------------------------------------------------
>
> Key: DRILL-4589
> URL: https://issues.apache.org/jira/browse/DRILL-4589
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
> Fix For: 1.7.0
>
>
> When Drill is used to query hundreds of thousands, or even millions, of files
> organized into multi-level directories, users typically provide a
> partition filter such as: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that query planning time could be unacceptably long,
> due to three main overheads: 1) expanding and getting the list of files, 2)
> evaluating the partition filter, and 3) getting the metadata, in the case of
> parquet files for which a metadata cache file is not available.
> DRILL-2517 targets the third overhead. As follow-up work to
> DRILL-2517, we plan to reduce the filter evaluation overhead. Currently,
> partition filter evaluation is applied at the file level. In many cases, we
> saw that the number of leaf subdirectories is significantly lower than the
> number of files. Since all the files under the same leaf subdirectory share
> the same directory metadata, we should apply the filter evaluation at the
> leaf subdirectory level. By doing that, we could reduce the CPU overhead of
> evaluating the filter, as well as the memory overhead.
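The idea in the description can be illustrated with a minimal sketch. This is
not Drill's implementation (Drill is written in Java, and the actual change is
in the commit referenced above); the function names, the example paths, and
the predicate below are hypothetical and exist only to show why evaluating
the filter once per leaf subdirectory needs fewer evaluations than evaluating
it once per file.

```python
import posixpath
from collections import defaultdict


def dir_values(directory, root):
    """Map a directory under `root` to {'dir0': ..., 'dir1': ...}.

    Hypothetical helper: mimics the dir0/dir1 naming from the description.
    """
    rel = posixpath.relpath(directory, root)
    parts = [] if rel == "." else rel.split("/")
    return {f"dir{i}": part for i, part in enumerate(parts)}


def prune_per_file(files, root, predicate):
    """File-level pruning: the filter is evaluated once for every file."""
    kept, evaluations = [], 0
    for path in files:
        evaluations += 1
        if predicate(dir_values(posixpath.dirname(path), root)):
            kept.append(path)
    return kept, evaluations


def prune_per_directory(files, root, predicate):
    """Directory-level pruning: the filter is evaluated once per leaf
    subdirectory, and the result is shared by all files in it."""
    by_dir = defaultdict(list)
    for path in files:
        by_dir[posixpath.dirname(path)].append(path)

    kept, evaluations = [], 0
    for directory, members in by_dir.items():
        evaluations += 1
        if predicate(dir_values(directory, root)):
            kept.extend(members)
    return kept, evaluations


if __name__ == "__main__":
    # Hypothetical layout: 2 years x 2 quarters x 3 files per leaf directory.
    files = [
        f"/data/{year}/{quarter}/part-{i}.parquet"
        for year in ("2015", "2016")
        for quarter in ("Q1", "Q2")
        for i in range(3)
    ]

    def predicate(d):
        # Stands in for: dir0 = '2016' and dir1 = 'Q1'
        return d.get("dir0") == "2016" and d.get("dir1") == "Q1"

    print(prune_per_file(files, "/data", predicate)[1])
    print(prune_per_directory(files, "/data", predicate)[1])
```

Both functions select the same files; only the number of filter evaluations
differs, and it scales with the number of leaf subdirectories rather than the
number of files.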
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)