[ 
https://issues.apache.org/jira/browse/DRILL-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15230778#comment-15230778
 ] 

ASF GitHub Bot commented on DRILL-4589:
---------------------------------------

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/468#discussion_r58921685
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/FileSystemPartitionDescriptor.java ---
    @@ -148,13 +139,41 @@ public String getName(int index) {
         return partitionLabel + index;
       }
     
    -  private String getBaseTableLocation() {
    +  protected String getBaseTableLocation() {
         final FormatSelection origSelection = (FormatSelection) table.getSelection();
         return origSelection.getSelection().selectionRoot;
       }
     
       @Override
       protected void createPartitionSublists() {
    +    final Collection<String> fileLocations = getFileLocations();
    +    List<PartitionLocation> locations = new LinkedList<>();
    +
    +    final String selectionRoot = getBaseTableLocation();
    +
    +    HashMap<List<String>, List<PartitionLocation>> dirToFileMap = new HashMap<>();
    --- End diff --
    
    Can you add a comment here with an example <key, value> pair?


> Reduce planning time for file system partition pruning by reducing filter 
> evaluation overhead
> ---------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4589
>                 URL: https://issues.apache.org/jira/browse/DRILL-4589
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When Drill is used to query hundreds of thousands, or even millions, of files 
> organized into multi-level directories, users typically provide a 
> partition filter such as: dir0 = something and dir1 = something2 and ...
> For such queries, we saw that the query planning time could be unacceptably 
> long, due to three main overheads: 1) expanding and getting the list of files, 
> 2) evaluating the partition filter, and 3) getting the metadata, in the case 
> of parquet files for which a metadata cache file is not available. 
> DRILL-2517 targets the third overhead. As follow-up work to 
> DRILL-2517, we plan to reduce the filter evaluation overhead. For now, 
> partition filter evaluation is applied at the file level. In many cases, we saw 
> that the number of leaf subdirectories is significantly lower than the number 
> of files. Since all the files under the same leaf subdirectory share the same 
> directory metadata, we should apply the filter evaluation at the leaf 
> subdirectory level. By doing that, we could reduce the CPU overhead of 
> evaluating the filter, and the memory overhead as well.
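
The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class name DirGroupingSketch and the helpers dirKey and groupByLeafDir are hypothetical, plain strings stand in for PartitionLocation, and every file path is assumed to start with the selection root. The key is the list of directory names below the root (the dir0, dir1, ... values) and the value is the list of files in that leaf directory, so a partition filter only needs to be evaluated once per key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: groups file paths by their leaf directory so that a
// partition filter can be evaluated once per directory instead of once per file.
public class DirGroupingSketch {

  // Returns the directory components of 'file' relative to 'selectionRoot',
  // e.g. root "/tbl" and file "/tbl/2015/Q1/a.parquet" give [2015, Q1].
  // Assumes 'file' starts with 'selectionRoot'.
  static List<String> dirKey(String selectionRoot, String file) {
    String relative = file.substring(selectionRoot.length());
    int lastSlash = relative.lastIndexOf('/');
    String dirPart = lastSlash <= 0 ? "" : relative.substring(0, lastSlash);
    List<String> key = new ArrayList<>();
    for (String part : dirPart.split("/")) {
      if (!part.isEmpty()) {
        key.add(part);
      }
    }
    return key;
  }

  // Builds a map from directory components to the files in that directory.
  // Example <key, value> pair:
  //   [2015, Q1] -> [/tbl/2015/Q1/a.parquet, /tbl/2015/Q1/b.parquet]
  static Map<List<String>, List<String>> groupByLeafDir(
      String selectionRoot, List<String> files) {
    Map<List<String>, List<String>> dirToFileMap = new LinkedHashMap<>();
    for (String file : files) {
      dirToFileMap
          .computeIfAbsent(dirKey(selectionRoot, file), k -> new ArrayList<>())
          .add(file);
    }
    return dirToFileMap;
  }
}
```

With three files under /tbl/2015/Q1 and /tbl/2015/Q2, for instance, the map holds two keys, so the filter would be evaluated twice rather than three times; the saving grows with the number of files per leaf directory.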



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
