[jira] [Commented] (DRILL-4786) Improve metadata cache performance for queries with multiple partitions

ASF GitHub Bot (JIRA) Tue, 26 Jul 2016 16:00:36 -0700

    [ https://issues.apache.org/jira/browse/DRILL-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15394724#comment-15394724 ]


ASF GitHub Bot commented on DRILL-4786:
---------------------------------------

Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/553#discussion_r72353863
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
    @@ -387,16 +378,35 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
           condition = condition.accept(reverseVisitor);
           pruneCondition = pruneCondition.accept(reverseVisitor);
     
    -      if (checkForSingle && isSinglePartition && !wasAllPartitionsPruned) {
    +      if (descriptor.supportsMetadataCachePruning() && !wasAllPartitionsPruned) {
             // if metadata cache file could potentially be used, then assign a proper cacheFileRoot
    -        String path = "";
    -        for (int j = 0; j <= maxIndex; j++) {
    -          path += "/" + spInfo[j];
    +        int index = -1;
    +        if (!matchBitSet.isEmpty()) {
    +          String path = "";
    +          index = matchBitSet.length() - 1;
    +
    +          for (int j = 0; j < matchBitSet.length(); j++) {
    +            if (!matchBitSet.get(j)) {
    +              // stop at the first index with no match and use the immediate
    +              // previous index
    +              index = j-1;
    +              break;
    +            }
    +          }
    +          for (int j=0; j <= index; j++) {
    +            path += "/" + spInfo[j];
    +          }
    +          cacheFileRoot = descriptor.getBaseTableLocation() + path;
    --- End diff --
    
    Actually, cacheFileRoot can be null, in which case ParquetGroupScan defaults to 
    using the selectionRoot as the location of the cache file. Here is where it 
    decides which one to use: 
    
https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L221
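
The fallback described in the comment above can be sketched in Java as follows. This is only an illustration, not the code in ParquetGroupScan: the class name CacheRootFallbackSketch and the method effectiveCacheRoot are hypothetical, and the two string parameters simply stand in for the cacheFileRoot and selectionRoot mentioned above.

```java
final class CacheRootFallbackSketch {

  // Hypothetical helper (not a Drill API): pick the directory whose
  // metadata cache file should be read. A null cacheFileRoot means the
  // planner did not narrow the location, so the selection root is used.
  static String effectiveCacheRoot(String cacheFileRoot, String selectionRoot) {
    return cacheFileRoot != null ? cacheFileRoot : selectionRoot;
  }

  public static void main(String[] args) {
    // No narrowed root was assigned: fall back to the selection root.
    System.out.println(effectiveCacheRoot(null, "/A"));
    // A narrowed root was assigned: use it instead.
    System.out.println(effectiveCacheRoot("/A/B", "/A"));
  }
}
```

Under this reading, leaving cacheFileRoot unassigned is safe, since the scan then reads the cache file from the selection root.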



> Improve metadata cache performance for queries with multiple partitions
> -----------------------------------------------------------------------
>
>                 Key: DRILL-4786
>                 URL: https://issues.apache.org/jira/browse/DRILL-4786
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Metadata, Query Planning & Optimization
>    Affects Versions: 1.7.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>
> Consider queries of the following type, run against Parquet data with 
> metadata caching enabled:
> {noformat}
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 IN ('1', '2', '3')
> {noformat}
> For such queries, Drill reads the metadata cache file from the top-level 
> directory 'A', which is inefficient because only the files in some 
> subdirectories of 'B' are of interest. DRILL-4530 improves the 
> performance of such queries when the leaf-level directory is a single 
> partition. Here, the IN list yields 3 subpartitions. We can 
> build upon the DRILL-4530 enhancement by at least reading the cache file from 
> the immediate parent level `/A/B` instead of the top level.
> The goal of this JIRA is to improve performance for such queries.
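
For the example query above, the intended cache file location can be derived roughly as in the following Java sketch. It follows the loop in the diff but is not the PruneScanRule code: the class and method names are hypothetical, java.util.BitSet.nextClearBit replaces the explicit loop, and it assumes that bit j of matchBitSet is set when directory level j is constrained to a single value and that spInfo[j] holds that value.

```java
import java.util.BitSet;

final class CacheRootPrefixSketch {

  // Hypothetical helper (not a Drill API): append to the base table location
  // the leading run of directory levels that matched a single value.
  // Returns null when no level matched, leaving the choice to the caller.
  static String cacheFileRoot(String baseTableLocation, BitSet matchBitSet, String[] spInfo) {
    if (matchBitSet.isEmpty()) {
      return null;
    }
    // Stop at the first level with no match and use the level before it.
    int lastIndex = Math.min(matchBitSet.nextClearBit(0), spInfo.length) - 1;
    StringBuilder path = new StringBuilder(baseTableLocation);
    for (int j = 0; j <= lastIndex; j++) {
      path.append('/').append(spInfo[j]);
    }
    return path.toString();
  }

  public static void main(String[] args) {
    // dir0 = 'B' is a single-value match; dir1 IN (...) is not.
    BitSet match = new BitSet();
    match.set(0);
    System.out.println(cacheFileRoot("/A", match, new String[] {"B", null}));
  }
}
```

With only dir0 matched, the sketch yields `/A/B`, the immediate parent of the three subpartitions selected by the IN list.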



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
