[jira] [Commented] (DRILL-4530) Improve metadata cache performance for queries with single partition

ASF GitHub Bot (JIRA) Fri, 17 Jun 2016 13:08:33 -0700

    [ https://issues.apache.org/jira/browse/DRILL-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336826#comment-15336826 ]


ASF GitHub Bot commented on DRILL-4530:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/519#discussion_r67567993
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/partition/PruneScanRule.java ---
    @@ -269,13 +283,54 @@ protected void doOnMatch(RelOptRuleCall call, Filter filterRel, Project projectR
             int recordCount = 0;
             int qualifiedCount = 0;
     
    -        // Inner loop: within each batch iterate over the PartitionLocations
    -        for(PartitionLocation part: partitions){
    -          if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1){
    -            newPartitions.add(part);
    -            qualifiedCount++;
    +        if (checkForSingle &&
    +            partitions.get(0).isCompositePartition() /* apply single partition check only for composite partitions */) {
    +          // Inner loop: within each batch iterate over the PartitionLocations
    +          for (PartitionLocation part : partitions) {
    +            assert part.isCompositePartition();
    +            if(!output.getAccessor().isNull(recordCount) && output.getAccessor().get(recordCount) == 1) {
    +              newPartitions.add(part);
    +              if (isSinglePartition) { // only need to do this if we are already single partition
    +                // compose the array of partition values for the directories that are referenced by filter:
    +                // e.g suppose the dir hierarchy is year/quarter/month and the query is:
    +                //     SELECT * FROM T WHERE dir0=2015 AND dir1 = 'Q1',
    +                // then for 2015/Q1/Feb, this will have ['2015', 'Q1', null]
    +                // Note that we are not using the PartitionLocation here but composing a different list because
    +                // we are only interested in the directory columns that are referenced in the filter condition. not
    +                // the SELECT list or other parts of the query.
    +                Pair<String[], Integer> p = composePartition(referencedDirsBitSet, partitionMap, vectors, recordCount);
    +                String[] parts = p.getLeft();
    +                int tmpIndex = p.getRight();
    +                if (spInfo == null) {
    +                  spInfo = parts;
    +                  maxIndex = tmpIndex;
    +                } else if (maxIndex != tmpIndex) {
    +                  isSinglePartition = false;
    +                  break;
    +                } else {
    +                  // we only want to compare until the maxIndex inclusive since subsequent values would be null
    +                  for (int j = 0; j <= maxIndex; j++) {
    +                    if (spInfo[j] == null // prefixes should be non-null
    --- End diff --
    
    From lines 305-306, spInfo and maxIndex are in sync. Why would we have
    spInfo[j] == null when j <= maxIndex? I thought maxIndex is obtained such
    that every element of spInfo up to maxIndex is non-null.
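
One scenario in which the null check could matter, sketched under an assumption: if composePartition fills only the slots for the directory columns that the filter references, then a filter on dir0 and dir2 that skips dir1 would leave a null at index 1 while maxIndex is 2. The class and method names below are hypothetical and do not reproduce the code in PruneScanRule; whether the real method behaves this way is exactly what the question above asks.

```java
import java.util.BitSet;

public class SinglePartitionSketch {

  /** Hypothetical holder: composed partition values plus the highest populated index. */
  static final class Composed {
    final String[] parts;
    final int maxIndex;

    Composed(String[] parts, int maxIndex) {
      this.parts = parts;
      this.maxIndex = maxIndex;
    }
  }

  /**
   * Hypothetical stand-in for composePartition (an assumption, not Drill's code):
   * copies only the directory values whose dirN column is referenced by the filter.
   */
  static Composed compose(BitSet referencedDirs, String[] dirValues) {
    String[] parts = new String[dirValues.length];
    int maxIndex = -1;
    for (int i = referencedDirs.nextSetBit(0);
         i >= 0 && i < dirValues.length;
         i = referencedDirs.nextSetBit(i + 1)) {
      if (dirValues[i] != null) {
        parts[i] = dirValues[i];
        maxIndex = Math.max(maxIndex, i);
      }
    }
    return new Composed(parts, maxIndex);
  }

  /** Returns true if any slot at or below maxIndex is null, i.e. the prefix has a gap. */
  static boolean hasNullPrefix(Composed c) {
    for (int j = 0; j <= c.maxIndex; j++) {
      if (c.parts[j] == null) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String[] dirs = {"2015", "Q1", "Feb"};

    BitSet contiguous = new BitSet();   // WHERE dir0 = ... AND dir1 = ...
    contiguous.set(0);
    contiguous.set(1);

    BitSet gapped = new BitSet();       // WHERE dir0 = ... AND dir2 = ...
    gapped.set(0);
    gapped.set(2);

    System.out.println(hasNullPrefix(compose(contiguous, dirs)));
    System.out.println(hasNullPrefix(compose(gapped, dirs)));
  }
}
```

Under this assumption, the contiguous filter yields a fully populated prefix, while the gapped filter produces a null below maxIndex, which a check like spInfo[j] == null would catch.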


> Improve metadata cache performance for queries with single partition 
> ---------------------------------------------------------------------
>
>                 Key: DRILL-4530
>                 URL: https://issues.apache.org/jira/browse/DRILL-4530
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Query Planning & Optimization
>    Affects Versions: 1.6.0
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>             Fix For: 1.7.0
>
>
> Consider two types of queries which are run with Parquet metadata caching: 
> {noformat}
> query 1:
> SELECT col FROM  `A/B/C`;
> query 2:
> SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C';
> {noformat}
> For a certain dataset, the query1 elapsed time is 1 sec whereas query2 
> elapsed time is 9 sec even though both are accessing the same amount of data. 
>  The user expectation is that they should perform roughly the same.  The main 
> difference comes from reading the bigger metadata cache file at the root 
> level 'A' for query2 and then applying the partitioning filter.  query1 reads 
> a much smaller metadata cache file at the subdirectory level. 
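
The gap between the two queries suggests one possible approach, sketched here as an assumption and not as the change made in the pull request: when the filter pins down a single partition, its directory values could be joined onto the root path so that the smaller metadata cache file under the subdirectory is read instead of the one at the root. The class and method names are hypothetical.

```java
public class CachePathSketch {

  /**
   * Hypothetical helper: appends the leading non-null partition values to the
   * root path, stopping at the first null. For root "A" and values
   * ["B", "C", null] this builds "A/B/C".
   */
  static String singlePartitionPath(String root, String[] partitionValues) {
    StringBuilder path = new StringBuilder(root);
    for (String value : partitionValues) {
      if (value == null) {
        break; // deeper directory levels are not constrained by the filter
      }
      path.append('/').append(value);
    }
    return path.toString();
  }

  public static void main(String[] args) {
    // Mirrors query 2: SELECT col FROM `A` WHERE dir0 = 'B' AND dir1 = 'C'
    System.out.println(singlePartitionPath("A", new String[] {"B", "C", null}));
  }
}
```

With such a path in hand, query 2 could in principle be planned against the same subdirectory that query 1 names directly.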



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
