[jira] [Commented] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query

ASF GitHub Bot (JIRA) Wed, 10 Feb 2016 06:43:35 -0800

    [ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140894#comment-15140894 ]


ASF GitHub Bot commented on DRILL-4363:
---------------------------------------

Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/371#discussion_r52465362
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
    @@ -0,0 +1,107 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.drill.exec.planner.logical;
    +
    +import com.google.common.collect.ImmutableList;
    +import org.apache.calcite.plan.RelOptRule;
    +import org.apache.calcite.plan.RelOptRuleCall;
    +import org.apache.calcite.plan.RelOptRuleOperand;
    +import org.apache.calcite.rel.RelNode;
    +import org.apache.calcite.util.Pair;
    +import org.apache.drill.exec.physical.base.GroupScan;
    +import org.apache.drill.exec.planner.logical.partition.PruneScanRule;
    +import org.apache.drill.exec.store.parquet.ParquetGroupScan;
    +
    +import java.io.IOException;
    +import java.util.concurrent.TimeUnit;
    +
    +public abstract class DrillPushLimitToScanRule extends RelOptRule {
    +  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillPushLimitToScanRule.class);
    +
    +  private DrillPushLimitToScanRule(RelOptRuleOperand operand, String description) {
    +    super(operand, description);
    +  }
    +
    +  public static DrillPushLimitToScanRule LIMIT_ON_SCAN = new DrillPushLimitToScanRule(
    +      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.any(DrillScanRel.class)), "DrillPushLimitToScanRule_LimitOnScan") {
    +    @Override
    +    public boolean matches(RelOptRuleCall call) {
    +      DrillScanRel scanRel = call.rel(1);
    +      return scanRel.getGroupScan() instanceof ParquetGroupScan; // It only applies to Parquet.
    +    }
    +
    +    @Override
    +    public void onMatch(RelOptRuleCall call) {
    +        DrillLimitRel limitRel = call.rel(0);
    +        DrillScanRel scanRel = call.rel(1);
    +        doOnMatch(call, limitRel, scanRel, null);
    +    }
    +  };
    +
    +  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT = new DrillPushLimitToScanRule(
    +      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.some(DrillProjectRel.class, RelOptHelper.any(DrillScanRel.class))), "DrillPushLimitToScanRule_LimitOnProject") {
    +    @Override
    +    public boolean matches(RelOptRuleCall call) {
    +      DrillScanRel scanRel = call.rel(2);
    +      return scanRel.getGroupScan() instanceof ParquetGroupScan; // It only applies to Parquet.
    +    }
    +
    +    @Override
    +    public void onMatch(RelOptRuleCall call) {
    +      DrillLimitRel limitRel = call.rel(0);
    +      DrillProjectRel projectRel = call.rel(1);
    +      DrillScanRel scanRel = call.rel(2);
    +      doOnMatch(call, limitRel, scanRel, projectRel);
    +    }
    +  };
    +
    +
    +  protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel, DrillScanRel scanRel, DrillProjectRel projectRel){
    +    try {
    +      final int rowCountRequested = (int) limitRel.getRows();
    +
    +      final Pair<GroupScan, Boolean>  newGroupScanPair = ParquetGroupScan.filterParquetScanByLimit((ParquetGroupScan)(scanRel.getGroupScan()), rowCountRequested);
    --- End diff --
    
    How about:
    
    boolean applyLimit(int maxRecords)
    
    Returns whether the limit was applied. The default implementation in
    AbstractGroupScan would return false.
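
One possible reading of the suggestion above, as a minimal sketch: the base scan class declares `applyLimit(int maxRecords)` and returns false by default, and only scans that can prune by row count override it. The classes `SketchGroupScan`, `SketchParquetScan` and `SketchTextScan` are hypothetical stand-ins invented for this illustration; they are not Drill's `AbstractGroupScan` or `ParquetGroupScan`, and the real signature in Drill may differ.

```java
// Hypothetical stand-in for a base group scan class (not Drill's actual class).
abstract class SketchGroupScan {
  /**
   * Asks the scan to restrict itself to at most maxRecords rows.
   * Returns whether the limit was applied. The default is false,
   * meaning this kind of scan does not support limit pushdown.
   */
  boolean applyLimit(int maxRecords) {
    return false;
  }
}

// Hypothetical scan that knows its row count and can therefore prune.
class SketchParquetScan extends SketchGroupScan {
  private long rowsToRead;

  SketchParquetScan(long totalRows) {
    this.rowsToRead = totalRows;
  }

  @Override
  boolean applyLimit(int maxRecords) {
    // Nothing to prune when the limit is invalid or already covers every row.
    if (maxRecords < 0 || maxRecords >= rowsToRead) {
      return false;
    }
    rowsToRead = maxRecords;
    return true;
  }

  long rowsToRead() {
    return rowsToRead;
  }
}

// Hypothetical scan with no row count information; it inherits the default.
class SketchTextScan extends SketchGroupScan {
}
```

With an API of this shape, a planner rule could call `applyLimit` on any group scan and act only when it returns true, rather than checking `instanceof ParquetGroupScan` in `matches`.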


> Apply row count based pruning for parquet table in LIMIT n query
> ----------------------------------------------------------------
>
>                 Key: DRILL-4363
>                 URL: https://issues.apache.org/jira/browse/DRILL-4363
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jinfeng Ni
>            Assignee: Aman Sinha
>             Fix For: 1.6.0
>
>
> In interactive data exploration, a common and probably the first query that
> users run is "SELECT * FROM table LIMIT n", where n is a small number. Such
> a query gives the user an idea of the columns in the table.
> Normally, the user would expect such a query to complete in a very short
> time, since it asks for only a small number of rows, without any
> sort/aggregation.
> When the table is small, this is not a big problem for Drill. However, when
> the table is extremely large, Drill's response time is not as fast as the
> user would expect.
> In the case of a parquet table, it seems that the query planner could do a
> bit better by applying row count based pruning to such LIMIT n queries. The
> pruning is similar to what partition pruning does, except that it uses row
> counts instead of partition column values. Since the row count is available
> for a parquet table, such pruning is possible.
> The benefits of such pruning are clear: 1) for small "n", the pruning leaves
> a few parquet files to scan, instead of thousands or millions. 2) execution
> probably does not have to split the scan into multiple minor fragments and
> start reading the files concurrently, which causes big IO overhead. 3) the
> physical plan itself is much smaller, since it does not include the long
> list of parquet files; this reduces the RPC cost of sending the fragment
> plans to multiple drillbits and the overhead of serializing/deserializing
> the fragment plans.
>  
>  
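
The row count based pruning described in the issue can be sketched roughly as follows: walk the file list in order and keep files until their combined row count covers the requested limit. This is an illustration only, not the code in the pull request; `FileRowCount` and `RowCountPruner` are hypothetical names, and the sketch assumes each file's row count is already known from its metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical descriptor: a file path plus its row count taken from metadata.
final class FileRowCount {
  final String path;
  final long rowCount;

  FileRowCount(String path, long rowCount) {
    this.path = path;
    this.rowCount = rowCount;
  }
}

final class RowCountPruner {
  private RowCountPruner() {
  }

  /** Keeps files, in order, until their combined row count reaches the limit. */
  static List<FileRowCount> pruneByLimit(List<FileRowCount> files, long limit) {
    List<FileRowCount> kept = new ArrayList<>();
    long rowsSoFar = 0;
    for (FileRowCount file : files) {
      if (rowsSoFar >= limit) {
        break; // the files kept so far already hold enough rows
      }
      kept.add(file);
      rowsSoFar += file.rowCount;
    }
    return kept;
  }
}
```

For example, with files holding 3, 4 and 100 rows and a limit of 5, the sketch keeps only the first two files, so the scan never has to list or open the third.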



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
