[GitHub] [drill] paul-rogers commented on a change in pull request #1892: Drill-7437: Storage Plugin for Generic HTTP REST API

GitBox Fri, 29 Nov 2019 17:08:06 -0800

paul-rogers commented on a change in pull request #1892: Drill-7437: Storage 
Plugin for Generic HTTP REST API
URL: https://github.com/apache/drill/pull/1892#discussion_r352254853


 ##########
 File path: contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/HttpGroupScan.java
 ##########
 @@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.http;
+
+import java.util.List;
+import org.apache.drill.common.expression.SchemaPath;
+
+import org.apache.drill.exec.physical.base.AbstractGroupScan;
+import org.apache.drill.exec.physical.base.GroupScan;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.base.ScanStats;
+import org.apache.drill.exec.physical.base.ScanStats.GroupScanProperty;
+import org.apache.drill.exec.physical.base.SubScan;
+import org.apache.drill.exec.proto.CoordinationProtos.DrillbitEndpoint;
+import org.apache.drill.shaded.guava.com.google.common.base.Preconditions;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+public class HttpGroupScan extends AbstractGroupScan {
+  private static final Logger logger = LoggerFactory.getLogger(HttpGroupScan.class);
+
+  private final List<SchemaPath> columns;
+  private final HttpScanSpec httpScanSpec;
+  private final HttpStoragePluginConfig httpStoragePluginConfig;
+  private boolean filterPushedDown = true;
+
+  public HttpGroupScan (
+    HttpStoragePluginConfig config,
+    HttpScanSpec scanSpec,
+    List<SchemaPath> columns
+  ) {
+    super("no-user");
+    this.httpStoragePluginConfig = config;
+    this.httpScanSpec = scanSpec;
+    this.columns = columns == null || columns.size() == 0 ? ALL_COLUMNS : columns;
+  }
+
+  public HttpGroupScan(HttpGroupScan that) {
+    super(that);
+    httpStoragePluginConfig = that.getStorageConfig();
+    httpScanSpec = that.getScanSpec();
+    columns = that.getColumns();
+  }
+
+  @Override
+  public void applyAssignments(List<DrillbitEndpoint> endpoints) {
+    logger.debug("HttpGroupScan applyAssignments");
+  }
+
+  @Override
+  public int getMaxParallelizationWidth() {
+    return 1;
+  }
+
+
+  @Override
+  public boolean canPushdownProjects(List<SchemaPath> columns) {
+    return true;
+  }
+
+  @Override
+  public SubScan getSpecificScan(int minorFragmentId) {
+    logger.debug("HttpGroupScan getSpecificScan");
+    return new HttpSubScan(httpStoragePluginConfig, httpScanSpec, columns);
+  }
+
+  @Override
+  public GroupScan clone(List<SchemaPath> columns) {
+    logger.debug("HttpGroupScan clone {}", columns);
+    return new HttpGroupScan(this);
+  }
+
+  @Override
+  public String getDigest() {
+    return toString();
+  }
+
+  @Override
+  public List<SchemaPath> getColumns() {
+    return columns;
+  }
+
+  @Override
+  public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) {
+    Preconditions.checkArgument(children.isEmpty());
+    return new HttpGroupScan(this);
+  }
+
+  @Override
+  public ScanStats getScanStats() {
 
 Review comment:
   These stats are more important than they look. It seems that you are doing 
filter push down (though I see no members on this class to hold those filters). 
When you do, you must ensure that the version of your group scan with filters 
has a lower cost than the one without; otherwise Calcite won't actually push 
down the filters. (That is a VERY difficult bug to find!)
   
   Also, the estimated record count here is 1. This means Drill is free to 
"broadcast" the rows to all Drillbits if this table appears in a join. But if 
the API could return 1M rows, that will turn out to be a very bad choice (Drill 
should have done a partitioned hash join instead). So, choose a row count that 
is the maximum of what you'd expect. There is no harm in being high; there is 
harm in being low.
   
   Then, make two adjustments:
   
   1. Once columns are pushed down, reduce the estimated row width (from 200 to 
100, say). That will help Calcite realize that pushing projection is a good 
choice.
   2. Once filters are pushed down, reduce the estimated row count (by half, 
say). That again will tell Calcite to prefer the version with filter push down.
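
   A minimal sketch of the two adjustments above, written as standalone Java so 
it runs without Drill on the classpath. The class name `ScanCostSketch`, its 
methods, and the 1M-row ceiling are hypothetical, chosen only to illustrate the 
idea; the 200 and 100 widths and the halving come from the suggestions above. 
It is not the plugin's actual implementation.

```java
// Standalone sketch: no Drill classes are used, and every constant is illustrative.
final class ScanCostSketch {
  // Hypothetical upper bound on the rows the REST API might return.
  // Erring high is safe; erring low invites a broadcast join.
  static final double MAX_EXPECTED_ROWS = 1_000_000;

  // Row widths from the "from 200 to 100, say" suggestion.
  static final double FULL_ROW_WIDTH = 200;
  static final double PROJECTED_ROW_WIDTH = 100;

  // Halve the row count once filters have been pushed into the scan.
  static double estimateRowCount(boolean filtersPushed) {
    return filtersPushed ? MAX_EXPECTED_ROWS / 2 : MAX_EXPECTED_ROWS;
  }

  // Narrow the row once projection has been pushed into the scan.
  static double estimateRowWidth(boolean columnsPushed) {
    return columnsPushed ? PROJECTED_ROW_WIDTH : FULL_ROW_WIDTH;
  }

  // Rough data-volume cost: rows times row width.
  static double estimateCost(boolean columnsPushed, boolean filtersPushed) {
    return estimateRowCount(filtersPushed) * estimateRowWidth(columnsPushed);
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean columnsPushed : flags) {
      for (boolean filtersPushed : flags) {
        System.out.printf(
            "columnsPushed=%b filtersPushed=%b rows=%.0f cost=%.0f%n",
            columnsPushed,
            filtersPushed,
            estimateRowCount(filtersPushed),
            estimateCost(columnsPushed, filtersPushed));
      }
    }
  }
}
```

   In the plugin, values like these would feed the `ScanStats` returned by 
`getScanStats()`. The property that matters is the ordering: each push down 
must produce a strictly lower estimate than the scan without it.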

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
