[jira] [Commented] (DRILL-7121) TPCH 4 takes longer when Statistics is disabled.

ASF GitHub Bot (JIRA) Thu, 28 Mar 2019 08:20:06 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804029#comment-16804029
 ]


ASF GitHub Bot commented on DRILL-7121:
---------------------------------------

amansinha100 commented on pull request #1718: DRILL-7121: Use correct ndv when statistics is disabled
URL: https://github.com/apache/drill/pull/1718#discussion_r270054867
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdDistinctRowCount.java
 ##########
 @@ -75,8 +75,25 @@ public Double getDistinctRowCount(Join rel, RelMetadataQuery mq,
 
   @Override
   public Double getDistinctRowCount(RelNode rel, RelMetadataQuery mq, ImmutableBitSet groupKey, RexNode predicate) {
-    if (rel instanceof TableScan && !DrillRelOptUtil.guessRows(rel)) {
-      return getDistinctRowCount((TableScan) rel, mq, groupKey, predicate);
+    if (rel instanceof DrillScanRelBase) {
+      DrillTable table = rel.getTable().unwrap(DrillTable.class);
+      if (table == null) {
+        if (rel.getTable().unwrap(DrillTranslatableTable.class) != null) {
+          table = rel.getTable().unwrap(DrillTranslatableTable.class).getDrillTable();
+        }
+      }
+      if (table != null && table.getStatsTable() != null && !DrillRelOptUtil.guessRows(rel)) {
+        return getDistinctRowCount(((DrillScanRelBase)rel), mq, table, groupKey, rel.getRowType(), predicate);
+      } else {
+        // If guessing, return NDV as 0.1 * rowCount
+        /* If there is no table or metadata (stats) table associated with scan, estimate the
+         * distinct row count. Consistent with the estimation of Aggregate row count in
+         * RelMdRowCount: distinctRowCount = rowCount * 10%.
+         */
+        if (rel instanceof DrillScanRel) {
 
 Review comment:
   It would be good to add a comment here explaining why the earlier check is 
for DrillScanRelBase while this one is for DrillScanRel.
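
   For illustration only, the fallback described in the diff comment above 
(distinctRowCount = rowCount * 10% when no stats are available) could be 
sketched as below. This is a minimal standalone sketch, not the Drill 
implementation: the class name NdvFallbackSketch, the method 
estimateDistinctRowCount, the constant GUESSED_NDV_FRACTION and the nullable 
statsNdv parameter are all hypothetical and exist only for this example.

```java
public final class NdvFallbackSketch {

    // Hypothetical constant mirroring the 10% heuristic quoted in the diff comment.
    static final double GUESSED_NDV_FRACTION = 0.1;

    private NdvFallbackSketch() {
    }

    /**
     * Returns an estimated number of distinct values (NDV).
     *
     * @param rowCount estimated row count of the scan
     * @param statsNdv NDV taken from statistics, or null when no stats are
     *                 available (assumption: null stands in for "guessing")
     */
    static double estimateDistinctRowCount(double rowCount, Double statsNdv) {
        if (statsNdv != null) {
            // Statistics are available: use the recorded NDV as-is.
            return statsNdv;
        }
        // No statistics: fall back to the fixed-fraction guess.
        return rowCount * GUESSED_NDV_FRACTION;
    }

    public static void main(String[] args) {
        // Without stats the estimate is 10% of the row count.
        System.out.println(estimateDistinctRowCount(1_000_000.0, null));
        // With stats the recorded NDV is returned unchanged.
        System.out.println(estimateDistinctRowCount(1_000_000.0, 5.0));
    }
}
```

   The point of the sketch is only that the guessed NDV scales with the row 
count, whereas a stats-backed NDV does not.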
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> TPCH 4 takes longer when Statistics is disabled.
> ------------------------------------------------
>
>                 Key: DRILL-7121
>                 URL: https://issues.apache.org/jira/browse/DRILL-7121
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.16.0
>            Reporter: Robert Hou
>            Assignee: Gautam Parai
>            Priority: Blocker
>             Fix For: 1.16.0
>
>
> Here is TPCH 4 with sf 100:
> {noformat}
> select
>   o.o_orderpriority,
>   count(*) as order_count
> from
>   orders o
> where
>   o.o_orderdate >= date '1996-10-01'
>   and o.o_orderdate < date '1996-10-01' + interval '3' month
>   and 
>   exists (
>     select
>       *
>     from
>       lineitem l
>     where
>       l.l_orderkey = o.o_orderkey
>       and l.l_commitdate < l.l_receiptdate
>   )
> group by
>   o.o_orderpriority
> order by
>   o.o_orderpriority;
> {noformat}
> The plan changes when Statistics is disabled: a Hash Agg and a 
> Broadcast Exchange are added.  These two operators expand the number of 
> rows from the lineitem table from 137M to 9B, which forces the hash 
> join to use 6 GB of memory instead of 30 MB.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
