[ https://issues.apache.org/jira/browse/DRILL-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176587#comment-16176587 ]
ASF GitHub Bot commented on DRILL-1162: --------------------------------------- Github user amansinha100 commented on a diff in the pull request: https://github.com/apache/drill/pull/905#discussion_r140417433 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdMaxRowCount.java --- @@ -0,0 +1,71 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to you under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ +package org.apache.drill.exec.planner.cost; + +import org.apache.calcite.plan.volcano.RelSubset; +import org.apache.calcite.rel.SingleRel; +import org.apache.calcite.rel.core.TableScan; +import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider; +import org.apache.calcite.rel.metadata.RelMdMaxRowCount; +import org.apache.calcite.rel.metadata.RelMetadataProvider; +import org.apache.calcite.rel.metadata.RelMetadataQuery; +import org.apache.calcite.util.BuiltInMethod; +import org.apache.drill.exec.planner.physical.AbstractPrel; +import org.apache.drill.exec.planner.physical.ScanPrel; + +/** + * DrillRelMdMaxRowCount supplies a specific implementation of + * {@link RelMetadataQuery#getMaxRowCount} for Drill. + */ +public class DrillRelMdMaxRowCount extends RelMdMaxRowCount { + + private static final DrillRelMdMaxRowCount INSTANCE = new DrillRelMdMaxRowCount(); + + public static final RelMetadataProvider SOURCE = ReflectiveRelMetadataProvider.reflectiveSource(BuiltInMethod.MAX_ROW_COUNT.method, INSTANCE); + + public Double getMaxRowCount(ScanPrel rel, RelMetadataQuery mq) { + // the actual row count is known so returns its value + return rel.estimateRowCount(mq); --- End diff -- Returning 'estimated' row count means that this is just an estimate, not the actual value which could be higher. Looking at the implementation of estimatedRowCount() for several of the storage/format plugins, there are several that use NO_EXACT_ROW_COUNT. for instance see [1] for the text format plugin. So, I feel overloading getMaxRowCount() to return an estimate may cause problems. If you look at the semantics of getMaxRowCount in Calcite's RelMdMaxRowCount, it is only intended for cases where **_during planning time_** we can guarantee that the max row count will never exceed that value. For example, an Aggregate with no group-by clause or a LIMIT etc. [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyFormatPlugin.java#L186 > 25 way join ended up with OOM > ----------------------------- > > Key: DRILL-1162 > URL: https://issues.apache.org/jira/browse/DRILL-1162 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Flow, Query Planning & Optimization > Reporter: Rahul Challapalli > Assignee: Volodymyr Vysotskyi > Priority: Critical > Fix For: Future > > Attachments: error.log, oom_error.log > > > git.commit.id.abbrev=e5c2da0 > The below query results in 0 results being returned > {code:sql} > select count(*) from `lineitem1.parquet` a > inner join `part.parquet` j on a.l_partkey = j.p_partkey > inner join `orders.parquet` k on a.l_orderkey = k.o_orderkey > inner join `supplier.parquet` l on a.l_suppkey = l.s_suppkey > inner join `partsupp.parquet` m on j.p_partkey = m.ps_partkey and l.s_suppkey > = m.ps_suppkey > inner join `customer.parquet` n on k.o_custkey = n.c_custkey > inner join `lineitem2.parquet` b on a.l_orderkey = b.l_orderkey > inner join `lineitem2.parquet` c on a.l_partkey = c.l_partkey > inner join `lineitem2.parquet` d on a.l_suppkey = d.l_suppkey > inner join `lineitem2.parquet` e on a.l_extendedprice = e.l_extendedprice > inner join `lineitem2.parquet` f on a.l_comment = f.l_comment > inner join `lineitem2.parquet` g on a.l_shipdate = g.l_shipdate > inner join `lineitem2.parquet` h on a.l_commitdate = h.l_commitdate > inner join `lineitem2.parquet` i on a.l_receiptdate = i.l_receiptdate > inner join `lineitem2.parquet` o on a.l_receiptdate = o.l_receiptdate > inner join `lineitem2.parquet` p on a.l_receiptdate = p.l_receiptdate > inner join `lineitem2.parquet` q on a.l_receiptdate = q.l_receiptdate > inner join `lineitem2.parquet` r on a.l_receiptdate = r.l_receiptdate > inner join `lineitem2.parquet` s on a.l_receiptdate = s.l_receiptdate > inner join `lineitem2.parquet` t on a.l_receiptdate = t.l_receiptdate > inner join `lineitem2.parquet` u on a.l_receiptdate = u.l_receiptdate > inner join `lineitem2.parquet` v on a.l_receiptdate = v.l_receiptdate > inner join `lineitem2.parquet` w on a.l_receiptdate = w.l_receiptdate > inner join `lineitem2.parquet` x on a.l_receiptdate = x.l_receiptdate; > {code} > However when we remove the last 'inner join' and run the query it returns > '716372534'. Since the last inner join is similar to the one's before it, it > should match some records and return the data appropriately. > The logs indicated that it actually returned 0 results. Attached the log file. -- This message was sent by Atlassian JIRA (v6.4.14#64029)