[ https://issues.apache.org/jira/browse/IMPALA-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned IMPALA-8015: ----------------------------------- Assignee: (was: Paul Rogers) > Incorrect cardinality calculation for the generic case > ------------------------------------------------------ > > Key: IMPALA-8015 > URL: https://issues.apache.org/jira/browse/IMPALA-8015 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Priority: Major > > The expression used to calculate the cardinality for a M:N (“generic”) join > is incorrect. Scroll to the end for the proposed fix. > The standard calculation for joins is explained in Swami & Schiefer, [On the > Estimation of Join Result > Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf] > (S&S). Note especially section 3, Background. Also see [How Good Are Query > Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et > al. > h4. Current Implementation > The code uses the following: > {code:java} > long result = -1; > for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) { > double lhsAdjNdv = slots.lhsNdv(); > if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / > slots.lhsNumRows(); > double rhsAdjNdv = slots.rhsNdv(); > if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / > slots.rhsNumRows(); > long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, > rhsAdjNdv))) * > rhsCard); > if (result == -1) { > result = joinCard; > } else { > result = Math.min(result, joinCard); > } > } > {code} > The above is hard to follow, which is part of the problem. Basically, the > math for each key is: > {noformat} > |L.ki'| = min( |L.ki| * |L'| / |L|, |L.ki| ) > = |L.ki| * min( |L'| / |L|, 1 ) > {noformat} > This says that we reduce the left-hand key by the selectivity of the > left-hand filter, but we always have at least one key. The logic for the > right side is the same. > There is a subtle error: the compound key as a whole is reduced by the > selectivity of the filter. But, we can't reduce each key by that factor or we > end up reducing the compound key by the selectivity to the power of the key > count. This is not a bug here because of the wrong way we compute the overall > join cardinaly shown below. > The last line is correct as it reduces to: > {noformat} > |join| = |L'| * |R'| / max( |L.ki'|, |R.ki'| ) > {noformat} > Where: > * {{L}} is the relation on the left side of the join, {{R}} the right. > * {{L'}} is the result of applying scan filters to table {{L}}. Similarly for > {{R'}}. > * {{|T|}} is the cardinality (number of rows) in table {{T}}. > * {{L.k}} is the (possibly compound) key column for the left table. Similarly > for {{R.k}}. > * {{L.ki}} is the ith column within a compound key. > The code is also wrong in how it combines the keys to get a final answer. The > code produces correct answers for a single key, but incorrect answers for > compound keys. The code says that the final join cardinality is: > {noformat} > |join| = min( |L'| * |R'| / max( |L.ki'|, |R.ki'| ) ) > = |L'| * |R'| / max i=1 ...( max( |L.ki'|, |R.ki'| ) ) > {noformat} > That is, the join cardinality is given by the largest component of the > compound key. The result of this bug that the code estimate will use NDVs > that are too small, producing estimates which are too large, which throws off > join selection and could degrade query performance if it leads to an > inefficient plan. > h4. Proposed Solution > The correct math, worked out in IMPALA-8014, is: > {noformat} > ∏ |T.ki| > |T.k'| = |T'| * min( ----------, 1 ) > |T| > |L’| * |R’| > |L’ ⋈ R’| = ----------------------- > max( |L.k'|, |R.k'| ) > {noformat} > The above says that the key cardinality determined by the cardinality of the > input relation after filtering. This is either used as is (if the compound > key cardinality is greater than the table cardinality), or scaled by the > ratio of compound key cardinality to table cardinality. > Notice the subtle differences between the proposed and current solutions. > The solution can be improved. > * IMPALA-8018 observes that the resulting code is the same for the "FK/PK" > and "generic" cases and proposes to unify the two. > * IMPALA-8213 proposes a solution when expressions are correlated between the > left and right hand inputs. > * IMPALA-8218 says we should use a "simple urn model" to calculate the key > estimate, not the simple linear model used above and in the code. > * IMPALA-XXXX notes the complexity of computing the filtered table > cardinality {{T'}} on the left side when the left side is join. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org