[ 
https://issues.apache.org/jira/browse/IMPALA-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8015:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Incorrect cardinality calculation for the generic case
> ------------------------------------------------------
>
>                 Key: IMPALA-8015
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8015
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> The expression used to calculate the cardinality for a M:N (“generic”) join 
> is incorrect. Scroll to the end for the proposed fix.
> The standard calculation for joins is explained in Swami & Schiefer, [On the 
> Estimation of Join Result 
> Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
>  (S&S). Note especially section 3, Background. Also see [How Good Are Query 
> Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et 
> al.
> h4. Current Implementation
> The code uses the following:
> {code:java}
>     long result = -1;
>     for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) {
>       double lhsAdjNdv = slots.lhsNdv();
>       if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / 
> slots.lhsNumRows();
>       double rhsAdjNdv = slots.rhsNdv();
>       if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / 
> slots.rhsNumRows();
>       long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, 
> rhsAdjNdv))) *
>           rhsCard);
>       if (result == -1) {
>         result = joinCard;
>       } else {
>         result = Math.min(result, joinCard);
>       }
>   }
> {code}
> The above is hard to follow, which is part of the problem. Basically, the 
> math for each key is:
> {noformat}
> |L.ki'| = min( |L.ki| * |L'| / |L|, |L.ki| )
>         = |L.ki| * min( |L'| / |L|, 1 )
> {noformat}
> This says that we reduce the left-hand key by the selectivity of the 
> left-hand filter, but we always have at least one key. The logic for the 
> right side is the same.
> There is a subtle error: the compound key as a whole is reduced by the 
> selectivity of the filter. But, we can't reduce each key by that factor or we 
> end up reducing the compound key by the selectivity to the power of the key 
> count. This is not a bug here because of the wrong way we compute the overall 
> join cardinaly shown below.
> The last line is correct as it reduces to:
> {noformat}
> |join| = |L'| * |R'| / max( |L.ki'|, |R.ki'| )
> {noformat}
> Where:
> * {{L}} is the relation on the left side of the join, {{R}} the right.
> * {{L'}} is the result of applying scan filters to table {{L}}. Similarly for 
> {{R'}}.
> * {{|T|}} is the cardinality (number of rows) in table {{T}}.
> * {{L.k}} is the (possibly compound) key column for the left table. Similarly 
> for {{R.k}}.
> * {{L.ki}} is the ith column within a compound key.
> The code is also wrong in how it combines the keys to get a final answer. The 
> code produces correct answers for a single key, but incorrect answers for 
> compound keys. The code says that the final join cardinality is:
> {noformat}
> |join| = min( |L'| * |R'| / max( |L.ki'|, |R.ki'| ) )
>        = |L'| * |R'| / max i=1 ...( max( |L.ki'|, |R.ki'| ) )
> {noformat}
> That is, the join cardinality is given by the largest component of the 
> compound key. The result of this bug that the code estimate will use NDVs 
> that are too small, producing estimates which are too large, which throws off 
> join selection and could degrade query performance if it leads to an 
> inefficient plan.
> h4. Proposed Solution
> The correct math, worked out in IMPALA-8014, is:
> {noformat}
>                       ∏ |T.ki|
> |T.k'| = |T'| * min( ----------, 1 )
>                         |T|
>                   |L’| * |R’|
> |L’ ⋈ R’| = -----------------------
>              max( |L.k'|, |R.k'| )
> {noformat}
> The above says that the key cardinality determined by the cardinality of the 
> input relation after filtering. This is either used as is (if the compound 
> key cardinality is greater than the table cardinality), or scaled by the 
> ratio of compound key cardinality to table cardinality.
> Notice the subtle differences between the proposed and current solutions.
> The solution can be improved.
> * IMPALA-8018 observes that the resulting code is the same for the "FK/PK" 
> and "generic" cases and proposes to unify the two.
> * IMPALA-8213 proposes a solution when expressions are correlated between the 
> left and right hand inputs.
> * IMPALA-8218 says we should use a "simple urn model" to calculate the key 
> estimate, not the simple linear model used above and in the code.
> * IMPALA-XXXX notes the complexity of computing the filtered table 
> cardinality {{T'}} on the left side when the left side is join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to