[ https://issues.apache.org/jira/browse/IMPALA-8015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730053#comment-16730053 ]
Paul Rogers commented on IMPALA-8015: ------------------------------------- Here's an example, also from {{PlannerTest.testJoins()}}, that estimates a join incorrectly on the small side: 53 instead of 11K: {noformat} # hbase-hdfs join select * from functional.alltypesagg join functional_hbase.alltypessmall using (id, int_col) ---- PLAN PLAN-ROOT SINK | 02:HASH JOIN [INNER JOIN] | hash predicates: functional.alltypesagg.id = functional_hbase.alltypessmall.id, functional.alltypesagg.int_col = functional_hbase.alltypessmall.int_col | runtime filters: RF000 <- functional_hbase.alltypessmall.id, RF001 <- functional_hbase.alltypessmall.int_col | row-size=184B cardinality=53 | |--01:SCAN HBASE [functional_hbase.alltypessmall] | row-size=89B cardinality=50 | 00:SCAN HDFS [functional.alltypesagg] partitions=11/11 files=11 size=814.73KB row-size=95B cardinality=11.00K runtime filters: RF000 -> functional.alltypesagg.id, RF001 -> functional.alltypesagg.int_col {noformat} We read 11K rows on the probe side, joined with 53 rows on the build side. The id column is unique (a PK), adding the int_col makes it "more unique." So, each of the 11K probe rows joins with (at most) one row on the build side. The correct output cardinality is 11K, not 53. > Incorrect cardinality calculation for the generic case > ------------------------------------------------------ > > Key: IMPALA-8015 > URL: https://issues.apache.org/jira/browse/IMPALA-8015 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Assignee: Paul Rogers > Priority: Major > > Please see the background information in IMPALA-8014, including notation. The > material here is a version of that in [these > note|https://github.com/paul-rogers/impala/wiki/Planner]. > h4. Background > Impala uses two distinct ways to estimate join cardinality: FK/PK and the > "generic" case. Both are in {{JoinNode.getJoinCardinality()}}. > The generic case handles the M:N case, multiple rows in the left table join > with multiple rows in the right table. The {{getGenericJoinCardinality()}} > works out the estimated cardinality. > h4. Deriving the Cardinality > Assume a join of two tables, left ({{l}}) and right ({{r}}), with no > predicate. We have a Cartesian product: > {noformat} > |l >< r| = |l| * |r| > {noformat} > Suppose we have an equi-join predicate: {{l.a = r.b}} and that we have the > NDV values for both columns. > Lets assume the left is the probe side. How many rows will it match? It will > match all the rows with a given value, or {{|r 𝜎 b=x|}}, (which we'll > abbreviate to {{|r.b=x|}} which is: > {noformat} > |r.b=x| = |r| / ndv(r.b) > {noformat} > Intuitively, the {{r.b}} keys are uniformly distributed, each left-hand row > matches a group of right-hand rows the size of which is given by the equation > above. If all values in r are unique, then {{|r| / ndv(r.b)}} is one and we > match one row. If all values are the same, {{|r| / ndv(r.b) = |r|}}, so we > match all rows (or none). > The total number of output rows, to a first approximation, is: > {noformat} > |l >< r| = |l| * |r| / ndv(r.b) > {noformat} > Next, how many rows will appear on the left (probe) side? It is the output of > the previous join, or the scan of the table in that node, thus {{scan(l)}}. > How many rows on the right? Impala requires that the right side be a table, > so the number is {{|scan(r)|}}. (Actually, either side can be a table scan, > but the reasoning is symmetrical.) If we randomly sample the {{r}} table to > get {{scan(r)}}, then the probability of getting any one row from r is: > {noformat} > p(scan(r) : row = x) = |scan(r)| / |r| > {noformat} > (Read this as the probability or a row x appearing in the scan result of r.) > Let's check the math. Using the above, the total number of scanned rows is: > {noformat} > |scan(r)| = |r| * p(scan(r) : row = x) > = |r| * |scan(r)| / |r| > = |scan(r)| > {noformat} > The total number of keys in each group in the scanned result is the group > size adjusted for the probability that any one group member appears: > {noformat} > |scan(r.b)| = |r| / ndv(r.b) * p(r.b) > = |r| / ndv(r.b) * |scan(r)| / |r| > = |scan(r)| / ndv(r.b) > {noformat} > Intuitively, however may rows are scanned, they are still divided into the > same set of groups. > Next we can work out the total join cardinality given the number of rows on > the left, and the number of matching rows on the right: > {noformat} > |l >< r| = |scan(l)| * |scan(r.b)| > = |scan(l)| * |scan(r)| / ndv(r.b) > {noformat} > The above assumes that all rows from l match rows in r. But, if there are > more key values in l than r, some can't match, and visa-versa. What is the > probability that a row from l will find a match? We can reason it out like > this: > * If there are fewer values in l.a than in r.b, we can assume all will match. > * If there are more values in l.a than in r.b, we can assume we'll match all > values in l.a, then discard the extra values that don't match. > The details are already worked out in IMPALA-8014. The probability is: > {noformat} > p(match) = ndv(l.a) / ndv(r.b) if ndv(l.a) < ndv(r.b), > ndv(r.b) / ndv(l.a) otherwise. > {noformat} > Intuitively, if there are more l rows than r, then a match is the ratio > between them and visa-versa. > Let's check. If the ndv's are equal, the probability of a match is 1. If > either table is empty, the probability is 0. If ndv(l.a) is half that of > ndv(l.b) the probability is 0.5 and visa versa. All good. > Putting it all together: > {noformat} > |l >< r| = |scan(l)| * |scan(r)| / ndv(r.b) * p(match) > {noformat} > For the case that {{ndv(l.a) > ndv(r.b)}}: > {noformat} > |l >< r| = |scan(l)| * (|scan(r)| / ndv(r.b)) * (ndv(r.b) / ndv(l.a)) > = |scan(l)| * |scan(r)| / ndv(l.a) > = (|scan(l)| / ndv(l.a)) * |scan(r)| > {noformat} > Think if this as the inverse of the above: every row in r matches the set of > rows in l with the same key value. > h4. Code Bug > The code uses the following: > {code:java} > double lhsAdjNdv = slots.lhsNdv(); > if (slots.lhsNumRows() > lhsCard) lhsAdjNdv *= lhsCard / > slots.lhsNumRows(); > double rhsAdjNdv = slots.rhsNdv(); > if (slots.rhsNumRows() > rhsCard) rhsAdjNdv *= rhsCard / > slots.rhsNumRows(); > long joinCard = Math.round((lhsCard / Math.max(1, Math.max(lhsAdjNdv, > rhsAdjNdv))) * > rhsCard); > {code} > Where: > * {{lhsCard}} is the output cardinality of the l (previous join) table or > {{|l|}} > * {{rhsCard}} is the output cardinality of the r (scanned) table or > {{|scan(r)|}} > * {slots.lhsNdv()}} is the NDV of the left key or {{ndv(l.a)}} > * {slots.lnsNumRows()}} is the cardinality of the LHS table, or {{|l|}} > * {slots.rhsNdv()}} is the NDV of the right key or {{ndv(r.a)}} > * {slots.rnsNumRows()}} is the cardinality of the LHS table, or {{|r|}} > Translated to our notation: > {noformat} > adjNdv(l.a) = ndv(l.a) * ( |scan(l)| / |l| if |l| > |scan(l)|, 1 otherwise ) > adjNdv(r.b) = ndv(r.b) * ( |scan(r)| / |r| if |r| > |scan(r)|, 1 otherwise ) > |join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) ) > {noformat} > Since {{|scan\(x)| <= |x|}} we can reduce the above to: > {noformat} > adjNdv(l.a) = ndv(l.a) * |scan(l)| / |l| > adjNdv(r.b) = ndv(r.b) * |scan(r)| / |r| > |join| = |scan(l)| * |scan(r)| / max( adjNdv(l.a), adjNdv(r.b) ) > {noformat} > For one case: > {noformat} > |join| = |scan(l)| * |scan(r)| / adjNdv(l.a) > = |scan(l)| * |scan(r)| / (ndv(l.a) * |scan(l)| / |l|) > = |scan(l)| * |scan(r)| * |l| / (ndv(l.a) * |scan(l)|) > = |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a) > {noformat} > Comparing the two results > {noformat} > |l >< r| = |scan(l)| * |scan(r)| / ndv(l.a) > != |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a) > {noformat} > The only way we get the code's answer is if we assume {{|l| / |scan(r)| = 1}} > which seems unlikely. That is, the code applies an unnecessary multiplication > factor that increases as the ratio of scanned rows decreases. > So, a bug. > h4. Worked Example > Let's check. > * {{|l| = 10000}} > * {{|r| = 100}} > * {{scan(l) = 3000}} > * {{scan(r) = 50}} > * {{ndv(l.a) = 40}} > * {{ndv(r.a) = 20}} > Let's work this out intuitively first. The join is the normal cross join with > adjustments: > {noformat} > |l >< r| = |l| * |r| * adjustments > {noformat} > Adjustments: > * The left side scan returns only 30% of the total rows. > * The left hand has twice the key values of the right, so we have to discard > half of the left rows. > * Of the l keys with a match in r, it will match, on average, 100/20 = 5 rows. > * But half of r is filtered out, reducing the available matching rows. > And: > {noformat} > |l >< r| = (|l| / 2) * (|r| / 20 / 2) > = 10000 * 30% / 2 * 100 * / 20 / 2 > = 1500 * 2.5 > = 3750 > {noformat} > Using the formula derived above: > {noformat} > p(match) = ndv(r.b) / ndv(l.a) if ndv(r.b) < ndv(l.a), > ndv(l.a) / ndv(r.b) otherwise > |l >< r| = |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * p(match) > = |scan(l)| * |scan(r)| * (|r| / ndv(r.b)) * (ndv(r.b) / ndv(l.a)) > = |scan(l)| * |scan(r)| * |r| / ndv(l.a) > = 3000 * 50 * 100 / 40 > = 3750 > {noformat} > Which all works out. > Now let's check the code's formula: > {noformat} > |l >< r| = |scan(l)| * |scan(r)| * (|l| / |scan(l)|) / ndv(l.a) > = (|scan(l)| / |scan(l)|) * |scan(r)| * |l| / ndv(l.a) > = |scan(r)| * |l| / ndv(l.a) > = 50 * 10,000 / 50 > = 10,000 > {noformat} > Which does not quite work out, verifying the bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org