[ https://issues.apache.org/jira/browse/IMPALA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780096#comment-16780096 ]
Paul Rogers commented on IMPALA-8014: ------------------------------------- Another bug is that the PK/FK calculation assumes that |FK| = |PK|, which would be the case for a true M:1 relationship. However, many of our tests involve situations in which one table has unique keys, but the other is not a true join key (it may also have unique keys.) Consider: {noformat} 03:HASH JOIN [INNER JOIN] | hash predicates: a.id = b.id, a.int_col = b.int_col | row-size=178B cardinality=7.30K | |--01:SCAN HDFS [functional.alltypestiny b] | partitions=4/4 files=4 size=460B | row-size=89B cardinality=8 | 00:SCAN HDFS [functional.alltypes a] partitions=24/24 files=24 size=478.45KB row-size=89B cardinality=7.30K {noformat} Notice that the 03 join is estimated to produce 7300 rows. The join has a compound key and we assume that {{(a.id, a.int_col)}} is an FK, and must thus have the same cardinality as the master table. So, all 7300 detail rows should find a match. But, in reality, the {{alltypestiny}} has only 8 rows. NDV(a.id) is 7300. There are simply not enough rows to match all the rows from {{alltypes}}. The correct result appears after making the math correction: {noformat} |--03:HASH JOIN [INNER JOIN] | | hash predicates: a.id = b.id, a.int_col = b.int_col | | row-size=178B cardinality=8 | | | |--01:SCAN HDFS [functional.alltypestiny b] | | partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 00:SCAN HDFS [functional.alltypes a] | partitions=24/24 files=24 size=478.45KB | row-size=89B cardinality=7.30K {noformat} This version says that, if we have 7300 unique ids on one side, and 8 on the other, then an inner join can match at most 8 of those rows. The old estimate was off by two orders of magnitude. This particular issue appears to occur only for compound keys (those with two or more parts.) > Incorrect M:1 (FK/PK) cardinality estimation > -------------------------------------------- > > Key: IMPALA-8014 > URL: https://issues.apache.org/jira/browse/IMPALA-8014 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Assignee: Paul Rogers > Priority: Major > > The math is wrong when we estimate the cardinality of a join we've labeled as > "FK/PK" (commonly known as many-to-one or M:1.) > h4. Join Estimation > The standard calculation for joins is explained in Swami & Schiefer, [On the > Estimation of Join Result > Sizes|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf] > (S&S). Note especially section 3, Background. Also see [How Good Are Query > Optimizers, Really?|http://www.vldb.org/pvldb/vol9/p204-leis.pdf] by Leis et > al. > {noformat} > |L'| * |R'| > |L' ⋈ R'| = ------------------ > max(|L.k'|, |R.k'|) > {noformat} > Here: > * {{|x|}} is the row cardinality if x is a relation, the NDV (domain > cardinality) if x is a column. > * {{L}}, {{R}} are left and right input relations. > * {{L.k}} and {{R.k}} are the (possibly compound) join keys. > * {{x’}} is the result of applying a filter to relation or column {{x}}. > Intuitively, the cardinality of the join is the Cartesian product reduced by > the largest column domain. Since the relations fed into the join are > filtered, we are concerned with the new, filtered relations created by the > scan nodes. > Impala, like most planners, makes three basic assumptions (see the first > paper above): > * Uniformity: keys are evenly distributed. > * Independence: the filtering on the two tables is independent of keys. > * Containment: that all detail (FK) rows have a corresponding master (FK) row. > (These are traditional assumptions, but they turn out to be unrealistic in > many cases; see the second paper above.) > h4. M:1 Case > The planner (unnecessarily) divides join planning into two cases. > (IMPALA-8018 describes how the two steps are unnecessary. But, to minimize > code change, we live with this process here.) > * M:1 (AKA "FK/PK", "detail/master", "fact/dimension"). A "foreign key" (FK) > in the detail table matches at most one "primary key" (PK) in the master > table. ("At most one" because of filtering which may have removed master > keys.) Impala assumes that the detail ("FK") table is on the left (probe) > side, and the master ("PK") table is on the right ("build") side. > * M:N (AKA "generic", "many-to-many"). Every key on the left matches > potentially many keys on the right. > If we focus on just the M:1 case we can rename the relations for convenience > and observe a simplification: > {noformat} > L = D > R = M > L.k = M.pk > R.k = D.fk > {noformat} > By definition of M:1: > {noformat} > |M.pk'| = |M'| > {noformat} > Which gives a revised join expression: > {noformat} > |D'| * |M'| > |D' ⋈ M'| = ------------------ > max(|D.fk'|, |M'|) > {noformat} > The expression above is handy: it shows that we only need three values to > compute the join cardinality: > * {{|D'|}} the output cardinality from the left plan node, which is already > available. > * {{|M'|}} the output cardinality from the right scan node, which is already > available. (In Impala, the right node will always be a table scan.) > * {{|D.fk'|}} the number of foreign key values in the filtered {{D'}} > relation. (In relational theory terms, the cardinality of the domain of the > foreign key after applying pushed-down selection operations.) > See IMPALA-XXXX for a necessary adjustment to {{|M'|}} to handle predicates > common to both sides. > We only need to estimate {{|D.fk'|}}. We will work up to this step by step > because the intermediate steps help explain the bug in the current code. > h4. Filtering of Master Table Only > Let's start with the simplest case, filtering on only the master table: > {noformat} > |D.fk'| = |D.fk| <= |M.pk| > max(|D.fk'|, |M'|) = max(|M.pk|, |M'|) = |M.pk| = |M| > {noformat} > So: > {noformat} > |D ⋈ M'| = |D| * |M'| / |M| > {noformat} > Intuitively: the probability of any detail row finding a match is simply the > selectivity of the master filter, or {{|M'| / |M|}}. > h4. Filtering on the Foreign Key Column > The next case is also fairly easy. Suppose we know that the left (detail) > scan applied a filter. Impala presently uses an incorrect, but simplified, > model for this case. (See IMPALA-XXXX for the correct model.) > The cardinality of the foreign key is simply the result of applying that > filter: > {noformat} > |D.fk'| = |D.fk| * sel(f) > {noformat} > Let us assume we do not filter the master table in this case, so: > {noformat} > |D.fk'| < |D.fk| <= |M.pk| > max(|D.fk'|, |M|) = |M| > {noformat} > And: > {noformat} > |D' ⋈ M| = |D'| * |M| / |M| = |D'| > {noformat} > We assume "containment" from the S&S paper above: all the foreign keys have a > matching primary key if {{|F.fk| <= |M.pk|}}. So, if we filter only on the > detail table, all foreign keys find a match and the cardinality of the join > is just the cardinality of the left input relation. > h4. Combined Left and Right Filtering > Now, let's combine the master and detail filtering cases. In general, we will > have filtering on both sides. The {{max}} in the expression automatically > combines the cases: > {noformat} > |D.fk'| = |D.fk| * |D'| / |D| > |D'| * |M'| > |D' ⋈ M'| = ------------------ > max(|D.fk'|, |M'|) > {noformat} > Intuitively, the number of rows is the Cartesian product divided by the > larger of the number of keys in either table after the scan. Said another > way, each detail row finds a master, unless filtering has removed so many > masters that some detail rows find no match, in which case the probability of > a match is {{|M'| / |D.fk'|}}. > h4. Compound Primary Keys > The final complexity is to consider a compound key. That is: > {noformat} > (D.fk1, Dfk2) --> (M.pk1, M.pk2) > {noformat} > The foreign key pair points to a matching primary key pair. Here we consider > only pairs, but the logic is the same for a compound key with any number of > columns. Obviously, by the definition of a join in SQL, the number of keys on > each side must be the same. > If we know (from HMS metadata) that the above pairs are, in fact, the keys, > then we can make a simplifying assumption: > {noformat} > |(M.pk1, M.pk2)| = |M| > |(D.fk1, Dfk2)| <= |(M.pk1, M.pk2)| = |M| > {noformat} > If we don't know a-priori that a pair (k1, k1) is a foreign or primary key, > then we can estimate its cardinality (assuming independence of values) as: > {noformat} > |(k1, k2)| = |k1| * |k2| > {noformat} > In the M:1 case, the primary key cardinality must be the same as table > cardinality. If the two columns are completely independent, then: > {noformat} > |(M.pk1, M.pk2)| = |M.pk1| * |M.pk2| = |M| > {noformat} > More typically, there is some correlation between the columns so: > {noformat} > |(M.pk1, M.pk2)| = |M| <= |M.pk1| * |M.pk2| > {noformat} > Indeed, Impala uses the above relation to decide we have the FK/PK case. Said > another way, if Impala follows the FK/PK logic, then we can simply assume > that {{|M.pk| = |M|}} even if the key is compound. > h4. Compound Foreign Keys > Foreign keys require a bit more thought. We could have a detail table with > only one row. If we have many detail rows, the assumption of containment says > that each detail record points to some master record, so: > {noformat} > |D.fk| <= |M.pk| > {noformat} > This lets us estimate the cardinality of a compound foreign key as: > {noformat} > |(D.fk1, D.fk2)| = min( |D.fk1| * |D.fk2|, |M| ) > {noformat} > That is, if the detail table is small, the left term is a reasonable estimate > (ignoring the urn model issue). But, as the table gets larger, and begins to > include most primary keys, we know that the number of foreign keys can't be > larger than the number of primary keys, so the right term is the better > estimate. > h4. Filtering on Compound Keys > Suppose that the join inputs have filtering applied to the left (detail) > table. We discussed how to handle this fo a single column. For a compound > key, we can observe that the cardinality of the key is the product of the > cardinality of the columns, but (using the containment assumption), no larger > than the cardinality of the primary key (which is the cardinality of the > master table.) > So: > {noformat} > |D.fk| = min( ∏ |D.fki|, |M| ) > > |D.fk'| = |D.fk| * |D'| / |D| > = min( ∏ |D.fki|, |M| ) * |D'| / |D| > {noformat} > So the final expression for join cardinality is: > {noformat} > |D.fk'| = min( ∏ |D.fki|, |M| ) * |D'| / |D| > |D'| * |M'| > |D' ⋈ M'| = ------------------- > max( |D.fk'|, |M'|) > {noformat} > The first expression says that the compound foreign key cardinality is either > the product of the columns that make up the key, or the cardinality of the > primary key, whichever is less. We then adjust that amount (incorrectly) by > the percentage of the detail table that is scanned. > The second expression is just the Cartesian product divided by the largest > key cardinality: either foreign key or primary key (which is, by definition, > equal to the cardinality of the master table.) > h4. Complexity: Compound Joins > The above expression works well if the left input to a join is a base table > row which we know the original table cardinality that corresponds to the > original column NDV. The above can be used as-is in a M:N ("generic") join > for a right-side table. But, when when used on the left side, we must recall > that Impala builds left-deep join plans, so the left side may be a join. In > this case, there is no original left-side base table. > See IMPALA-XXXX for discussion of the complexities in this case. A quick and > dirty solution is to use the scan output cardinality in place of the left > input cardinality. That is: > * Determine the table that contains the join key. > * Search the left subtree for the scan for that table. > To do this, start with the left input: > * If the node is a scan node, and the scan is for the target table, return > the output cardinality of that scan. > * If the node is a join, then apply the search to *both* sides of the join. > This approach is cumbersome, and may run into complexities if the node is > something other than a scan or join. It also may underestimate if a filter is > applied at the join level. > A cleaner, tough more involved, solution is to track adjusted NDV for each > column through each operator as described in IMPALA-8220. > h4. Code Bug > The commit "IMPALA-5547: Rework FK/PK join detection", ID > [{{9f678a74269250bf5c7ae2c5e8afd93c5b3734de}}|https://github.com/apache/impala/commit/9f678a74269250bf5c7ae2c5e8afd93c5b3734de#diff-b10ccc2cdf68b236be400cde1e858a7c] > on Jun 6, 2017 reworked the FK/PK logic. It has one flaw: after determining > that we have an FK/PK (M:1) case, it then attempts to adjust the compound FK > and PK columns. This has two problems: > * It is unnecessary, as we saw above. > * The math is wrong and produces bogus estimates. > Here is the code in {{JoinNode.java}}: > {code:java} > long result = -1; > for (EqJoinConjunctScanSlots slots: eqJoinConjunctSlots) { > // Adjust the join selectivity based on the NDV ratio to avoid > underestimating > // the cardinality if the PK side has a higher NDV than the FK side. > double ndvRatio = 1.0; > if (slots.lhsNdv() > 0) ndvRatio = slots.rhsNdv() / slots.lhsNdv(); > double rhsSelectivity = Double.MIN_VALUE; > if (slots.rhsNumRows() > 0) rhsSelectivity = rhsCard / > slots.rhsNumRows(); > long joinCard = (long) Math.ceil(lhsCard * rhsSelectivity * ndvRatio); > if (result == -1) { > result = joinCard; > } else { > result = Math.min(result, joinCard); > } > } > // FK/PK join cardinality must be <= the lhs cardinality. > result = Math.min(result, lhsCard); > {code} > The above is hard to follow, which may account for why the bug was not > caught. Ignoring some corner cases, the logic is essentially: > {noformat} > lhsCard = |D'| > ndvRatioi = |D.fki| / |M.pki| > rhsSelectivityi = |M'| / |M| > joinCardi = lhsCard * ndvRatioi * rhsSelectivityi > = |D'| * (|D.fki| / |M.pki|) * |M'| / |M| > |join| = min(joinCardi) > = (|D'| * |M'| / |M|) * min(|D.fki| / |M.pki|) > {noformat} > Though hard to see, the above is not equivalent to the logic worked out in > the previous section. Using the correct expression from earlier sections: > {noformat} > |D'| * |M'| > |D' ⋈ M'| = ---------------------------------------------- > max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|) > {noformat} > We can factor out the common {{|D'| * |M'|}} terms and compare: > {noformat} > min(|D.fki| / |M.pki|) 1 > ---------------------- != ---------------------------------------------- > |M| max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), |M'|) > 1 1 > ---------------------------- != > ---------------------------------------------- > |M| * max(|M.pki| / |D.fki|) max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), > |M'|) > |M| * max(|M.pki| / |D.fki|) != max(min( ∏ |D.fki|, |M| ) * |D'| / |D|), > |M'|) > {noformat} > There is no valid operation that can convert one side to the other, so they > are unequal. > It is likely that the code's version attempts to work around issues elsewhere > in the calculations (such as ignoring some predicates, using exponential > back-off for filters, not having a good estimate for {{|D|}}, etc.) > h4. Longer-Term Fix > The above simple fix is the target of this ticket. Longer term, the code > should evolve to use a single path for both the M:1 and M:N cases since as > described in IMPALA-8018. (Both cases start with HMS data. Currently we use > two paths to arrive at the same result. IMPALA-8018 suggests we need only one > path.) We should also adopt the simple urn model as described in IMPALA-8218. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org