[ https://issues.apache.org/jira/browse/IMPALA-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned IMPALA-8217: ----------------------------------- Assignee: (was: Paul Rogers) > Remove scan operator exponential backoff in cardinality calcs > ------------------------------------------------------------- > > Key: IMPALA-8217 > URL: https://issues.apache.org/jira/browse/IMPALA-8217 > Project: IMPALA > Issue Type: Improvement > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Priority: Major > > The planner's scan nodes use an exponential back-off to compute the joint > filter selectivity: > {code:java} > double result = 1.0; > for (int i = 0; i < selectivities.size(); ++i) { > // Exponential backoff for each selectivity multiplied into the final > result. > result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1)); > } > {code} > The reasoning is: > {code:java} > /* > * 2. Two selectivities, whether known or unknown, could be correlated. > Assuming > * independence can lead to significant underestimation. > * > * The second issue is addressed by an exponential backoff when multiplying > each > * additional selectivity into the final result. > */ > {code} > The logic makes some kind of sense, especially since Impala calculates > selectivity only for one kind of predicate: {{col = const}}, and ignores > selectivity for all others. (Actually, replaces all others with a single > combined selectivity estimate of 0.1.}} > The problem, however, is that this violates the commutative and linear models > that underly relational theory. In general, relational theory says it does > not matter if we join then filter, or filter, then join. If we have two > filters, f1 and f2, it does not matter if we apply f1 in the scan and f2 > after the join: the resulting result set is the same. (Performance differs, > but we ignore that when computing cardinality.) > With exponential back-off, this no longer applies. It becomes difficult, say, > to address the issue in IMPALA-XXXX in which we need to apply filter > selectivity at the scan level, then remove it again at the join level. The > reason is that the cardinality applied at the scan level may have been backed > off so simply using the expression's report of its cardinality may not be the > value used by the scan. > Also, oddly, the same expression could apply different selectivity in > different scans. The code sorts selectivities in ascending order. So, in scan > 1, filter f1 might apply its full selectivity, while in scan 2 (assuming a > correlated filter), it might apply the square or cube of the selectivity. > h4. Solution > The solution here is simply to: > * Compute the selectivity for all predicates so we don't have to fudge. > * Remove correlated predicates from join calcs (See IMPALA-XXXX). > * Remove exponential back-off calculations so that filter selectivity can be > properly decomposed for the above task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org