[ https://issues.apache.org/jira/browse/IMPALA-8262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers reassigned IMPALA-8262: ----------------------------------- Assignee: (was: Paul Rogers) > Join cardinality not decreased by join filter selectivity > --------------------------------------------------------- > > Key: IMPALA-8262 > URL: https://issues.apache.org/jira/browse/IMPALA-8262 > Project: IMPALA > Issue Type: Bug > Components: Frontend > Affects Versions: Impala 3.1.0 > Reporter: Paul Rogers > Priority: Major > > Consider a subset of the plan for TPC-H query 7. (See {{tpch-all.test}} for > details.) > {noformat} > 11:AGGREGATE [FINALIZE] > | output: sum(l_extendedprice * (1 - l_discount)) > | group by: n1.n_name, n2.n_name, year(l_shipdate) > | row-size=58B cardinality=575.77K > | > 10:HASH JOIN [INNER JOIN] > | hash predicates: c_nationkey = n2.n_nationkey > | other predicates: ((n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY') OR > (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')) > | row-size=132B cardinality=575.77K > | > |--05:SCAN HDFS [tpch.nation n2] > | row-size=21B cardinality=25 > | > 09:HASH JOIN [INNER JOIN] > | hash predicates: s_nationkey = n1.n_nationkey > | row-size=111B cardinality=575.77K > {noformat} > Here, we have join 09 feeding 576K rows into join 10. All 576K rows pass > along to the aggregate 11. Notice, however, that join 10 has a that picks out > 2 of the 25 countries in each of two paths. The selectivity of the filters > should be something like 2 * 2/25 = 0.16. Thus, the output cardinality of the > 10 join should be 577K * 0.16 = 92K. > The problem is that the join cardinality calculations don't consider join > filter selectivity. > It may be that this was done to handle the outer join case, in which filters > applied in the outer-side scan must be re-applied on the join. Omitting the > filters avoids duplicate accounting for the selectivity. > But, that case is special and should be handled specially as part of > IMPALA-8213. Except for correlated filters, the planner *should* apply join > filter selectivity to the join output cardinality calculations. > This error has consequences. The filter should reduce the number of rows > though the join. Because it does so, it should come early in the join tree to > reduce the set of rows processed. But, because selectivity is ignored, the > planner does not see the join as a filter, and ends up putting the join 10 at > the top of the join tree. (See the test file for the full plan.) The result > is that Impala schleps around many more rows than necessary, only to discard > them near the top of the DAG. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org