Stamatis Zampetakis created HIVE-29692:
------------------------------------------
Summary: Improve performance of correlated IN query with multiple
correlated conditions
Key: HIVE-29692
URL: https://issues.apache.org/jira/browse/HIVE-29692
Project: Hive
Issue Type: Improvement
Reporter: Stamatis Zampetakis
{code:sql}
select p_partkey from part where p_name in
(select p.p_name from part p left outer join part pp on p.p_type =
pp.p_type where pp.p_size = part.p_size and p.p_size=part.p_size)
{code}
+Plan I+
{noformat}
CBO PLAN:
HiveProject(p_partkey=[$0])
HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
HiveTableScan(table=[[default, part]], table:alias=[part])
HiveProject(p_name=[$0], p_size0=[$3])
HiveJoin(condition=[=($1, $2)], joinType=[inner], algorithm=[none],
cost=[not available])
HiveProject(p_name=[$1], p_type=[$4])
HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4), IS NOT
NULL($1))])
HiveTableScan(table=[[default, part]], table:alias=[p])
HiveProject(p_type=[$4], p_size=[$5])
HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4))])
HiveTableScan(table=[[default, part]], table:alias=[pp])
{noformat}
+Plan II+
{noformat}
CBO PLAN:
HiveProject(p_partkey=[$0])
HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
HiveTableScan(table=[[default, part]], table:alias=[part])
HiveProject(p_name=[$0], p_size1=[$3])
HiveJoin(condition=[AND(=($5, $3), =($1, $4))], joinType=[inner],
algorithm=[none], cost=[not available])
HiveJoin(condition=[=($2, $3)], joinType=[inner], algorithm=[none],
cost=[not available])
HiveProject(p_name=[$1], p_type=[$4], p_size=[$5])
HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5), IS NOT
NULL($1))])
HiveTableScan(table=[[default, part]], table:alias=[p])
HiveProject(p_size=[$0])
HiveAggregate(group=[{5}])
HiveFilter(condition=[IS NOT NULL($5)])
HiveTableScan(table=[[default, part]], table:alias=[part])
HiveProject(p_type=[$4], p_size=[$5])
HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5))])
HiveTableScan(table=[[default, part]], table:alias=[pp])
{noformat}
Both plans return the same result set but Plan I is more efficient. Plan II,
contains an additional scan of the "part" table
{noformat}
HiveProject(p_size=[$0])
HiveAggregate(group=[{5}])
HiveFilter(condition=[IS NOT NULL($5)])
HiveTableScan(table=[[default, part]], table:alias=[part])
{noformat}
that is followed by an addition join with "p".
--
This message was sent by Atlassian Jira
(v8.20.10#820010)