[
https://issues.apache.org/jira/browse/HIVE-29692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092688#comment-18092688
]
Stamatis Zampetakis commented on HIVE-29692:
--------------------------------------------
The query is part of the
[subquery_in.q|https://github.com/apache/hive/blob/fc877ff2130cad5d8db364ffa129e3e717674449/ql/src/test/queries/clientpositive/subquery_in.q#L225]
file and is interesting because the same correlated variable from the outer
query "part.p_size" is used multiple times in the inner query. The plan I is
certainly preferred to plan II in terms of performance but as a side effect of
fixing HIVE-29688 we are now getting plan II.
The goal is to check if we can remove the extra scan + join without
reintroducing the crashes reported under HIVE-29688 or other unintended
side-effects.
> Improve performance of correlated IN query with multiple correlated conditions
> ------------------------------------------------------------------------------
>
> Key: HIVE-29692
> URL: https://issues.apache.org/jira/browse/HIVE-29692
> Project: Hive
> Issue Type: Improvement
> Reporter: Stamatis Zampetakis
> Priority: Major
>
> {code:sql}
> select p_partkey from part where p_name in
> (select p.p_name from part p left outer join part pp on p.p_type =
> pp.p_type where pp.p_size = part.p_size and p.p_size=part.p_size)
> {code}
> +Plan I+
> {noformat}
> CBO PLAN:
> HiveProject(p_partkey=[$0])
> HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
> HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
> HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
> HiveTableScan(table=[[default, part]], table:alias=[part])
> HiveProject(p_name=[$0], p_size0=[$3])
> HiveJoin(condition=[=($1, $2)], joinType=[inner], algorithm=[none],
> cost=[not available])
> HiveProject(p_name=[$1], p_type=[$4])
> HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4), IS NOT
> NULL($1))])
> HiveTableScan(table=[[default, part]], table:alias=[p])
> HiveProject(p_type=[$4], p_size=[$5])
> HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4))])
> HiveTableScan(table=[[default, part]], table:alias=[pp])
> {noformat}
> +Plan II+
> {noformat}
> CBO PLAN:
> HiveProject(p_partkey=[$0])
> HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
> HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
> HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
> HiveTableScan(table=[[default, part]], table:alias=[part])
> HiveProject(p_name=[$0], p_size1=[$3])
> HiveJoin(condition=[AND(=($5, $3), =($1, $4))], joinType=[inner],
> algorithm=[none], cost=[not available])
> HiveJoin(condition=[=($2, $3)], joinType=[inner], algorithm=[none],
> cost=[not available])
> HiveProject(p_name=[$1], p_type=[$4], p_size=[$5])
> HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5), IS
> NOT NULL($1))])
> HiveTableScan(table=[[default, part]], table:alias=[p])
> HiveProject(p_size=[$0])
> HiveAggregate(group=[{5}])
> HiveFilter(condition=[IS NOT NULL($5)])
> HiveTableScan(table=[[default, part]], table:alias=[part])
> HiveProject(p_type=[$4], p_size=[$5])
> HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5))])
> HiveTableScan(table=[[default, part]], table:alias=[pp])
> {noformat}
> Both plans return the same result set but Plan I is more efficient. Plan II,
> contains an additional scan of the "part" table
> {noformat}
> HiveProject(p_size=[$0])
> HiveAggregate(group=[{5}])
> HiveFilter(condition=[IS NOT NULL($5)])
> HiveTableScan(table=[[default, part]], table:alias=[part])
> {noformat}
> that is followed by an addition join with "p".
--
This message was sent by Atlassian Jira
(v8.20.10#820010)