[jira] [Commented] (HIVE-29692) Improve performance of correlated IN query with multiple correlated conditions

Stamatis Zampetakis (Jira) Tue, 30 Jun 2026 08:27:22 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-29692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18092688#comment-18092688
 ]


Stamatis Zampetakis commented on HIVE-29692:
--------------------------------------------

The query is part of the 
[subquery_in.q|https://github.com/apache/hive/blob/fc877ff2130cad5d8db364ffa129e3e717674449/ql/src/test/queries/clientpositive/subquery_in.q#L225]
 file and is interesting because the same correlated variable from the outer 
query "part.p_size" is used multiple times in the inner query. The plan I is 
certainly preferred to plan II in terms of performance but as a side effect of 
fixing HIVE-29688 we are now getting plan II.

The goal is to check if we can remove the extra scan + join without 
reintroducing the crashes reported under HIVE-29688 or other unintended 
side-effects.

> Improve performance of correlated IN query with multiple correlated conditions
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-29692
>                 URL: https://issues.apache.org/jira/browse/HIVE-29692
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Stamatis Zampetakis
>            Priority: Major
>
> {code:sql}
> select p_partkey from part where p_name in
>     (select p.p_name from part p left outer join part pp on p.p_type = 
> pp.p_type where pp.p_size = part.p_size and p.p_size=part.p_size)
> {code}
> +Plan I+
> {noformat}
> CBO PLAN:
> HiveProject(p_partkey=[$0])
>   HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
>     HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
>       HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
>         HiveTableScan(table=[[default, part]], table:alias=[part])
>     HiveProject(p_name=[$0], p_size0=[$3])
>       HiveJoin(condition=[=($1, $2)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>         HiveProject(p_name=[$1], p_type=[$4])
>           HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4), IS NOT 
> NULL($1))])
>             HiveTableScan(table=[[default, part]], table:alias=[p])
>         HiveProject(p_type=[$4], p_size=[$5])
>           HiveFilter(condition=[AND(IS NOT NULL($5), IS NOT NULL($4))])
>             HiveTableScan(table=[[default, part]], table:alias=[pp])
> {noformat}
> +Plan II+
> {noformat}
> CBO PLAN:
> HiveProject(p_partkey=[$0])
>   HiveSemiJoin(condition=[AND(=($1, $3), =($4, $2))], joinType=[semi])
>     HiveProject(p_partkey=[$0], p_name=[$1], p_size=[$5])
>       HiveFilter(condition=[AND(IS NOT NULL($1), IS NOT NULL($5))])
>         HiveTableScan(table=[[default, part]], table:alias=[part])
>     HiveProject(p_name=[$0], p_size1=[$3])
>       HiveJoin(condition=[AND(=($5, $3), =($1, $4))], joinType=[inner], 
> algorithm=[none], cost=[not available])
>         HiveJoin(condition=[=($2, $3)], joinType=[inner], algorithm=[none], 
> cost=[not available])
>           HiveProject(p_name=[$1], p_type=[$4], p_size=[$5])
>             HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5), IS 
> NOT NULL($1))])
>               HiveTableScan(table=[[default, part]], table:alias=[p])
>           HiveProject(p_size=[$0])
>             HiveAggregate(group=[{5}])
>               HiveFilter(condition=[IS NOT NULL($5)])
>                 HiveTableScan(table=[[default, part]], table:alias=[part])
>         HiveProject(p_type=[$4], p_size=[$5])
>           HiveFilter(condition=[AND(IS NOT NULL($4), IS NOT NULL($5))])
>             HiveTableScan(table=[[default, part]], table:alias=[pp])
> {noformat}
> Both plans return the same result set but Plan I is more efficient. Plan II, 
> contains an additional scan of the "part" table 
> {noformat}
>           HiveProject(p_size=[$0])
>             HiveAggregate(group=[{5}])
>               HiveFilter(condition=[IS NOT NULL($5)])
>                 HiveTableScan(table=[[default, part]], table:alias=[part])
> {noformat}
> that is followed by an addition join with "p".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-29692) Improve performance of correlated IN query with multiple correlated conditions

Reply via email to