[ 
https://issues.apache.org/jira/browse/HIVE-29121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18012496#comment-18012496
 ] 

Stamatis Zampetakis commented on HIVE-29121:
--------------------------------------------

The HiveSubQueryRemoveRule is a modified copy of 
[SubQueryRemoveRule.java|https://github.com/apache/calcite/blob/4fa0d5bbe987c4ae0779ca434fb71e1ae8637b77/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java].
 Having slight variations of the same code in different repos adds a huge 
maintenance overhead and increases tech debt. The Hive variant has probably 
diverged a lot by now, but if we ever want to consolidate the two classes we 
should stop modifying the former and focus on unifying the two and eventually 
dropping the Hive variant. If the changes that you are proposing bring the two 
implementations closer then it's a net positive change, so let me know and I 
will review the PR.

The only reason that HiveSemiJoin was introduced in the first place is that 
in very early versions of Calcite the 
[Join|https://github.com/apache/calcite/blob/4fa0d5bbe987c4ae0779ca434fb71e1ae8637b77/core/src/main/java/org/apache/calcite/rel/core/Join.java]
 operator couldn't model semi joins. This changed in CALCITE-2696, which 
also deprecated the SemiJoin expression. The suggested path forward would be to 
drop HiveSemiJoin completely in favor of HiveJoin. If this change can also 
solve the perf issue that you discovered then it's definitely time well spent.
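
As a rough illustration of the two plan shapes being discussed for an 
uncorrelated IN subquery under Logic.TRUE, here is a minimal plain-Java sketch. 
It does not use Hive or Calcite classes; the class and method names are 
hypothetical, rows are reduced to integer join keys, and NULL keys are assumed 
to be absent. It only shows that a semi join and an inner join against the 
distinct subquery values return the same rows, while an inner join without the 
aggregate duplicates them.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: none of these names exist in Hive or Calcite.
final class UncorrelatedInSketch {

  // Semi join: emit a left row once if its key has at least one match on the right.
  static List<Integer> semiJoin(List<Integer> left, List<Integer> right) {
    Set<Integer> rightKeys = new LinkedHashSet<>(right);
    List<Integer> out = new ArrayList<>();
    for (Integer l : left) {
      if (rightKeys.contains(l)) {
        out.add(l);
      }
    }
    return out;
  }

  // Plain inner join: emit one row per matching (left, right) pair.
  static List<Integer> innerJoin(List<Integer> left, List<Integer> right) {
    List<Integer> out = new ArrayList<>();
    for (Integer l : left) {
      for (Integer r : right) {
        if (l.equals(r)) {
          out.add(l);
        }
      }
    }
    return out;
  }

  // Aggregate (distinct) on the subquery side, then a plain inner join.
  static List<Integer> innerJoinOnDistinct(List<Integer> left, List<Integer> right) {
    List<Integer> distinctRight = new ArrayList<>(new LinkedHashSet<>(right));
    return innerJoin(left, distinctRight);
  }

  public static void main(String[] args) {
    List<Integer> left = List.of(1, 2, 2, 3);   // outer query keys
    List<Integer> right = List.of(2, 2, 3, 5);  // uncorrelated subquery result
    System.out.println(semiJoin(left, right));
    System.out.println(innerJoinOnDistinct(left, right));
    System.out.println(innerJoin(left, right)); // duplicates rows without the aggregate
  }
}
```

In this toy model the aggregate is what keeps the inner-join form correct, so 
a plan that uses a semi join and still keeps the aggregate pays for both.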

I don't remember all the details behind LoptOptimizeJoinRule so if the latter 
cannot handle the use-case that you have in mind we can continue iterating on 
the approach to modify HiveSubQueryRemoveRule.


> Restore HiveSubQueryRemoveRule to use InnerJoin instead of SemiJoin for 
> uncorrelated IN/EXISTS subqueries with RelOptUtil.Logic.TRUE.
> -------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29121
>                 URL: https://issues.apache.org/jira/browse/HIVE-29121
>             Project: Hive
>          Issue Type: Improvement
>         Environment: [^plan.example.txt]
>            Reporter: Seonggon Namgung
>            Assignee: Seonggon Namgung
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: plan.example.txt
>
>
> This JIRA is an addendum patch to HIVE-24685 and aims to restore the compiler 
> logic from HIVE-17767.
> During the substitution of HiveSubQRemoveRelBuilder with Calcite's RelBuilder 
> in HIVE-24685, Hive was changed to always use SemiJoin when handling 
> uncorrelated IN/EXISTS subqueries with logic == RelOptUtil.Logic.TRUE. Since 
> the SemiJoin is intended for use with correlated IN/EXISTS subqueries in 
> conjunction with AGGR removal (cf. HIVE-17767), we should avoid using 
> SemiJoin for the uncorrelated case, which neither benefits from AGGR removal 
> nor allows the application of rules that cannot handle HiveSemiJoin (e.g., 
> join reordering).
> For clarity, the following combinations of query plans are attached. From the 
> attached plans, we can observe that HIVE-24685 introduces a SemiJoin without 
> removing HiveAggregate, unlike HIVE-17767.
> The attached plans cover the following combinations:
> * {Before HIVE-17767, After HIVE-17767, After HIVE-24685}
> * {Correlated, Uncorrelated}
> * {Before subquery removal, After subquery removal, After decorrelation}
> We discovered this issue while investigating a performance regression in 
> TPC-DS Query 23.



