[jira] [Commented] (HIVE-26653) Wrong results when (map) joining multiple tables on partition column

Stamatis Zampetakis (Jira) Sat, 11 Oct 2025 14:52:46 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-26653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18028578#comment-18028578
 ]


Stamatis Zampetakis commented on HIVE-26653:
--------------------------------------------

Thanks for checking [~soumyakanti.das]. The fact that constant folding at CBO 
level resolves the issue is kinda suspicious so not sure if 
https://github.com/apache/hive/commit/36ce858163a19e29eafe4a8d3307191bc28fc175 
is really a fix unless the old plan was for some reason invalid. It seems that 
in the old plan the map join was performed on STRING columns (that were 
basically literals) while in the new plan the join is performed on BOOLEAN 
columns. The join type is an important part of the equation since it leads to a 
completely different vectorized implementation under the hood. There are many 
different VectorMapJoin implementations 
(https://github.com/apache/hive/tree/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/mapjoin)
 and the bug may be present in just one of them. {{EXPLAIN VECTORIZED}} before 
and after the fix should help verify if the Map join implementation has changed.


> Wrong results when (map) joining multiple tables on partition column
> --------------------------------------------------------------------
>
>                 Key: HIVE-26653
>                 URL: https://issues.apache.org/jira/browse/HIVE-26653
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: hive_26653.q, hive_26653_explain.txt, 
> hive_26653_explain_cbo.txt, table_a.csv, table_b.csv
>
>
> The result of the query must have exactly one row matching the date specified 
> in the WHERE clause but the query returns nothing.
> {code:sql}
> CREATE TABLE table_a (`aid` string ) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tbla.csv' into TABLE table_a;
> CREATE TABLE table_b (`bid` string) PARTITIONED BY (`p_dt` string)
> row format delimited fields terminated by ',' stored as textfile;
> LOAD DATA LOCAL INPATH '../../data/files/_tblb.csv' into TABLE table_b;
> set hive.auto.convert.join=true;
> set hive.optimize.semijoin.conversion=false;
> SELECT a.p_dt
> FROM ((SELECT p_dt
>        FROM table_b
>        GROUP BY p_dt) a
>          JOIN
>      (SELECT p_dt
>       FROM table_a
>       GROUP BY p_dt) b ON a.p_dt = b.p_dt
>          JOIN
>      (SELECT p_dt
>       FROM table_a
>       GROUP BY p_dt) c ON a.p_dt = c.p_dt)
> WHERE a.p_dt =  translate(cast(to_date(date_sub('2022-08-01', 1)) AS string), 
> '-', '');
> {code}
> +Expected result+
> 20220731
> +Actual result+
> Empty
> To reproduce the problem the tables need to have some data. Values in aid and 
> bid columns are not important. For p_dt column use one of the following 
> values 20220731, 20220630.
> I will attach some sample data with which the problem can be reproduced. The 
> tables look like below.
> ||aid|pdt||
> |611|20220731|
> |239|20220630|
> |...|...|
> The problem can be reproduced via qtest in current master 
> (commit 
> [6b05d64ce8c7161415d97a7896ea50025322e30a|https://github.com/apache/hive/commit/6b05d64ce8c7161415d97a7896ea50025322e30a])
>  by running the TestMiniLlapLocalCliDriver.
> There is specific query plan (will attach shortly) for which the problem 
> shows up so if the plan changes slightly the problem may not appear anymore; 
> this is why we need to set explicitly hive.optimize.semijoin.conversion and 
> hive.auto.convert.join to trigger the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HIVE-26653) Wrong results when (map) joining multiple tables on partition column

Reply via email to