[ 
https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-27226:
----------------------------------
    Labels: hive-4.0.0-must pull-request-available  (was: hive-4.0.0-must)

> FullOuterJoin with filter expressions is not computed correctly
> ---------------------------------------------------------------
>
>                 Key: HIVE-27226
>                 URL: https://issues.apache.org/jira/browse/HIVE-27226
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0-beta-1
>            Reporter: Seonggon Namgung
>            Priority: Critical
>              Labels: hive-4.0.0-must, pull-request-available
>
> I tested many OuterJoin queries as an extension of HIVE-27138, and I found 
> that Hive returns incorrect result for a query containing FullOuterJoin with 
> filter expressions. In a nutshell, all JoinOperators that run on Tez engine 
> return incorrect result for OuterJoin queries, and one of the reason for 
> incorrect computation comes from CommonJoinOperator, which is the base of all 
> JoinOperators. I attached the queries and configuration that I used at the 
> bottom of the document. I am still inspecting this problems, and I will share 
> an update once when I find out another reason. Also any comments and opinions 
> would be appreciated.
> First of all, I observed that current Hive ignores filter expressions 
> contained in MapJoinOperator. For example, the attached result of query1 
> shows that MapJoinOperator performs inner join, not full outer join. This 
> problem stems from removal of filterMap. When converting JoinOperator to 
> MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() 
> removes filterMap of MapJoinOperator. Because MapJoinOperator does not 
> evaluate filter expressions if filterMap is null, this change makes 
> MapJoinOperator ignore filter expressions and it always joins tables 
> regardless whether they satisfy filter expressions or not. To solve this 
> problem, I disable FullOuterMapJoinOptimization and apply path for 
> HIVE-27138, which prevents NPE. (The patch is available at the following 
> link: LINK.) The rest of this document uses this modified Hive, but most of 
> problems happen to current Hive, too.
> The second problem I found is that Hive returns the same left-null or 
> right-null rows multiple time when it uses MapJoinOperator or 
> CommonMergeJoinOperator. This is caused by the logic of current 
> CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps. 
> First, they create RowContainers, each of which is a group of rows from one 
> table and has the same key. Second, they call 
> CommonJoinOperator#checkAndGenObject() with created RowContainers. This 
> method checks filterTag of each row in RowContainers and forwards joined row 
> if they meet all filter conditions. For OuterJoin, checkAndGenObject() 
> forwards non-matching rows if there is no matching row in RowContainer. The 
> problem happens when there are multiple RowContainer for the same key and 
> table. For example, suppose that there are two left RowContainers and one 
> right RowContainer. If none of the row in two left RowContainers satisfies 
> filter condition, then checkAndGenObject() will forward Left-Null row for 
> each right row. Because checkAndGenObject() is called with each left 
> RowContainer, there will be two duplicated Left-Null rows for every right row.
> In the case of MapJoinOperator, it always creates singleton RowContainer for 
> big table. Therefore, it always produces duplicated non-matching rows. 
> CommonMergeJoinOperator also creates multiple RowContainer for big table, 
> whose size is hive.join.emit.interval. In the below experiment, I also set 
> hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to 
> disable specialized algorithm for OuterJoin of 2 tables and force calling 
> checkAndGenObject() before all rows with the same keys are gathered. I didn't 
> observe this problem when using VectorMapJoinOperator, and I will inspect 
> VectorMapJoinOperator whether we can reproduce the problem with it.
> I think the second problem is not limited to FullOuterJoin, but I couldn't 
> find such query as of now. This will also be added to this issue if I can 
> write a query that reproduces the second problem without FullOuterJoin.
> I also found that Hive returns wrong result for query2 even when I used 
> VectorMapJoinOperator. I am still inspecting this problem and I will add an 
> update on it when I find out the reason.
>  
> Experiment:
>  
> {code:java}
> ---- Configuration
> set hive.optimize.shared.work=false;
> -- Std MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=false;
> -- Vec MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=true;
> -- MergeJoin
> set hive.auto.convert.join=false;
> set hive.vectorized.execution.enabled=false;
> set hive.join.shortcut.unmatched.rows=false;
> set hive.join.emit.interval=1;
> set hive.exec.reducers.max=1;
>  
> ---- Queries
> -- Query 1
> DROP TABLE IF EXISTS a;
> CREATE TABLE a (key string, value string);
> INSERT INTO a VALUES (1, 1), (1, 2), (2, 1);
> SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0;
> -- Query 2
> DROP TABLE IF EXISTS b;
> CREATE TABLE b (key string, value string);
> INSERT INTO b VALUES (1, 0), (1, 1);
> SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND 
> b.value > 0;{code}
>  
>  
> Experiment result:
>  
> {code:java}
> -- PostgresSQL
> -- Query1
> key | value | key | value
> -----+-------+-----+-------
>    1 |     1 |     |
>    1 |     2 |     |
>    2 |     1 |     |
>      |       |   1 |     2
>      |       |   1 |     1
>      |       |   2 |     1
> (6 rows)
> -- Query2
>  key | value | key | value
> -----+-------+-----+-------
>    1 |     0 |     |      
>    1 |     1 |   1 |     1
>      |       |   1 |     0
> (3 rows){code}
> {code:java}
> -- Query1 Result, current Hive
> -- Std MapJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 2      | 1        | 2      | 1        |
> | 1      | 2        | 1      | 2        |
> | 1      | 2        | 1      | 1        |
> | 1      | 1        | 1      | 2        |
> | 1      | 1        | 1      | 1        |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 1      | 2        | NULL   | NULL     |
> | 1      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 2        |
> | NULL   | NULL     | 1      | 1        |
> | 2      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 2      | 1        |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 1      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 1        |
> | NULL   | NULL     | 1      | 2        |
> | 1      | 2        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 1        |
> | NULL   | NULL     | 1      | 2        |
> | 2      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 2      | 1        |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query1 Result, Hive with HIVE-27138 patch, disable 
> FullOuterMapJoinOptimization
> -- Std MapJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 2      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 2      | 1        |
> | 1      | 2        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 2        |
> | NULL   | NULL     | 1      | 1        |
> | 1      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 2        |
> | NULL   | NULL     | 1      | 1        |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 1      | 2        | NULL   | NULL     |
> | 1      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 2        |
> | NULL   | NULL     | 1      | 1        |
> | 2      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 2      | 1        |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | a.key  | a.value  | b.key  | b.value  |
> +--------+----------+--------+----------+
> | 1      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 1        |
> | NULL   | NULL     | 1      | 2        |
> | 1      | 2        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 1        |
> | NULL   | NULL     | 1      | 2        |
> | 2      | 1        | NULL   | NULL     |
> | NULL   | NULL     | 2      | 1        |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query2 Result, current Hive
> -- Std MapJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 0        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 0        |
> | NULL   | NULL     | 1      | 1        |
> | 1      | 1        | 1      | 0        |
> | 1      | 1        | 1      | 1        |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 0        | NULL   | NULL     |
> | 1      | 1        | 1      | 1        |
> | 1      | 1        | 1      | 0        |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 0        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 0        |
> | 1      | 1        | 1      | 1        |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query2 Result, Hive with HIVE-27138 patch, disable 
> FullOuterMapJoinOptimization
> -- Std MapJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 1        | 1      | 1        |
> | 1      | 1        | 1      | 0        |
> | 1      | 0        | 1      | 1        |
> | 1      | 0        | 1      | 0        |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 0        | NULL   | NULL     |
> | 1      | 1        | 1      | 0        |
> | 1      | 1        | 1      | 1        |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | b.key  | b.value  | a.key  | a.value  |
> +--------+----------+--------+----------+
> | 1      | 0        | NULL   | NULL     |
> | NULL   | NULL     | 1      | 0        |
> | 1      | 1        | 1      | 1        |
> +--------+----------+--------+----------+ {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to