[
https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956855#comment-17956855
]
Seonggon Namgung commented on HIVE-27226:
-----------------------------------------
[~dkuzmenko] , Yes, we need to merge the quick-fix patch into master or another
branch that is dedicated to 4.1.0 release.
> FullOuterJoin with filter expressions is not computed correctly
> ---------------------------------------------------------------
>
> Key: HIVE-27226
> URL: https://issues.apache.org/jira/browse/HIVE-27226
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0-beta-1
> Reporter: Seonggon Namgung
> Priority: Major
> Labels: hive-4.1.0-must, known_issue
>
> I tested many OuterJoin queries as an extension of HIVE-27138, and I found
> that Hive returns incorrect result for a query containing FullOuterJoin with
> filter expressions. In a nutshell, all JoinOperators that run on Tez engine
> return incorrect result for OuterJoin queries, and one of the reason for
> incorrect computation comes from CommonJoinOperator, which is the base of all
> JoinOperators. I attached the queries and configuration that I used at the
> bottom of the document. I am still inspecting this problems, and I will share
> an update once when I find out another reason. Also any comments and opinions
> would be appreciated.
> First of all, I observed that current Hive ignores filter expressions
> contained in MapJoinOperator. For example, the attached result of query1
> shows that MapJoinOperator performs inner join, not full outer join. This
> problem stems from removal of filterMap. When converting JoinOperator to
> MapJoinOperator, ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin()
> removes filterMap of MapJoinOperator. Because MapJoinOperator does not
> evaluate filter expressions if filterMap is null, this change makes
> MapJoinOperator ignore filter expressions and it always joins tables
> regardless whether they satisfy filter expressions or not. To solve this
> problem, I disable FullOuterMapJoinOptimization and apply path for
> HIVE-27138, which prevents NPE. (The patch is available at the following
> link: LINK.) The rest of this document uses this modified Hive, but most of
> problems happen to current Hive, too.
> The second problem I found is that Hive returns the same left-null or
> right-null rows multiple time when it uses MapJoinOperator or
> CommonMergeJoinOperator. This is caused by the logic of current
> CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps.
> First, they create RowContainers, each of which is a group of rows from one
> table and has the same key. Second, they call
> CommonJoinOperator#checkAndGenObject() with created RowContainers. This
> method checks filterTag of each row in RowContainers and forwards joined row
> if they meet all filter conditions. For OuterJoin, checkAndGenObject()
> forwards non-matching rows if there is no matching row in RowContainer. The
> problem happens when there are multiple RowContainer for the same key and
> table. For example, suppose that there are two left RowContainers and one
> right RowContainer. If none of the row in two left RowContainers satisfies
> filter condition, then checkAndGenObject() will forward Left-Null row for
> each right row. Because checkAndGenObject() is called with each left
> RowContainer, there will be two duplicated Left-Null rows for every right row.
> In the case of MapJoinOperator, it always creates singleton RowContainer for
> big table. Therefore, it always produces duplicated non-matching rows.
> CommonMergeJoinOperator also creates multiple RowContainer for big table,
> whose size is hive.join.emit.interval. In the below experiment, I also set
> hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to
> disable specialized algorithm for OuterJoin of 2 tables and force calling
> checkAndGenObject() before all rows with the same keys are gathered. I didn't
> observe this problem when using VectorMapJoinOperator, and I will inspect
> VectorMapJoinOperator whether we can reproduce the problem with it.
> I think the second problem is not limited to FullOuterJoin, but I couldn't
> find such query as of now. This will also be added to this issue if I can
> write a query that reproduces the second problem without FullOuterJoin.
> I also found that Hive returns wrong result for query2 even when I used
> VectorMapJoinOperator. I am still inspecting this problem and I will add an
> update on it when I find out the reason.
>
> Experiment:
>
> {code:java}
> ---- Configuration
> set hive.optimize.shared.work=false;
> -- Std MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=false;
> -- Vec MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=true;
> -- MergeJoin
> set hive.auto.convert.join=false;
> set hive.vectorized.execution.enabled=false;
> set hive.join.shortcut.unmatched.rows=false;
> set hive.join.emit.interval=1;
> set hive.exec.reducers.max=1;
>
> ---- Queries
> -- Query 1
> DROP TABLE IF EXISTS a;
> CREATE TABLE a (key string, value string);
> INSERT INTO a VALUES (1, 1), (1, 2), (2, 1);
> SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0;
> -- Query 2
> DROP TABLE IF EXISTS b;
> CREATE TABLE b (key string, value string);
> INSERT INTO b VALUES (1, 0), (1, 1);
> SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND
> b.value > 0;{code}
>
>
> Experiment result:
>
> {code:java}
> -- PostgresSQL
> -- Query1
> key | value | key | value
> -----+-------+-----+-------
> 1 | 1 | |
> 1 | 2 | |
> 2 | 1 | |
> | | 1 | 2
> | | 1 | 1
> | | 2 | 1
> (6 rows)
> -- Query2
> key | value | key | value
> -----+-------+-----+-------
> 1 | 0 | |
> 1 | 1 | 1 | 1
> | | 1 | 0
> (3 rows){code}
> {code:java}
> -- Query1 Result, current Hive
> -- Std MapJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 2 | 1 | 2 | 1 |
> | 1 | 2 | 1 | 2 |
> | 1 | 2 | 1 | 1 |
> | 1 | 1 | 1 | 2 |
> | 1 | 1 | 1 | 1 |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 1 | 2 | NULL | NULL |
> | 1 | 1 | NULL | NULL |
> | NULL | NULL | 1 | 2 |
> | NULL | NULL | 1 | 1 |
> | 2 | 1 | NULL | NULL |
> | NULL | NULL | 2 | 1 |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 1 | 1 | NULL | NULL |
> | NULL | NULL | 1 | 1 |
> | NULL | NULL | 1 | 2 |
> | 1 | 2 | NULL | NULL |
> | NULL | NULL | 1 | 1 |
> | NULL | NULL | 1 | 2 |
> | 2 | 1 | NULL | NULL |
> | NULL | NULL | 2 | 1 |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query1 Result, Hive with HIVE-27138 patch, disable
> FullOuterMapJoinOptimization
> -- Std MapJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 2 | 1 | NULL | NULL |
> | NULL | NULL | 2 | 1 |
> | 1 | 2 | NULL | NULL |
> | NULL | NULL | 1 | 2 |
> | NULL | NULL | 1 | 1 |
> | 1 | 1 | NULL | NULL |
> | NULL | NULL | 1 | 2 |
> | NULL | NULL | 1 | 1 |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 1 | 2 | NULL | NULL |
> | 1 | 1 | NULL | NULL |
> | NULL | NULL | 1 | 2 |
> | NULL | NULL | 1 | 1 |
> | 2 | 1 | NULL | NULL |
> | NULL | NULL | 2 | 1 |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | a.key | a.value | b.key | b.value |
> +--------+----------+--------+----------+
> | 1 | 1 | NULL | NULL |
> | NULL | NULL | 1 | 1 |
> | NULL | NULL | 1 | 2 |
> | 1 | 2 | NULL | NULL |
> | NULL | NULL | 1 | 1 |
> | NULL | NULL | 1 | 2 |
> | 2 | 1 | NULL | NULL |
> | NULL | NULL | 2 | 1 |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query2 Result, current Hive
> -- Std MapJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 0 | NULL | NULL |
> | NULL | NULL | 1 | 0 |
> | NULL | NULL | 1 | 1 |
> | 1 | 1 | 1 | 0 |
> | 1 | 1 | 1 | 1 |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 0 | NULL | NULL |
> | 1 | 1 | 1 | 1 |
> | 1 | 1 | 1 | 0 |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 0 | NULL | NULL |
> | NULL | NULL | 1 | 0 |
> | 1 | 1 | 1 | 1 |
> +--------+----------+--------+----------+ {code}
> {code:java}
> -- Query2 Result, Hive with HIVE-27138 patch, disable
> FullOuterMapJoinOptimization
> -- Std MapJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 1 | 1 | 1 |
> | 1 | 1 | 1 | 0 |
> | 1 | 0 | 1 | 1 |
> | 1 | 0 | 1 | 0 |
> +--------+----------+--------+----------+
> -- Vec MapJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 0 | NULL | NULL |
> | 1 | 1 | 1 | 0 |
> | 1 | 1 | 1 | 1 |
> +--------+----------+--------+----------+
> -- MergeJoin
> +--------+----------+--------+----------+
> | b.key | b.value | a.key | a.value |
> +--------+----------+--------+----------+
> | 1 | 0 | NULL | NULL |
> | NULL | NULL | 1 | 0 |
> | 1 | 1 | 1 | 1 |
> +--------+----------+--------+----------+ {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)