Seonggon Namgung created HIVE-27226:
---------------------------------------
Summary: FullOuterJoin with filter expressions is not computed
correctly
Key: HIVE-27226
URL: https://issues.apache.org/jira/browse/HIVE-27226
Project: Hive
Issue Type: Bug
Reporter: Seonggon Namgung
I tested many OuterJoin queries as an extension of HIVE-27138, and I found that
Hive returns incorrect result for a query containing FullOuterJoin with filter
expressions. In a nutshell, all JoinOperators that run on Tez engine return
incorrect result for OuterJoin queries, and one of the reason for incorrect
computation comes from CommonJoinOperator, which is the base of all
JoinOperators. I attached the queries and configuration that I used at the
bottom of the document. I am still inspecting this problems, and I will share
an update once when I find out another reason. Also any comments and opinions
would be appreciated.
First of all, I observed that current Hive ignores filter expressions contained
in MapJoinOperator. For example, the attached result of query1 shows that
MapJoinOperator performs inner join, not full outer join. This problem stems
from removal of filterMap. When converting JoinOperator to MapJoinOperator,
ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() removes filterMap of
MapJoinOperator. Because MapJoinOperator does not evaluate filter expressions
if filterMap is null, this change makes MapJoinOperator ignore filter
expressions and it always joins tables regardless whether they satisfy filter
expressions or not. To solve this problem, I disable
FullOuterMapJoinOptimization and apply path for HIVE-27138, which prevents NPE.
(The patch is available at the following link: LINK.) The rest of this document
uses this modified Hive, but most of problems happen to current Hive, too.
The second problem I found is that Hive returns the same left-null or
right-null rows multiple time when it uses MapJoinOperator or
CommonMergeJoinOperator. This is caused by the logic of current
CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps.
First, they create RowContainers, each of which is a group of rows from one
table and has the same key. Second, they call
CommonJoinOperator#checkAndGenObject() with created RowContainers. This method
checks filterTag of each row in RowContainers and forwards joined row if they
meet all filter conditions. For OuterJoin, checkAndGenObject() forwards
non-matching rows if there is no matching row in RowContainer. The problem
happens when there are multiple RowContainer for the same key and table. For
example, suppose that there are two left RowContainers and one right
RowContainer. If none of the row in two left RowContainers satisfies filter
condition, then checkAndGenObject() will forward Left-Null row for each right
row. Because checkAndGenObject() is called with each left RowContainer, there
will be two duplicated Left-Null rows for every right row.
In the case of MapJoinOperator, it always creates singleton RowContainer for
big table. Therefore, it always produces duplicated non-matching rows.
CommonMergeJoinOperator also creates multiple RowContainer for big table, whose
size is hive.join.emit.interval. In the below experiment, I also set
hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to
disable specialized algorithm for OuterJoin of 2 tables and force calling
checkAndGenObject() before all rows with the same keys are gathered. I didn't
observe this problem when using VectorMapJoinOperator, and I will inspect
VectorMapJoinOperator whether we can reproduce the problem with it.
I think the second problem is not limited to FullOuterJoin, but I couldn't find
such query as of now. This will also be added to this issue if I can write a
query that reproduces the second problem without FullOuterJoin.
I also found that Hive returns wrong result for query2 even when I used
VectorMapJoinOperator. I am still inspecting this problem and I will add an
update on it when I find out the reason.
Experiment:
{code:java}
---- Configuration
set hive.optimize.shared.work=false;
-- Std MapJoin
set hive.auto.convert.join=true;
set hive.vectorized.execution.enabled=false;
-- Vec MapJoin
set hive.auto.convert.join=true;
set hive.vectorized.execution.enabled=true;
-- MergeJoin
set hive.auto.convert.join=false;
set hive.vectorized.execution.enabled=false;
set hive.join.shortcut.unmatched.rows=false;
set hive.join.emit.interval=1;
set hive.exec.reducers.max=1;
---- Queries
-- Query 1
DROP TABLE IF EXISTS a;
CREATE TABLE a (key string, value string);
INSERT INTO a VALUES (1, 1), (1, 2), (2, 1);
SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0;
-- Query 2
DROP TABLE IF EXISTS b;
CREATE TABLE b (key string, value string);
INSERT INTO b VALUES (1, 0), (1, 1);
SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND
b.value > 0;{code}
Experiment result:
{code:java}
-- PostgresSQL
-- Query1
key | value | key | value
-----+-------+-----+-------
1 | 1 | |
1 | 2 | |
2 | 1 | |
| | 1 | 2
| | 1 | 1
| | 2 | 1
(6 rows)
-- Query2
key | value | key | value
-----+-------+-----+-------
1 | 0 | |
1 | 1 | 1 | 1
| | 1 | 0
(3 rows){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)