Vineet Garg created HIVE-21481:
----------------------------------
Summary: MERGE correctness issues with null safe equality
Key: HIVE-21481
URL: https://issues.apache.org/jira/browse/HIVE-21481
Project: Hive
Issue Type: Bug
Components: Query Planning
Reporter: Vineet Garg
The way Hive currently generates plan for MERGE statement can lead to wrong
results with null safe equality.
To illustrate consider the following reproducer
{code:sql}
create table ttarget(s string, j int, flag string) stored as orc
tblproperties("transactional"="true");
truncate table ttarget;
insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');
create table tsource (i int);
insert into tsource values(null),(2);
{code}
Let's say you have the following MERGE statement
{code:sql}
explain merge into ttarget using tsource on i<=>j
when matched THEN
UPDATE set flag='updated'
when not matched THEN
INSERT VALUES('new', 1999, 'true');
{code}
With this MERGE {{*ONLY ONE*}} row should match in target which should be
updated. But currently due to the plan hive generate it will end up matching
both rows.
This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER
corresponding to all branches.
The part of the plan generated by hive for this statement consist of:
{noformat}
Map 2
Map Operator Tree:
TableScan
alias: tsource
Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
Column stats: NONE
Map Join Operator
condition map:
Right Outer Join 0 to 1
keys:
0 j (type: int)
1 i (type: int)
nullSafes: [true]
outputColumnNames: _col0, _col1, _col5, _col6
input vertices:
0 Map 1
Statistics: Num rows: 1 Data size: 206 Basic stats:
COMPLETE Column stats: NONE
HybridGraceHashJoin: true
Filter Operator
predicate: (_col6 IS NOT DISTINCT FROM _col1) (type:
boolean)
Statistics: Num rows: 1 Data size: 206 Basic stats:
COMPLETE Column stats: NONE
Select Operator
expressions: _col5 (type:
struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string), _col1
(type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 206 Basic stats:
COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type:
struct<writeid:bigint,bucketid:int,rowid:bigint>)
sort order: +
Map-reduce partition columns: UDFToInteger(_col0)
(type: int)
Statistics: Num rows: 1 Data size: 206 Basic stats:
COMPLETE Column stats: NONE
value expressions: _col1 (type: string), _col2 (type:
int)
{noformat}
Result after JOIN will be :
{code:sql}
select s,j,i from ttarget right outer join tsource on i<=>j ;
NULL NULL NULL
NULL NULL 2
{code}
On this resultset predicate {{(_col6 IS NOT DISTINCT FROM _col1)}} will be true
for both resulting into both rows matching.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)