[ https://issues.apache.org/jira/browse/TEZ-4402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Youjun Yuan updated TEZ-4402: ----------------------------- Description: we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table. The query is: {code:java} SELECT d.id, u.id FROM ( SELECT id FROM airflow.tableA rud WHERE rud.dt = '2022-04-02-1row' ) d FULL JOIN ( SELECT id FROM default.tableB WHERE dt = '2022-04-01' and device_token='blabla' ) u ON u.id = d.id ; {code} And produces two records for id=350570497 {code:java} 350570497 NULL NULL 350570497 Time taken: 62.692 seconds, Fetched: 2 row(s) {code} I am sure tableB has only one row where device_token='blabla' And we tried: 1, SET mapreduce.job.reduces=1; then it produces right result; 2, SET hive.execution.engine=mr; then it produces right result; 3, switch to tez 0.9.2, then it produces right result; 4, *JOIN* (instead of FULL JOIN) worked as expected 5, in sub query u, change filter _device_token='blabla'_ to {_}id=350570497{_}, it worked ok I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~10m rows, but if I further reduce the size of tableB, then the issue can't be reproduced. any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. was: we hit a query which FULL JOINs two tables, hive produces incorrect results, for a single value of join key, it produces two records, each record has a valid value for one table and NULL for the other table. The query is: {code:java} SELECT d.id, u.id FROM ( SELECT id FROM airflow.tableA rud WHERE rud.dt = '2022-04-02-1row' ) d FULL JOIN ( SELECT id FROM default.tableB WHERE dt = '2022-04-01' and device_token='blabla' ) u ON u.id = d.id ; {code} And produces two records for id=350570497 {code:java} 350570497 NULL NULL 350570497 Time taken: 62.692 seconds, Fetched: 2 row(s) {code} I am sure tableB has only one row where device_token='blabla' And we tried: 1, SET mapreduce.job.reduces=1; then it produces right result; 2, SET hive.execution.engine=mr; then it produces right result; 3, switch to tez 0.9.1, then it produces right result; 4, *JOIN* (instead of FULL JOIN) worked as expected 5, in sub query u, change filter _device_token='blabla'_ to {_}id=350570497{_}, it worked ok I can't generate a small enough result set to reproduce the issue, I have minimized the tableA to only 1 row, tableB has ~10m rows, but if I further reduce the size of tableB, then the issue can't be reproduced. any suggestion would be highly appreciated, regarding the root cause of the issue, how to work around it, or how to reproduce it with small enough dataset. > failed to do FULL JOIN tez 0.10.1 > --------------------------------- > > Key: TEZ-4402 > URL: https://issues.apache.org/jira/browse/TEZ-4402 > Project: Apache Tez > Issue Type: Bug > Environment: EMR (hive 3.1.2 + Tez 0.10.1) > Data stored in S3 > > Reporter: Youjun Yuan > Priority: Major > > we hit a query which FULL JOINs two tables, hive produces incorrect results, > for a single value of join key, it produces two records, each record has a > valid value for one table and NULL for the other table. > The query is: > > {code:java} > SELECT d.id, u.id > FROM ( > SELECT id > FROM airflow.tableA rud > WHERE rud.dt = '2022-04-02-1row' > ) d > FULL JOIN ( > SELECT id > FROM default.tableB > WHERE dt = '2022-04-01' and device_token='blabla' > ) u > ON u.id = d.id > ; > {code} > And produces two records for id=350570497 > > > {code:java} > 350570497 NULL > NULL 350570497 > Time taken: 62.692 seconds, Fetched: 2 row(s) {code} > I am sure tableB has only one row where device_token='blabla' > And we tried: > 1, SET mapreduce.job.reduces=1; then it produces right result; > 2, SET hive.execution.engine=mr; then it produces right result; > 3, switch to tez 0.9.2, then it produces right result; > 4, *JOIN* (instead of FULL JOIN) worked as expected > 5, in sub query u, change filter _device_token='blabla'_ to > {_}id=350570497{_}, it worked ok > > I can't generate a small enough result set to reproduce the issue, I have > minimized the tableA to only 1 row, tableB has ~10m rows, but if I further > reduce the size of tableB, then the issue can't be reproduced. > > any suggestion would be highly appreciated, regarding the root cause of the > issue, how to work around it, or how to reproduce it with small enough > dataset. > -- This message was sent by Atlassian Jira (v8.20.1#820001)