Sebastian Klemke created HIVE-24033: ---------------------------------------
Summary: full outer join returns wrong number of results if hive.optimize.joinreducededuplication is enabled Key: HIVE-24033 URL: https://issues.apache.org/jira/browse/HIVE-24033 Project: Hive Issue Type: Bug Reporter: Sebastian Klemke We encountered a hive query that returns incorrect results, when joining two CTEs on a group by value. The input tables `id_table` and `reference_table` are unfortunately too large to share and on smaller tables we have not been able to reproduce. {code} WITH ids AS ( SELECT record.id AS id FROM `id_table` LATERAL VIEW explode(records) r AS record WHERE record.id = '5ef0bad74d325f72f0360c19' LIMIT 1 ), refs AS ( SELECT reference['id'] AS referenceId FROM `reference_table` WHERE partition_date = '2020-06-24' AND type = '1b0e9eb5c492d1859815410253dd79b5' AND reference['id'] = '5ef0bad74d325f72f0360c19' GROUP BY reference['id'] ) SELECT l.id AS id , r.referenceId AS referenceId FROM ids l FULL OUTER JOIN refs r ON l.id = r.referenceId {code} This returns 2 rows, because the join clause misses: {code} OK 5ef0bad74d325f72f0360c19 NULL NULL 5ef0bad74d325f72f0360c19 {code} Instead, a single row should be returned. The correct behavior can be achieved by either * calling lower() on the refs group by statement (doesn't change the string contents) * setting hive.optimize.joinreducededuplication=false -- This message was sent by Atlassian Jira (v8.3.4#803005)