Sebastian Klemke created HIVE-24033:
---------------------------------------
Summary: full outer join returns wrong number of results if
hive.optimize.joinreducededuplication is enabled
Key: HIVE-24033
URL: https://issues.apache.org/jira/browse/HIVE-24033
Project: Hive
Issue Type: Bug
Reporter: Sebastian Klemke
We encountered a hive query that returns incorrect results, when joining two
CTEs on a group by value. The input tables `id_table` and
`reference_table` are unfortunately too large to share and on smaller tables we
have not been able to reproduce.
{code}
WITH ids AS (
SELECT
record.id AS id
FROM
`id_table`
LATERAL VIEW explode(records) r AS record
WHERE
record.id = '5ef0bad74d325f72f0360c19'
LIMIT 1
),
refs AS (
SELECT
reference['id'] AS referenceId
FROM
`reference_table`
WHERE
partition_date = '2020-06-24'
AND type = '1b0e9eb5c492d1859815410253dd79b5'
AND reference['id'] = '5ef0bad74d325f72f0360c19'
GROUP BY
reference['id']
)
SELECT
l.id AS id
, r.referenceId AS referenceId
FROM
ids l
FULL OUTER JOIN
refs r
ON
l.id = r.referenceId
{code}
This returns 2 rows, because the join clause misses:
{code}
OK
5ef0bad74d325f72f0360c19 NULL
NULL 5ef0bad74d325f72f0360c19
{code}
Instead, a single row should be returned. The correct behavior can be achieved
by either
* calling lower() on the refs group by statement (doesn't change the string
contents)
* setting hive.optimize.joinreducededuplication=false
--
This message was sent by Atlassian Jira
(v8.3.4#803005)