Chiran Ravani created HIVE-24245:
------------------------------------
Summary: Vectorized PTF with count and distinct over partition
producing incorrect results.
Key: HIVE-24245
URL: https://issues.apache.org/jira/browse/HIVE-24245
Project: Hive
Issue Type: Bug
Components: Hive, PTF-Windowing, Vectorization
Affects Versions: 3.1.2, 3.1.0
Reporter: Chiran Ravani
Vectorized PTF for count and distinct over partition is broken. It produces
incorrect results.
Below is the test case.
{code}
CREATE TABLE bigd781b_new (
id int,
txt1 string,
txt2 string,
cda_date int,
cda_job_name varchar(12));
INSERT INTO bigd781b_new VALUES
(1,'2010005759','7164335675012038',20200528,'load1'),
(2,'2010005759','7164335675012038',20200528,'load2');
{code}
Running below query produces incorrect results
{code}
SELECT
txt1,
txt2,
count(distinct txt1) over(partition by txt1) as n,
count(distinct txt2) over(partition by txt2) as m
FROM bigd781b_new
WHERE cda_date = 20200528 and ( txt2 = '7164335675012038');
{code}
as below.
{code}
+-------------+-------------------+----+----+
| txt1 | txt2 | n | m |
+-------------+-------------------+----+----+
| 2010005759 | 7164335675012038 | 2 | 2 |
| 2010005759 | 7164335675012038 | 2 | 2 |
+-------------+-------------------+----+----+
{code}
While the correct output would be
{code}
+-------------+-------------------+----+----+
| txt1 | txt2 | n | m |
+-------------+-------------------+----+----+
| 2010005759 | 7164335675012038 | 1 | 1 |
| 2010005759 | 7164335675012038 | 1 | 1 |
+-------------+-------------------+----+----+
{code}
The problem does not appear after setting below property
set hive.vectorized.execution.ptf.enabled=false;
--
This message was sent by Atlassian Jira
(v8.3.4#803005)