[jira] [Updated] (HIVE-24245) Vectorized PTF with count and distinct over partition producing incorrect results.

Chiran Ravani (Jira) Thu, 08 Oct 2020 14:40:15 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-24245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chiran Ravani updated HIVE-24245:
---------------------------------
    Description: 
Vectorized PTF for count and distinct over partition is broken. It produces 
incorrect results.
Below is the test case.

{code}
CREATE TABLE bigd781b_new (
  id int,
  txt1 string,
  txt2 string,
  cda_date int,
  cda_job_name varchar(12));

INSERT INTO bigd781b_new VALUES 
  (1,'2010005759','7164335675012038',20200528,'load1'),
  (2,'2010005759','7164335675012038',20200528,'load2');
{code}

Running below query produces incorrect results

{code}
SELECT
    txt1,
    txt2,
    count(distinct txt1) over(partition by txt1) as n,
    count(distinct txt2) over(partition by txt2) as m
FROM bigd781b_new
{code}

as below.

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 2  | 2  |
| 2010005759  | 7164335675012038  | 2  | 2  |
+-------------+-------------------+----+----+
{code}

While the correct output would be

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 1  | 1  |
| 2010005759  | 7164335675012038  | 1  | 1  |
+-------------+-------------------+----+----+
{code}


The problem does not appear after setting below property
set hive.vectorized.execution.ptf.enabled=false;


  was:
Vectorized PTF for count and distinct over partition is broken. It produces 
incorrect results.
Below is the test case.

{code}
CREATE TABLE bigd781b_new (
  id int,
  txt1 string,
  txt2 string,
  cda_date int,
  cda_job_name varchar(12));

INSERT INTO bigd781b_new VALUES 
  (1,'2010005759','7164335675012038',20200528,'load1'),
  (2,'2010005759','7164335675012038',20200528,'load2');
{code}

Running below query produces incorrect results

{code}
SELECT
    txt1,
    txt2,
    count(distinct txt1) over(partition by txt1) as n,
    count(distinct txt2) over(partition by txt2) as m
FROM bigd781b_new
WHERE cda_date = 20200528 and ( txt2 = '7164335675012038');
{code}

as below.

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 2  | 2  |
| 2010005759  | 7164335675012038  | 2  | 2  |
+-------------+-------------------+----+----+
{code}

While the correct output would be

{code}
+-------------+-------------------+----+----+
|    txt1     |       txt2        | n  | m  |
+-------------+-------------------+----+----+
| 2010005759  | 7164335675012038  | 1  | 1  |
| 2010005759  | 7164335675012038  | 1  | 1  |
+-------------+-------------------+----+----+
{code}


The problem does not appear after setting below property
set hive.vectorized.execution.ptf.enabled=false;



> Vectorized PTF with count and distinct over partition producing incorrect 
> results.
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-24245
>                 URL: https://issues.apache.org/jira/browse/HIVE-24245
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, PTF-Windowing, Vectorization
>    Affects Versions: 3.1.0, 3.1.2
>            Reporter: Chiran Ravani
>            Priority: Critical
>
> Vectorized PTF for count and distinct over partition is broken. It produces 
> incorrect results.
> Below is the test case.
> {code}
> CREATE TABLE bigd781b_new (
>   id int,
>   txt1 string,
>   txt2 string,
>   cda_date int,
>   cda_job_name varchar(12));
> INSERT INTO bigd781b_new VALUES 
>   (1,'2010005759','7164335675012038',20200528,'load1'),
>   (2,'2010005759','7164335675012038',20200528,'load2');
> {code}
> Running below query produces incorrect results
> {code}
> SELECT
>     txt1,
>     txt2,
>     count(distinct txt1) over(partition by txt1) as n,
>     count(distinct txt2) over(partition by txt2) as m
> FROM bigd781b_new
> {code}
> as below.
> {code}
> +-------------+-------------------+----+----+
> |    txt1     |       txt2        | n  | m  |
> +-------------+-------------------+----+----+
> | 2010005759  | 7164335675012038  | 2  | 2  |
> | 2010005759  | 7164335675012038  | 2  | 2  |
> +-------------+-------------------+----+----+
> {code}
> While the correct output would be
> {code}
> +-------------+-------------------+----+----+
> |    txt1     |       txt2        | n  | m  |
> +-------------+-------------------+----+----+
> | 2010005759  | 7164335675012038  | 1  | 1  |
> | 2010005759  | 7164335675012038  | 1  | 1  |
> +-------------+-------------------+----+----+
> {code}
> The problem does not appear after setting below property
> set hive.vectorized.execution.ptf.enabled=false;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HIVE-24245) Vectorized PTF with count and distinct over partition producing incorrect results.

Reply via email to