Soumyakanti Das created HIVE-29122:
--------------------------------------
Summary: Vectorization - IGNORE NULLS not supported for
FIRST_VALUE and LAST_VALUE
Key: HIVE-29122
URL: https://issues.apache.org/jira/browse/HIVE-29122
Project: Hive
Issue Type: Bug
Components: Vectorization
Affects Versions: 4.2.0
Reporter: Soumyakanti Das
Assignee: Soumyakanti Das
Currently we get wrong results for FIRST_VALUE and LAST_VALUE functions with
IGNORE NULLS when vectorization is enabled.
Consider this table:
{noformat}
CREATE TABLE window_double_test (
id INT,
double_col DOUBLE
);
INSERT INTO window_double_test VALUES
(1, NULL),
(2, NULL),
(1, NULL),
(1, NULL),
(2, 25.5),
(3, NULL),
(6, 65.5),
(2, NULL),
(3, 32.5),
(4, NULL),
(3, 30.5),
(5, 50.5),
(4, 42.3),
(5, NULL),
(6, 65.2),
(7, NULL),
(NULL, 80.5),
(NULL, NULL);{noformat}
Non-vectorized:
{noformat}
set hive.vectorized.execution.enabled=false;
SELECT id, double_col,
FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
FROM window_double_test;
id double_col first_double last_double
1 NULL NULL NULL
1 NULL NULL NULL
1 NULL NULL NULL
2 NULL 25.5 25.5
2 25.5 25.5 25.5
2 NULL 25.5 25.5
3 NULL 25.5 32.5
3 30.5 25.5 32.5
3 32.5 25.5 32.5
4 NULL 25.5 42.3
4 42.3 25.5 42.3
5 50.5 25.5 50.5
5 NULL 25.5 50.5
6 65.2 25.5 65.5
6 65.5 25.5 65.5
7 NULL 25.5 65.5
NULL NULL 25.5 80.5
NULL 80.5 25.5 80.5{noformat}
Vectorized:
{noformat}
set hive.vectorized.execution.enabled=true;
SELECT id, double_col,
FIRST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS first_double,
LAST_VALUE(double_col) IGNORE NULLS OVER(ORDER BY id) AS last_double
FROM window_double_test;
id double_col first_double last_double
1 NULL NULL NULL
1 NULL NULL NULL
1 NULL NULL NULL
2 NULL NULL NULL
2 25.5 NULL NULL
2 NULL NULL NULL
3 NULL NULL 32.5
3 30.5 NULL 32.5
3 32.5 NULL 32.5
4 NULL NULL 42.3
4 42.3 NULL 42.3
5 50.5 NULL NULL
5 NULL NULL NULL
6 65.2 NULL 65.5
6 65.5 NULL 65.5
7 NULL NULL NULL
NULL NULL NULL 80.5
NULL 80.5 NULL 80.5{noformat}
We can see the differences for id = 2, for example. Similar results can also be
seen for int and decimal columns too. We don't see the same for a String column
as only LONG, DECIMAL, and DOUBLE are supported, as can be seen in
Vectorizer#validatePTFOperator.
There are other scenarios where it bypasses the Vectorizer, for example, if we
explicitly use ROWS instead of the default RANGE, etc. Anything that skips
vectorization produces correct results.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)