Guillaume created HIVE-22500:
--------------------------------
Summary: sort by and timestamp casting filters out too many rows
Key: HIVE-22500
URL: https://issues.apache.org/jira/browse/HIVE-22500
Project: Hive
Issue Type: Bug
Components: Hive
Affects Versions: 3.1.1
Reporter: Guillaume
Consider this example.
Preparation:
{code:java}
create temporary table opens as (
select stack(1,
1 , cast('2019-11-13 08:07:28' as timestamp)
) as (id , load_ts )
);
{code}
Queries. This is just about counting the number of rows, with filters always
matching, and possibly sort by. 1 is always expected.
{code:java}
select count(*) from ( select * from opens) t;
select count(*) from ( select * from opens sort by id) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13
08:07:00' ) t;
select count(*) from ( select * from opens where load_ts >= '2019-11-13
08:07:00' sort by id) t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13
09:07:00') t;
select count(*) from ( select * from opens where load_ts <= '2019-11-13
09:07:00' sort by id) t;
{code}
The latest query (_sort by_ and _<=_ on timestamp) returns 0 rows.
I believe that this is the cause of other issues I have, where I have missing
rows in queries with the timestamp (but not the explicit sort by).
Note that if instead of a temporary table I use a CTE for opens, the issue does
not appear.
I tried workarounds (inverse order of operands, adding _not_ or _not not_ ) to
no avail.
One thing that did work is to explicitly cast the string to a timestamp:
{code:java}
select count(*) from ( select * from opens where load_ts <= cast('2019-11-13
09:07:00' as timestamp) sort by id) t;{code}
It might be good practice indeed, but there still is a discrepancy between how
_>=_ and _<=_ are handled, or how _sort by_ works.
Note: this is on Hive from
[hdp3.1.4|[https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/release-notes/content/patch_hive.html]],
without llap.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)