Stamatis Zampetakis created HIVE-26270:
------------------------------------------

             Summary: Wrong timestamps when reading Hive 3.1.x Parquet files 
with vectorized reader
                 Key: HIVE-26270
                 URL: https://issues.apache.org/jira/browse/HIVE-26270
             Project: Hive
          Issue Type: Bug
          Components: HiveServer2, Parquet
            Reporter: Stamatis Zampetakis
            Assignee: Stamatis Zampetakis


Parquet files written in Hive 3.1.x onwards with timezone set to US/Pacific.
{code:sql}
CREATE TABLE employee (eid INT, birth timestamp) STORED AS PARQUET;

INSERT INTO employee VALUES 
(1, '1880-01-01 00:00:00'),
(2, '1884-01-01 00:00:00'),
(3, '1990-01-01 00:00:00');
{code}
Parquet files read with Hive 4.0.0-alpha-1 onwards.

+Without vectorization+ results are correct.
{code:sql}
SELECT * FROM employee;
{code}
{noformat}
1       1880-01-01 00:00:00
2       1884-01-01 00:00:00
3       1990-01-01 00:00:00
{noformat}
+With vectorization+ some timestamps are shifted.
{code:sql}
-- Disable fetch task conversion to force vectorization to kick in
set hive.fetch.task.conversion=none;
SELECT * FROM employee;
{code}
{noformat}
1       1879-12-31 23:52:58
2       1884-01-01 00:00:00
3       1990-01-01 00:00:00
{noformat}
The problem is the same as the one reported under HIVE-24074. The data were 
written using the new Date/Time APIs (java.time) in Hive 3.1.3, and here they 
were read using the old APIs (java.sql).
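
As a rough illustration (this is not the Hive reader code), the 7 minutes and 2 
seconds by which row 1 is shifted match the gap between the local mean time 
offset that java.time applies to US/Pacific before 1883-11-18 and the fixed 
-08:00 standard offset. The sketch below only prints the java.time offsets for 
the three sample values. The class name {{PacificOffsetSketch}} and the helper 
{{offsetAt}} are made up for this example, and the idea that the legacy path 
ends up using -08:00 for 1880 is an assumption that is not verified here.
{code:java}
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical helper class, for illustration only.
class PacificOffsetSketch {

    // Offset that the java.time zone rules assign to a local date-time in US/Pacific.
    static ZoneOffset offsetAt(LocalDateTime local) {
        return ZoneId.of("US/Pacific").getRules().getOffset(local);
    }

    public static void main(String[] args) {
        LocalDateTime[] births = {
            LocalDateTime.of(1880, 1, 1, 0, 0),
            LocalDateTime.of(1884, 1, 1, 0, 0),
            LocalDateTime.of(1990, 1, 1, 0, 0)
        };
        int standardSeconds = ZoneOffset.ofHours(-8).getTotalSeconds();
        for (LocalDateTime birth : births) {
            ZoneOffset offset = offsetAt(birth);
            // Gap between the java.time offset and the fixed -08:00 standard offset.
            int gapSeconds = offset.getTotalSeconds() - standardSeconds;
            System.out.println(birth + " offset=" + offset + " gapSeconds=" + gapSeconds);
        }
    }
}
{code}
With the time zone data shipped in the JDK, the gap is 422 seconds (7 minutes 
and 2 seconds) for the 1880 value and 0 for the other two, which is consistent 
with only the first row being affected.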

The difference from HIVE-24074 is that here the problem appears only for 
vectorized execution, while the non-vectorized reader works fine, so there 
is an *inconsistency in the behavior* of the vectorized and non-vectorized 
readers.

The non-vectorized reader works fine because it automatically infers that it 
should use the new JDK APIs to read back the timestamp value. This is possible 
in this case because the file contains metadata (i.e., the presence of 
{{writer.time.zone}}) from which it can infer that the timestamps were 
written using the new Date/Time APIs.

The inconsistent behavior between the vectorized and non-vectorized readers is 
a regression caused by HIVE-25104. This JIRA is an attempt to re-align the 
behavior of the vectorized and non-vectorized readers.

Note that if the file metadata are empty, neither the vectorized nor the 
non-vectorized reader can determine which APIs to use for the conversion. In 
this case the user must set 
{{hive.parquet.timestamp.legacy.conversion.enabled}} explicitly to get back the 
correct results.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)