[ 
https://issues.apache.org/jira/browse/ORC-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated ORC-763:
------------------------------
    Priority: Critical  (was: Major)

> ORC timestamp inconsistencies before UNIX epoch
> -----------------------------------------------
>
>                 Key: ORC-763
>                 URL: https://issues.apache.org/jira/browse/ORC-763
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Priority: Critical
>
> I did some experiments with Hive (Java ORC 1.5.1) and Impala (C++ ORC 1.6.2) 
> and found some bugs related to timestamps before the UNIX epoch.
> From Hive:
>  0: jdbc:hive2://localhost:11050/default> create table orc_ts (ts timestamp) 
> stored as orc;
>  0: jdbc:hive2://localhost:11050/default> insert into orc_ts values 
> ('1969-12-31 23:59:59'), ('1969-12-31 23:59:59.0001'), ('1969-12-31 
> 23:59:59.001');
>  0: jdbc:hive2://localhost:11050/default> insert into orc_ts values 
> ('1969-12-31 23:59:58'), ('1969-12-31 23:59:58.0001'), ('1969-12-31 
> 23:59:58.001');
>  0: jdbc:hive2://localhost:11050/default> select * from orc_ts;
>  +---------------------------+
> |orc_ts.ts|
> +---------------------------+
> |1969-12-31 23:59:59.0|
> |1969-12-31 23:59:59.0001|
> |1970-01-01 00:00:00.001|
> |1969-12-31 23:59:58.0|
> |1969-12-31 23:59:58.0001|
> |1969-12-31 23:59:58.001|
> +---------------------------+
>  Please note that we inserted '1969-12-31 23:59:59.001' and we got 
> '1970-01-01 00:00:00.001'. So Java ORC read/writes are inconsistent in 
> themselves.
> From Impala:
>  
> {noformat}
> [localhost:21050] default> select * from orc_ts;
> +-------------------------------+
> | ts                            |
> +-------------------------------+
> | 1969-12-31 23:59:59           |
> | 1969-12-31 23:59:58.000100000 |
> | 1970-01-01 00:00:00.001000000 |
> | 1969-12-31 23:59:58           |
> | 1969-12-31 23:59:57.000100000 |
> | 1969-12-31 23:59:58.001000000 |
> +-------------------------------+{noformat}
> From Impala the second and fifth timestamps are also off by one second. The 
> third timestamp is also off by one second, but consistent with Java.
> https://issues.apache.org/jira/browse/ORC-306 mentions a Java bug that it ORC 
> tries to workaround. Seems like data files store values in a way to 
> workaround the Java issue which is unnecessary in C++.
> Looking at the code the Java and C++ code they construct timestamp values 
> differently.
> C++:
> [https://github.com/apache/orc/blob/654f777bc34841eed3b340047308f8dac7f554db/c%2B%2B/src/ColumnReader.cc#L377-L379]
> {noformat}
> if (secsBuffer[i] < 0 && nanoBuffer[i] != 0) {
>   secsBuffer[i] -= 1;
> }
> {noformat}
> Java:
> [https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1227-L1229]
>  
> {noformat}
> if (millis < 0 && newNanos > 999_999) {
>   millis -= TimestampTreeWriter.MILLIS_PER_SECOND;
> }
> {noformat}
> C++ 'checks for nanoBuffer[i] != 0' while Java checks for 'newNanos > 
> 999_999'. Both only for timestamps before the epoch.
> This gives us a pattern when C++ and Java is inconsistent:
>  * timestamps before the UNIX epoch, AND
>  * have the format YYYY-MM-DD HH:MM:ss.000XXX
> I checked the actual values in the data files, written by ORC Java, read by 
> ORC C++ lib:
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:59 => correct
> $1 = -1
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:59.0001 (orc 
> C++)=> 1969-12-31 23:59:58.000100000
> $2 = -1
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:59.001 (orc 
> C++)=> 1970-01-01 00:00:00.001000000
> $3 = 0{noformat}
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:58 => correct
> $9 = -2
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:58.0001 (orc 
> c++)=> 1969-12-31 23:59:57.000100000
> $10 = -2
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:58.001 (orc 
> c++)=> 1969-12-31 23:59:58.001000000
> $11 = -1{noformat}
> The seconds are the same for '1969-12-31 23:59:58' and '1969-12-31 
> 23:59:58.0001', but differ for '1969-12-31 23:59:58.001'. I think this is a 
> bug in the Java writer. The workaround for the Java bug (ORC-306) shouldn't 
> have any effect on the written data files.
> So the values for the seconds are off by one when the timestamp is before the 
> epoch and milliseconds are not zero.
> I think the data files should always store the values corresponding to the 
> spec, i.e. number of seconds since the ORC epoch, plus additional nanoseconds 
> that we need to add to the timestamp. If that'd be true then we wouldn't need 
> the above 'if' statement in the c++ code.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to