[jira] [Commented] (ORC-763) ORC timestamp inconsistencies before UNIX epoch

Jira Mon, 22 Mar 2021 06:50:04 -0700


    [ 
https://issues.apache.org/jira/browse/ORC-763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306201#comment-17306201
 ]


Zoltán Borók-Nagy commented on ORC-763:
---------------------------------------

Thanks for the confirmation [~wgtmac].

What seems problematic to me is that the workaround to the java.sql.Timestamp 
bug got leaked into the written data files (instead of staying inside the Java 
ORC library), i.e. now the data files don't comply to the ORC specification. 
It's also the reason that we cannot distinguish between 1970-01-01 00.00.00.001 
and 1969-12-31 23.59.59.001.

I think the ideal solution would be to write data corresponding to the 
specification (there would be no ambiguous values either), and workaround any 
bugs inside the boundaries of the ORC reader/writer implementations. But I can 
see that there are probably a lot of ORC files that have been written this way, 
so maybe the best thing is really just to keep things this way (but 
consistently across engines), and extending the specification with these 
details?

> ORC timestamp inconsistencies before UNIX epoch
> -----------------------------------------------
>
>                 Key: ORC-763
>                 URL: https://issues.apache.org/jira/browse/ORC-763
>             Project: ORC
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Priority: Critical
>
> I did some experiments with Hive (Java ORC 1.5.1) and Impala (C++ ORC 1.6.2) 
> and found some bugs related to timestamps before the UNIX epoch.
> From Hive:
>  0: jdbc:hive2://localhost:11050/default> create table orc_ts (ts timestamp) 
> stored as orc;
>  0: jdbc:hive2://localhost:11050/default> insert into orc_ts values 
> ('1969-12-31 23:59:59'), ('1969-12-31 23:59:59.0001'), ('1969-12-31 
> 23:59:59.001');
>  0: jdbc:hive2://localhost:11050/default> insert into orc_ts values 
> ('1969-12-31 23:59:58'), ('1969-12-31 23:59:58.0001'), ('1969-12-31 
> 23:59:58.001');
>  0: jdbc:hive2://localhost:11050/default> select * from orc_ts;
>  +---------------------------+
> |orc_ts.ts|
> +---------------------------+
> |1969-12-31 23:59:59.0|
> |1969-12-31 23:59:59.0001|
> |1970-01-01 00:00:00.001|
> |1969-12-31 23:59:58.0|
> |1969-12-31 23:59:58.0001|
> |1969-12-31 23:59:58.001|
> +---------------------------+
>  Please note that we inserted '1969-12-31 23:59:59.001' and we got 
> '1970-01-01 00:00:00.001'. So Java ORC read/writes are inconsistent in 
> themselves.
> From Impala:
>  
> {noformat}
> [localhost:21050] default> select * from orc_ts;
> +-------------------------------+
> | ts                            |
> +-------------------------------+
> | 1969-12-31 23:59:59           |
> | 1969-12-31 23:59:58.000100000 |
> | 1970-01-01 00:00:00.001000000 |
> | 1969-12-31 23:59:58           |
> | 1969-12-31 23:59:57.000100000 |
> | 1969-12-31 23:59:58.001000000 |
> +-------------------------------+{noformat}
> From Impala the second and fifth timestamps are also off by one second. The 
> third timestamp is also off by one second, but consistent with Java.
> https://issues.apache.org/jira/browse/ORC-306 mentions a Java bug that it ORC 
> tries to workaround. Seems like data files store values in a way to 
> workaround the Java issue which is unnecessary in C++.
> Looking at the code the Java and C++ code they construct timestamp values 
> differently.
> C++:
> [https://github.com/apache/orc/blob/654f777bc34841eed3b340047308f8dac7f554db/c%2B%2B/src/ColumnReader.cc#L377-L379]
> {noformat}
> if (secsBuffer[i] < 0 && nanoBuffer[i] != 0) {
>   secsBuffer[i] -= 1;
> }
> {noformat}
> Java:
> [https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1227-L1229]
>  
> {noformat}
> if (millis < 0 && newNanos > 999_999) {
>   millis -= TimestampTreeWriter.MILLIS_PER_SECOND;
> }
> {noformat}
> C++ 'checks for nanoBuffer[i] != 0' while Java checks for 'newNanos > 
> 999_999'. Both only for timestamps before the epoch.
> This gives us a pattern when C++ and Java is inconsistent:
>  * timestamps before the UNIX epoch, AND
>  * have the format YYYY-MM-DD HH:MM:ss.000XXX
> I checked the actual values in the data files, written by ORC Java, read by 
> ORC C++ lib:
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:59 => correct
> $1 = -1
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:59.0001 (orc 
> C++)=> 1969-12-31 23:59:58.000100000
> $2 = -1
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:59.001 (orc 
> C++)=> 1970-01-01 00:00:00.001000000
> $3 = 0{noformat}
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:58 => correct
> $9 = -2
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:58.0001 (orc 
> c++)=> 1969-12-31 23:59:57.000100000
> $10 = -2
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:58.001 (orc 
> c++)=> 1969-12-31 23:59:58.001000000
> $11 = -1{noformat}
> The seconds are the same for '1969-12-31 23:59:58' and '1969-12-31 
> 23:59:58.0001', but differ for '1969-12-31 23:59:58.001'. I think this is a 
> bug in the Java writer. The workaround for the Java bug (ORC-306) shouldn't 
> have any effect on the written data files.
> So the values for the seconds are off by one when the timestamp is before the 
> epoch and milliseconds are not zero.
> I think the data files should always store the values corresponding to the 
> spec, i.e. number of seconds since the ORC epoch, plus additional nanoseconds 
> that we need to add to the timestamp. If that'd be true then we wouldn't need 
> the above 'if' statement in the c++ code.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (ORC-763) ORC timestamp inconsistencies before UNIX epoch

Reply via email to