[
https://issues.apache.org/jira/browse/ORC-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated ORC-763:
------------------------------
Fix Version/s: 1.6.8
> ORC timestamp inconsistencies before UNIX epoch
> -----------------------------------------------
>
> Key: ORC-763
> URL: https://issues.apache.org/jira/browse/ORC-763
> Project: ORC
> Issue Type: Bug
> Reporter: Zoltán Borók-Nagy
> Assignee: Gang Wu
> Priority: Critical
> Fix For: 1.7.0, 1.6.8
>
>
> I did some experiments with Hive (Java ORC 1.5.1) and Impala (C++ ORC 1.6.2)
> and found some bugs related to timestamps before the UNIX epoch.
> From Hive:
> 0: jdbc:hive2://localhost:11050/default> create table orc_ts (ts timestamp)
> stored as orc;
> 0: jdbc:hive2://localhost:11050/default> insert into orc_ts values
> ('1969-12-31 23:59:59'), ('1969-12-31 23:59:59.0001'), ('1969-12-31
> 23:59:59.001');
> 0: jdbc:hive2://localhost:11050/default> insert into orc_ts values
> ('1969-12-31 23:59:58'), ('1969-12-31 23:59:58.0001'), ('1969-12-31
> 23:59:58.001');
> 0: jdbc:hive2://localhost:11050/default> select * from orc_ts;
> +---------------------------+
> |orc_ts.ts|
> +---------------------------+
> |1969-12-31 23:59:59.0|
> |1969-12-31 23:59:59.0001|
> |1970-01-01 00:00:00.001|
> |1969-12-31 23:59:58.0|
> |1969-12-31 23:59:58.0001|
> |1969-12-31 23:59:58.001|
> +---------------------------+
> Please note that we inserted '1969-12-31 23:59:59.001' and we got
> '1970-01-01 00:00:00.001'. So Java ORC read/writes are inconsistent in
> themselves.
> From Impala:
>
> {noformat}
> [localhost:21050] default> select * from orc_ts;
> +-------------------------------+
> | ts |
> +-------------------------------+
> | 1969-12-31 23:59:59 |
> | 1969-12-31 23:59:58.000100000 |
> | 1970-01-01 00:00:00.001000000 |
> | 1969-12-31 23:59:58 |
> | 1969-12-31 23:59:57.000100000 |
> | 1969-12-31 23:59:58.001000000 |
> +-------------------------------+{noformat}
> From Impala the second and fifth timestamps are also off by one second. The
> third timestamp is also off by one second, but consistent with Java.
> https://issues.apache.org/jira/browse/ORC-306 mentions a Java bug that it ORC
> tries to workaround. Seems like data files store values in a way to
> workaround the Java issue which is unnecessary in C++.
> Looking at the code the Java and C++ code they construct timestamp values
> differently.
> C++:
> [https://github.com/apache/orc/blob/654f777bc34841eed3b340047308f8dac7f554db/c%2B%2B/src/ColumnReader.cc#L377-L379]
> {noformat}
> if (secsBuffer[i] < 0 && nanoBuffer[i] != 0) {
> secsBuffer[i] -= 1;
> }
> {noformat}
> Java:
> [https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1227-L1229]
>
> {noformat}
> if (millis < 0 && newNanos > 999_999) {
> millis -= TimestampTreeWriter.MILLIS_PER_SECOND;
> }
> {noformat}
> C++ 'checks for nanoBuffer[i] != 0' while Java checks for 'newNanos >
> 999_999'. Both only for timestamps before the epoch.
> This gives us a pattern when C++ and Java is inconsistent:
> * timestamps before the UNIX epoch, AND
> * have the format YYYY-MM-DD HH:MM:ss.000XXX
> I checked the actual values in the data files, written by ORC Java, read by
> ORC C++ lib:
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:59 => correct
> $1 = -1
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:59.0001 (orc
> C++)=> 1969-12-31 23:59:58.000100000
> $2 = -1
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:59.001 (orc
> C++)=> 1970-01-01 00:00:00.001000000
> $3 = 0{noformat}
> {noformat}
> (gdb) print secsBuffer[0] + epochOffset // 1969-12-31 23:59:58 => correct
> $9 = -2
> (gdb) print secsBuffer[1] + epochOffset // 1969-12-31 23:59:58.0001 (orc
> c++)=> 1969-12-31 23:59:57.000100000
> $10 = -2
> (gdb) print secsBuffer[2] + epochOffset // 1969-12-31 23:59:58.001 (orc
> c++)=> 1969-12-31 23:59:58.001000000
> $11 = -1{noformat}
> The seconds are the same for '1969-12-31 23:59:58' and '1969-12-31
> 23:59:58.0001', but differ for '1969-12-31 23:59:58.001'. I think this is a
> bug in the Java writer. The workaround for the Java bug (ORC-306) shouldn't
> have any effect on the written data files.
> So the values for the seconds are off by one when the timestamp is before the
> epoch and milliseconds are not zero.
> I think the data files should always store the values corresponding to the
> spec, i.e. number of seconds since the ORC epoch, plus additional nanoseconds
> that we need to add to the timestamp. If that'd be true then we wouldn't need
> the above 'if' statement in the c++ code.
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)