bersprockets commented on a change in pull request #34741:
URL: https://github.com/apache/spark/pull/34741#discussion_r759707950
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
##########
@@ -531,13 +533,16 @@ object OrcUtils extends Logging {
   }

   def fromOrcNTZ(ts: Timestamp): Long = {
-    DateTimeUtils.millisToMicros(ts.getTime) +
+    val utcMicros = DateTimeUtils.millisToMicros(ts.getTime) +
       (ts.getNanos / NANOS_PER_MICROS) % MICROS_PER_MILLIS
+    val micros = DateTimeUtils.fromUTCTime(utcMicros, TimeZone.getDefault.getID)
+    micros
   }

   def toOrcNTZ(micros: Long): OrcTimestamp = {
-    val seconds = Math.floorDiv(micros, MICROS_PER_SECOND)
-    val nanos = (micros - seconds * MICROS_PER_SECOND) * NANOS_PER_MICROS
+    val utcMicros = DateTimeUtils.toUTCTime(micros, TimeZone.getDefault.getID)

Review comment:
> I'm trying to understand this issue better. From the ORC source code, seems like

Almost: for the first point, the ORC writer stores the timestamp value passed by the caller as-is and records the writer's time zone in the file footer (that's why we need to shift the value before passing it to ORC).

The devil, however, is in the details. _I know enough to know there are issues with the above, but not enough to know every issue or solution_.

For example, the offset for a particular time zone is not fixed; it changes depending on the point on the timeline. When shifting, ORC determines the offsets for the two time zones based on the stored timestamp value (millis) ([see here](https://github.com/apache/orc/blob/334bf1f2c605f38c7e75ec81d1dab93c31fc8459/java/core/src/java/org/apache/orc/impl/SerializationUtils.java#L1444)). Because ORC uses the old time APIs to do this, we should ensure we pass a correct Hybrid Julian epoch time to ORC (via a Timestamp object), or ORC could derive the wrong offsets.

Because of that, I don't think you can pass an arbitrary long value in the Timestamp object and hope to properly reconstruct it on read: you might not know how ORC shifted it.
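To illustrate the point that an offset is not fixed, here is a minimal sketch using the same old `java.util.TimeZone` API that ORC relies on (the instants and expected offsets are my own illustration, not taken from the PR):

```scala
import java.util.TimeZone

// The offset for a zone depends on the instant you ask about:
// America/Los_Angeles is UTC-8 (PST) in winter and UTC-7 (PDT) in summer.
val la = TimeZone.getTimeZone("America/Los_Angeles")

val winterMillis = 1609459200000L  // 2021-01-01T00:00:00Z
val summerMillis = 1625097600000L  // 2021-07-01T00:00:00Z

val winterOffset = la.getOffset(winterMillis)  // -8 hours, in millis
val summerOffset = la.getOffset(summerMillis)  // -7 hours, in millis
```

This is why a shift computed at one point on the timeline cannot simply be reused at another.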
In addition, the fromUTCTime/toUTCTime/convertTZ utility methods work correctly, but are not appropriate for our needs here. They use the new Java time APIs, which don't work as we'd like for timestamps before the introduction of Railway time (circa 1883-11-17). Emulating here what fromUTCTime does, you can see the issue:
```
scala> val ldtUtc = ZonedDateTime.of(1883, 11, 16, 0, 0, 0, 0, ZoneId.of("UTC")).toLocalDateTime
ldtUtc: java.time.LocalDateTime = 1883-11-16T00:00

scala> val zdtShifted = ldtUtc.atZone(ZoneId.of("America/Los_Angeles"))
zdtShifted: java.time.ZonedDateTime = 1883-11-16T00:00-07:52:58[America/Los_Angeles]
```
Note how the offset that the new APIs applied to the shifted value is -07:52:58, which is not evenly divisible by 1 hour or 30 minutes. As a result, you end up with a timestamp like this:
```
scala> val ts = new Timestamp(zdtShifted.toInstant.toEpochMilli)
ts: java.sql.Timestamp = 1883-11-15 23:52:58.0
```
In fact, you can see a compounded version of this with this PR:
```
scala> TimeZone.setDefault(TimeZone.getTimeZone("UTC"))

scala> sql("select timestamp_ntz'1883-11-16 00:00:00.0' as ts_ntz").write.mode("overwrite").orc("test")

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> spark.read.orc("test").show(false)
+-------------------+
|ts_ntz             |
+-------------------+
|1883-11-16 00:07:02|
+-------------------+
```
Edit: I assume the above is because of pre-Railway shifting; I didn't verify that.

I updated my [POC](https://github.com/apache/spark/compare/master...bersprockets:orc_ntz_issue_play) to attempt to accommodate these issues by:
- stealing the timestamp shifting technique from the ORC code (thus avoiding fromUTCTime/toUTCTime), and
- passing Hybrid Julian values to ORC on write, and assuming ORC retrieves a Hybrid Julian value on read.

That seems to have eliminated the pre-Railway and pre-Gregorian issues. But there is more...
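For reference, the offset-difference shifting technique can be sketched roughly like this. This is a simplified, hypothetical helper in the spirit of the linked ORC `SerializationUtils` code, not the actual ORC implementation (which also handles edge cases such as a shift that lands across a DST boundary):

```scala
import java.util.TimeZone

// Hypothetical sketch: look up each zone's offset at the stored instant
// (using the old TimeZone API) and adjust by the difference.
def shiftMillis(millis: Long, from: TimeZone, to: TimeZone): Long =
  millis + (from.getOffset(millis) - to.getOffset(millis))

val utc = TimeZone.getTimeZone("UTC")
val la  = TimeZone.getTimeZone("America/Los_Angeles")

// Shifting a winter instant from UTC to Los Angeles adds 8 hours, because
// the old API reports an offset of -8:00 (PST) for that instant.
val shifted = shiftMillis(1609459200000L, utc, la)  // 2021-01-01T00:00:00Z
```

Because the lookups go through `java.util.TimeZone`, the input millis must be on the Hybrid Julian timeline for pre-Gregorian values, which is the point above about passing correct epoch values to ORC.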
The ORC API accepts and returns Timestamp objects for writing and reading ORC timestamp fields. This alone introduces some oddities that will be noticeable to end users. For example, not all timestamp_ntz values can be represented by a Timestamp in all time zones: the date/time '2021-03-14 02:15:00.0' doesn't exist in the America/Los_Angeles time zone.
```
scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> val ts = java.sql.Timestamp.valueOf("2021-03-14 02:15:00.0")
ts: java.sql.Timestamp = 2021-03-14 03:15:00.0
```
So we will have oddities like this (using my POC code):
```
scala> import java.util.TimeZone
import java.util.TimeZone

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> sql("select timestamp_ntz'2021-03-14 02:15:00.0' as ts_ntz").write.mode("overwrite").orc("test")

scala> spark.read.orc("test").show(false)
+-------------------+
|ts_ntz             |
+-------------------+
|2021-03-14 01:15:00|
+-------------------+
```
With this PR, you actually see it not with "spring forward" but with "fall back" (not sure why):
```
scala> import java.util.TimeZone
import java.util.TimeZone

scala> TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

scala> sql("select timestamp_ntz'1996-10-27T09:10:25.088353' as ts_ntz").write.mode("overwrite").orc("test")

scala> spark.read.orc("test").show(false)
+--------------------------+
|ts_ntz                    |
+--------------------------+
|1996-10-27 08:10:25.088353|
+--------------------------+
```
So to summarize:
- ORC needs the correct epoch values (pre-shifted, rebased to Hybrid Julian) to determine the correct offsets when shifting on read.
- You can't use fromUTCTime/toUTCTime (or convertTZ) for shifting pre-Railway datetime values.
- For the ORC timestamp type, the ORC API receives and returns Timestamp objects, and Timestamp objects alone introduce oddities.
- It would certainly be nice if we could save and retrieve a long value (à la useUTCTime) without affecting timestamp_ltz.

Edit: caveat: I am no expert in the time APIs.
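The nonexistent-local-time oddity above can be reproduced independently of Spark and ORC with plain `java.sql.Timestamp` (a minimal sketch; note it mutates the JVM default time zone):

```scala
import java.sql.Timestamp
import java.util.TimeZone

// 02:15 on 2021-03-14 does not exist in America/Los_Angeles: clocks spring
// forward from 02:00 to 03:00 that morning, so the old API silently
// resolves the parsed value to 03:15.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val ts = Timestamp.valueOf("2021-03-14 02:15:00.0")
// ts.toString == "2021-03-14 03:15:00.0"
```

Any design that round-trips timestamp_ntz through Timestamp inherits this behavior for local times skipped by a DST transition.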