Hi All,

Consider the following (Spark v2.4.0):
Basically I change the value of `spark.sql.session.timeZone` and perform an ORC write. Here are three samples:

1) scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
   scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
   df: org.apache.spark.sql.DataFrame = [ts: timestamp]
   scala> df.show()

   df.show() output        ORC file contents
   -------------------------------------------------------------
   2019-04-23 09:15:04     {"ts":"2019-04-23 09:15:04.0"}

2) scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
   scala> df.show()

   df.show() output        ORC file contents
   -------------------------------------------------------------
   2019-04-23 03:45:04     {"ts":"2019-04-23 09:15:04.0"}

3) scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
   scala> df.show()

   df.show() output        ORC file contents
   -------------------------------------------------------------
   2019-04-22 20:45:04     {"ts":"2019-04-23 09:15:04.0"}

In all three cases the ORC file stores {"ts":"2019-04-23 09:15:04.0"}. I understand that the ORC file also records the writer's timezone, which is how Spark can convert back to the actual time when it reads the ORC file (and that matches the df.show() output). But this is problematic in the sense that the timezone offset from `spark.sql.session.timeZone` is not applied (plus/minus) to the {"ts":"2019-04-23 09:15:04.0"} value in the ORC file, so loading the data into any system other than Spark would be a problem.

Any ideas/suggestions on that?

PS: For CSV files, Spark stores exactly what we see in the df.show() output.

Thanks,
Shubham
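For anyone hitting the same issue: one possible workaround (a sketch only, not an official recommendation) is to shift the instant yourself before the write, so the raw value stored in the ORC file already carries the desired offset. The target zone and output path below are illustrative.

```scala
import org.apache.spark.sql.functions.{col, from_utc_timestamp}

// Spark stores a timestamp column as an instant (microseconds since the UTC
// epoch); spark.sql.session.timeZone only changes how df.show() renders it
// and how strings are parsed, not the instant written to ORC. That is why
// the ORC contents above are identical across all three sessions.
//
// from_utc_timestamp shifts the instant by the zone's offset, so the raw
// stored value reads as local wall-clock time in that zone even for readers
// that ignore the writer-timezone metadata in the ORC footer.
val shifted = df.withColumn("ts", from_utc_timestamp(col("ts"), "America/Los_Angeles"))
shifted.write.orc("/tmp/ts_la.orc")  // illustrative output path
```

Note this changes the stored instant, so Spark itself would then display the shifted value; it only makes sense when the file is consumed by a system that treats the value as naive local time.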