Hi All,

Consider the following (Spark v2.4.0):

Basically, I change the value of `spark.sql.session.timeZone` and perform an
ORC write. Here are 3 samples:

1)
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")

scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]
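
In each sample, the ORC file is then produced with a plain write (the path
below is just illustrative):

scala> df.write.mode("overwrite").orc("/tmp/ts_orc")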

df.show() Output                  ORC File Contents
-------------------------------------------------------------
2019-04-23 09:15:04           {"ts":"2019-04-23 09:15:04.0"}

2)
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

df.show() Output                  ORC File Contents
-------------------------------------------------------------
2019-04-23 03:45:04           {"ts":"2019-04-23 09:15:04.0"}

3)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

df.show() Output                  ORC File Contents
-------------------------------------------------------------
2019-04-22 20:45:04           {"ts":"2019-04-23 09:15:04.0"}

It can be seen that in all three cases it stores {"ts":"2019-04-23
09:15:04.0"} in the ORC file. I understand that the ORC file also contains
the writer timezone, which is how Spark is able to convert back to the actual
time when it reads the ORC file (and that read-back value equals what
df.show() displays).
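
Reading the file back in Spark confirms this round-trip (same illustrative
path as in the write above):

scala> spark.read.orc("/tmp/ts_orc").show()

which prints the same value as df.show() under the current session timezone.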

But it's problematic in the sense that the value {"ts":"2019-04-23
09:15:04.0"} in the ORC file is not adjusted (plus/minus) by the
spark.sql.session.timeZone offset. This means loading the data into any
system other than Spark would be a problem, since such a system would read
the unadjusted value.

Any ideas/suggestions on that?
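
For what it's worth, the only workaround I can think of is to render the
column as a session-local string before the write (losing the native
timestamp type in the process), since casting a timestamp to a string does
follow spark.sql.session.timeZone. A rough sketch (the output path is
illustrative):

scala> import org.apache.spark.sql.functions._
scala> df.withColumn("ts", date_format(col("ts"), "yyyy-MM-dd HH:mm:ss.S")).write.mode("overwrite").orc("/tmp/ts_orc_str")

Downstream readers would then see the same wall-clock value that df.show()
prints, but as a string rather than a timestamp.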

PS: For CSV files, Spark stores exactly what we see as the output of df.show().
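
For example (illustrative path):

scala> df.write.mode("overwrite").csv("/tmp/ts_csv")

and the resulting file contains the session-local rendering of the timestamp.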

Thanks,
Shubham
