How did you read/write the timestamp value from/to ORC file?

On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <shubh.chaura...@gmail.com>
wrote:

> Hi All,
>
> Consider the following (Spark v2.4.0):
>
> Basically, I change the value of `spark.sql.session.timeZone` and perform
> an ORC write. Here are 3 samples:
>
> 1)
> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>
> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>
> df.show() Output                  ORC File Contents
> -------------------------------------------------------------
> 2019-04-23 09:15:04           {"ts":"2019-04-23 09:15:04.0"}
>
> 2)
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>
> df.show() Output                  ORC File Contents
> -------------------------------------------------------------
> 2019-04-23 03:45:04           {"ts":"2019-04-23 09:15:04.0"}
>
> 3)
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> df.show() Output                  ORC File Contents
> -------------------------------------------------------------
> 2019-04-22 20:45:04           {"ts":"2019-04-23 09:15:04.0"}
>
> It can be seen that in all three cases it stores {"ts":"2019-04-23
> 09:15:04.0"} in the ORC file. I understand that the ORC file also contains
> the writer timezone, with respect to which Spark is able to convert back to
> the actual time when it reads the ORC file (and that equals the df.show()
> output).
>
> But it is problematic in the sense that Spark does not adjust (plus/minus)
> the timezone (spark.sql.session.timeZone) offset for {"ts":"2019-04-23
> 09:15:04.0"} in the ORC file. I mean, loading the data into any system
> other than Spark would be a problem.
>
> Any ideas/suggestions on that?
>
> PS: For CSV files, it stores exactly what we see as the output of df.show().
>
> Thanks,
> Shubham
>
>
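The display shift in the three samples above can be reproduced outside Spark. Assuming the timestamp is stored as a single instant (the value was parsed while the session timezone was Asia/Kolkata) and df.show() merely renders that instant in the current session timezone, the three outputs follow directly. A minimal sketch with Python's standard zoneinfo module, used here only for illustration:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# The literal "2019-04-23 09:15:04" was parsed while the session timezone
# was Asia/Kolkata, so it denotes this single instant:
instant = datetime(2019, 4, 23, 9, 15, 4, tzinfo=ZoneInfo("Asia/Kolkata"))

# Rendering the same instant in each session timezone reproduces the three
# df.show() outputs from the samples above:
for tz in ("Asia/Kolkata", "UTC", "America/Los_Angeles"):
    print(f"{tz:<20}{instant.astimezone(ZoneInfo(tz)):%Y-%m-%d %H:%M:%S}")
# Asia/Kolkata        2019-04-23 09:15:04
# UTC                 2019-04-23 03:45:04
# America/Los_Angeles 2019-04-22 20:45:04
```

This suggests the instant itself never changes across the three runs; only its textual rendering does, which matches the observation that the ORC file contents stay constant while df.show() varies.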
