Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

Shubham Chaurasia Wed, 24 Apr 2019 06:18:46 -0700

Writing:
scala> df.write.orc("<some_path>")

To inspect the file contents, I used orc-tools-X.Y.Z-uber.jar
(https://orc.apache.org/docs/java-tools.html).
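
For reference, the offsets in the three samples quoted below can be
reproduced with plain java.time, without Spark. This is only a sketch: the
object name SessionTimeZoneSketch and the helper render are made up for
illustration, and it assumes the stored value was written as Asia/Kolkata
wall-clock time, which the thread does not state explicitly. It models the
offset arithmetic only, not Spark's or ORC's actual read path.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical helper object, for illustration only.
object SessionTimeZoneSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Interpret `wallClock` as local time in `writerZone`, then render the
  // same instant as local time in `sessionZone`.
  def render(wallClock: String, writerZone: String, sessionZone: String): String = {
    val instant = LocalDateTime.parse(wallClock, fmt)
      .atZone(ZoneId.of(writerZone))
      .toInstant
    fmt.format(instant.atZone(ZoneId.of(sessionZone)))
  }

  def main(args: Array[String]): Unit = {
    val stored = "2019-04-23 09:15:04"  // value seen in the ORC file
    val writerZone = "Asia/Kolkata"     // assumption, see the note above
    Seq("Asia/Kolkata", "UTC", "America/Los_Angeles").foreach { zone =>
      println(s"$zone -> ${render(stored, writerZone, zone)}")
    }
  }
}
```

Under that assumption, the three rendered values agree with the df.show()
column in the samples, while the stored string stays the same.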


On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> How did you read/write the timestamp value from/to ORC file?
>
> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <
> shubh.chaura...@gmail.com> wrote:
>
>> Hi All,
>>
>> Consider the following (Spark v2.4.0):
>>
>> I change the value of `spark.sql.session.timeZone` and then perform an
>> ORC write. Here are three samples:
>>
>> 1)
>> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>>
>> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
>> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>>
>> df.show() Output       ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-23 09:15:04    {"ts":"2019-04-23 09:15:04.0"}
>>
>> 2)
>> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>
>> df.show() Output       ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-23 03:45:04    {"ts":"2019-04-23 09:15:04.0"}
>>
>> 3)
>> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>
>> df.show() Output       ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-22 20:45:04    {"ts":"2019-04-23 09:15:04.0"}
>>
>> In all three cases, the ORC file stores {"ts":"2019-04-23 09:15:04.0"}.
>> I understand that the ORC file also records the writer's timezone, which
>> lets Spark convert the value back to the actual time when it reads the
>> file (and that result matches df.show()).
>>
>> The problem is that the value written to the ORC file,
>> {"ts":"2019-04-23 09:15:04.0"}, is not adjusted (plus or minus) by the
>> spark.sql.session.timeZone offset. Loading this data into any system other
>> than Spark would therefore be a problem.
>>
>> Any ideas/suggestions on that?
>>
>> PS: For CSV files, Spark stores exactly what we see in the output of
>> df.show().
>>
>> Thanks,
>> Shubham
>>
>>
