Writing:
scala> df.write.orc("<some_path>")

To inspect the contents, I used orc-tools-X.Y.Z-uber.jar (https://orc.apache.org/docs/java-tools.html).

On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> How did you read/write the timestamp value from/to the ORC file?
>
> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <shubh.chaura...@gmail.com> wrote:
>
>> Hi All,
>>
>> Consider the following (Spark v2.4.0):
>>
>> Basically, I change the value of `spark.sql.session.timeZone` and perform an
>> ORC write. Here are 3 samples:
>>
>> 1)
>> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>>
>> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
>> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>>
>> df.show() Output        ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-23 09:15:04     {"ts":"2019-04-23 09:15:04.0"}
>>
>> 2)
>> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>>
>> df.show() Output        ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-23 03:45:04     {"ts":"2019-04-23 09:15:04.0"}
>>
>> 3)
>> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>>
>> df.show() Output        ORC File Contents
>> -------------------------------------------------------------
>> 2019-04-22 20:45:04     {"ts":"2019-04-23 09:15:04.0"}
>>
>> It can be seen that in all three cases it stores
>> {"ts":"2019-04-23 09:15:04.0"} in the ORC file. I understand that the ORC
>> file also contains the writer timezone, with respect to which Spark is able
>> to convert back to the actual time when it reads the ORC file (and that is
>> equal to the df.show() output).
>>
>> But it's problematic in the sense that it is not adjusting (plus/minus) the
>> timezone (spark.sql.session.timeZone) offsets for
>> {"ts":"2019-04-23 09:15:04.0"} in the ORC file. I mean, loading the data
>> into any system other than Spark would be a problem.
>>
>> Any ideas/suggestions on that?
>>
>> PS: For CSV files, it stores exactly what we see as the output of
>> df.show().
>>
>> Thanks,
>> Shubham
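
For anyone checking the numbers in the three samples: the df.show() outputs are consistent with a single instant rendered in three different session time zones. Below is a minimal sketch of that arithmetic using only java.time (no Spark or ORC involved). The object and helper names (SessionTimeZoneSketch, parseIn, renderIn) are made up for illustration, and this does not reproduce how Spark or the ORC writer handle timestamps internally.

```scala
import java.time.{LocalDateTime, ZoneId, ZonedDateTime}
import java.time.format.DateTimeFormatter

// Hypothetical helper object, for illustration only.
object SessionTimeZoneSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Interpret a wall-clock string as local time in the given zone.
  def parseIn(text: String, zone: String): ZonedDateTime =
    LocalDateTime.parse(text, fmt).atZone(ZoneId.of(zone))

  // Render the same instant as wall-clock time in another zone.
  def renderIn(ts: ZonedDateTime, zone: String): String =
    ts.withZoneSameInstant(ZoneId.of(zone)).format(fmt)

  def main(args: Array[String]): Unit = {
    // The string from sample 1, read under the Asia/Kolkata session time zone.
    val ts = parseIn("2019-04-23 09:15:04", "Asia/Kolkata")
    Seq("Asia/Kolkata", "UTC", "America/Los_Angeles").foreach { zone =>
      println(s"$zone -> ${renderIn(ts, zone)}")
    }
    // Prints:
    // Asia/Kolkata -> 2019-04-23 09:15:04
    // UTC -> 2019-04-23 03:45:04
    // America/Los_Angeles -> 2019-04-22 20:45:04
  }
}
```

These match the df.show() column in samples 1 to 3, while the value seen in the ORC file stays at the Asia/Kolkata wall-clock time in every case.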