How did you read/write the timestamp value from/to the ORC file?

On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia <shubh.chaura...@gmail.com> wrote:
> Hi All,
>
> Consider the following (Spark v2.4.0):
>
> Basically I change the value of `spark.sql.session.timeZone` and perform an
> ORC write. Here are 3 samples:
>
> 1)
> scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
>
> scala> val df = sc.parallelize(Seq("2019-04-23 09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
>
> df.show() Output        ORC File Contents
> -------------------------------------------------------------
> 2019-04-23 09:15:04     {"ts":"2019-04-23 09:15:04.0"}
>
> 2)
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
>
> df.show() Output        ORC File Contents
> -------------------------------------------------------------
> 2019-04-23 03:45:04     {"ts":"2019-04-23 09:15:04.0"}
>
> 3)
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
>
> df.show() Output        ORC File Contents
> -------------------------------------------------------------
> 2019-04-22 20:45:04     {"ts":"2019-04-23 09:15:04.0"}
>
> It can be seen that in all three cases it stores {"ts":"2019-04-23
> 09:15:04.0"} in the ORC file. I understand that the ORC file also contains
> the writer timezone, with respect to which Spark is able to convert back to
> the actual time when it reads the ORC file (and that is equal to df.show()).
>
> But it's problematic in the sense that it does not adjust (plus/minus) the
> timezone (spark.sql.session.timeZone) offsets for {"ts":"2019-04-23
> 09:15:04.0"} in the ORC file. I mean, loading the data into any system
> other than Spark would be a problem.
>
> Any ideas/suggestions on that?
>
> PS: For CSV files, it stores exactly what we see as the output of df.show().
>
> Thanks,
> Shubham
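For what it's worth, the df.show() outputs above are consistent with the timestamp being stored as a single instant and only re-rendered per session timezone: "2019-04-23 09:15:04" parsed under Asia/Kolkata (UTC+5:30) is the instant 2019-04-23 03:45:04 UTC, which is 2019-04-22 20:45:04 in America/Los_Angeles (PDT, UTC-7 in April). A minimal java.time sketch of that arithmetic, outside Spark (the object name `TimestampRender` is just for illustration):

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object TimestampRender {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse a wall-clock string under the zone it was entered in, then
  // re-render the resulting instant under another session zone.
  def render(wallClock: String, enteredIn: String, shownIn: String): String = {
    val instant = LocalDateTime.parse(wallClock, fmt)
      .atZone(ZoneId.of(enteredIn))
      .toInstant
    instant.atZone(ZoneId.of(shownIn)).format(fmt)
  }

  def main(args: Array[String]): Unit = {
    val ts = "2019-04-23 09:15:04" // entered with session tz Asia/Kolkata
    println(render(ts, "Asia/Kolkata", "Asia/Kolkata"))        // 2019-04-23 09:15:04
    println(render(ts, "Asia/Kolkata", "UTC"))                 // 2019-04-23 03:45:04
    println(render(ts, "Asia/Kolkata", "America/Los_Angeles")) // 2019-04-22 20:45:04
  }
}
```

So df.show() is shifting per session timezone exactly as this sketch does, while the ORC writer path is keeping the original wall-clock value plus a writer timezone, rather than normalizing the stored value itself.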