Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
Did you re-create your df when you update the timezone conf?
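
In case it helps, a minimal spark-shell sketch of what I mean (illustrative only, not taken from the thread; the output path is just an example). If I recall correctly, the session-zone conf is picked up when the string-to-timestamp cast is resolved, so the DataFrame needs to be rebuilt after changing it:

```scala
// spark-shell sketch (Spark 2.4.x assumed)
spark.conf.set("spark.sql.session.timeZone", "UTC")

// Re-create the DataFrame AFTER changing the conf, so the
// string-to-timestamp cast is resolved against the new session zone:
val df = Seq("2019-04-23 09:15:04.0").toDF("ts")
  .withColumn("ts", $"ts".cast("timestamp"))

df.write.orc("/tmp/ts_utc.orc")  // path is hypothetical
```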



Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Writing:
scala> df.write.orc("")

To look into the contents, I used orc-tools-X.Y.Z-uber.jar
(https://orc.apache.org/docs/java-tools.html).
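
Roughly, the invocations look like this (the jar version and file names are placeholders; per the page linked above, the `data` command prints the rows as JSON, which is where the {"ts": ...} values come from):

```shell
# Dump rows as JSON:
java -jar orc-tools-X.Y.Z-uber.jar data part-00000-*.orc

# Dump file metadata (schema, stripes, etc.):
java -jar orc-tools-X.Y.Z-uber.jar meta part-00000-*.orc
```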



Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to ORC file?



DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Shubham Chaurasia
Hi All,

Consider the following (Spark v2.4.0):

Basically, I change the value of `spark.sql.session.timeZone` and perform an
ORC write. Here are three samples:

1)
scala> spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")

scala> val df = sc.parallelize(Seq("2019-04-23
09:15:04.0")).toDF("ts").withColumn("ts", col("ts").cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

df.show() Output      ORC File Contents
-------------------   ------------------------------
2019-04-23 09:15:04   {"ts":"2019-04-23 09:15:04.0"}

2)
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")

df.show() Output      ORC File Contents
-------------------   ------------------------------
2019-04-23 03:45:04   {"ts":"2019-04-23 09:15:04.0"}

3)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

df.show() Output      ORC File Contents
-------------------   ------------------------------
2019-04-22 20:45:04   {"ts":"2019-04-23 09:15:04.0"}
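
The three wall-clock values above can be cross-checked with plain java.time, independent of Spark. A minimal sketch (the object and method names are mine, not from the thread), assuming the underlying instant is 2019-04-23 09:15:04 in Asia/Kolkata, since that was the session zone when the string was cast:

```scala
import java.time.{LocalDateTime, ZoneId}
import java.time.format.DateTimeFormatter

object TimeZoneCheck {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // The string literal was parsed while the session zone was Asia/Kolkata,
  // so the underlying instant is 2019-04-23 09:15:04 IST (UTC+05:30).
  private val instant = LocalDateTime.parse("2019-04-23 09:15:04", fmt)
    .atZone(ZoneId.of("Asia/Kolkata"))

  // Rendering that single instant in a given session zone
  // reproduces the df.show() column above.
  def shownIn(zone: String): String =
    instant.withZoneSameInstant(ZoneId.of(zone)).format(fmt)

  def main(args: Array[String]): Unit = {
    println(shownIn("Asia/Kolkata"))        // 2019-04-23 09:15:04
    println(shownIn("UTC"))                 // 2019-04-23 03:45:04
    println(shownIn("America/Los_Angeles")) // 2019-04-22 20:45:04
  }
}
```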

It can be seen that in all three cases the ORC file stores {"ts":"2019-04-23
09:15:04.0"}. I understand that the ORC file also records the writer's
timezone, with respect to which Spark can convert back to the actual time when
it reads the ORC file (and that result matches df.show()).

But it's problematic in the sense that the timezone
(spark.sql.session.timeZone) offset is not adjusted (plus/minus) for
{"ts":"2019-04-23 09:15:04.0"} in the ORC file. That means loading the data
into any system other than Spark would be a problem.

Any ideas/suggestions on that?

PS: For CSV files, it stores exactly what we see as the output of df.show().

Thanks,
Shubham