[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202322#comment-16202322 ]

Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
---------------------------------------------------------------------

Hello all,

I have a use case where a {{Dataset}} contains a column of type 
{{java.sql.Timestamp}} (let's call it {{_time}}), which I use to derive new 
columns holding the year, month, day and hour of the {{_time}} value, 
with something like:
{code:java}
session.read.schema(mySchema)
            .json(path)
            .withColumn("year", year($"_time"))
            .withColumn("month", month($"_time"))
            .withColumn("day", dayofmonth($"_time"))
            .withColumn("hour", hour($"_time"))
{code}
using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions 
defined in {{org.apache.spark.sql.functions}}.

Now let's assume the time zone is row-dependent - and let's call {{_tz}} the 
column that contains it. Since the time zone varies per row, I cannot 
configure the {{DataFrameReader}} with a {{timeZone}} option.
I wondered if something like this would be advisable:
{code:java}
session.read.schema(mySchema)
            .json(path)
            .withColumn("year", year($"_time"))
            .withColumn("month", month($"_time"))
            .withColumn("day", dayofmonth($"_time"))
            .withColumn("hour", hour($"_time", $"_tz"))
{code}
Looking at the definition of the {{hour}} function, it uses an {{Hour}} 
expression which can be constructed with an optional {{timeZoneId}}.
I have tried to create an {{Hour}} expression directly, but it is a 
Spark-internal construct and the public API does not allow using it.
I guess providing a function {{hour(t: Column, tz: Column)}} along with the 
existing {{hour(t: Column)}} would not be a satisfying design.
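
For completeness, the only row-level workaround I can see today is a plain 
UDF - here is a minimal sketch, assuming {{_tz}} holds valid zone IDs such as 
{{"Europe/Paris"}} (the {{hourInTz}} name is made up for illustration):
{code:java}
import java.time.{Instant, ZoneId}
import org.apache.spark.sql.functions.udf
import session.implicits._  // for the $"..." column syntax

// Hypothetical helper: compute the hour of a timestamp in a per-row time zone.
// Note: nulls in _time or _tz would need to be handled before calling this.
val hourInTz = udf { (ts: java.sql.Timestamp, tz: String) =>
  Instant.ofEpochMilli(ts.getTime).atZone(ZoneId.of(tz)).getHour
}

session.read.schema(mySchema)
            .json(path)
            .withColumn("hour", hourInTz($"_time", $"_tz"))
{code}
This bypasses Catalyst's built-in datetime expressions, so it will not benefit 
from their optimizations, but it does keep the time zone at the row level.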

Do you think an elegant solution exists for this use case? Or is the 
methodology I use flawed - i.e. should I avoid deriving the hour from a 
timestamp column when it relies on a row-dependent time zone that is not 
known in advance?



> Support session local timezone
> ------------------------------
>
>                 Key: SPARK-18350
>                 URL: https://issues.apache.org/jira/browse/SPARK-18350
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Takuya Ueshin
>              Labels: releasenotes
>             Fix For: 2.2.0
>
>         Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime 
> manipulation, which is bad if users are not in the same timezones as the 
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for 
> execution.
> An explicit non-goal is locale handling.


