[ https://issues.apache.org/jira/browse/SPARK-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202322#comment-16202322 ]
Alexandre Dupriez edited comment on SPARK-18350 at 10/12/17 5:30 PM:
---------------------------------------------------------------------

Hello all,

I have a use case where a {{Dataset}} contains a column of type {{java.sql.Timestamp}} (let's call it {{_time}}), which I use to derive new columns with the year, month, day and hour specified by the {{_time}} column, with something like:

{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time"))
{code}

using the standard {{year}}, {{month}}, {{dayofmonth}} and {{hour}} functions defined in {{org.apache.spark.sql.functions}}.

Now let's assume the timezone is row-dependent, and let's call {{_tz}} the column which contains it. Because the timezone varies per row, I cannot configure the {{DataFrameWriter}} with a {{timeZone}} option. I wondered if something like this would be advisable:

{code:java}
session.read.schema(mySchema)
  .json(path)
  .withColumn("year", year($"_time"))
  .withColumn("month", month($"_time"))
  .withColumn("day", dayofmonth($"_time"))
  .withColumn("hour", hour($"_time", $"_tz"))
{code}

Looking at the definition of the {{hour}} function, it uses an {{Hour}} expression which can be constructed with an optional {{timeZoneId}}. I have tried to create an {{Hour}} expression directly, but it is a Spark-internal construct and the API does not allow using it from user code. I suspect that providing a function {{hour(t: Column, tz: Column)}} alongside the existing {{hour(t: Column)}} would not be a satisfying design.

Do you think an elegant solution exists for this use case? Or is my methodology flawed, i.e. should I avoid deriving the hour from a timestamp column when it relies on a row-dependent, not predefined time zone like this?
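Since {{Hour}} cannot be instantiated from user code, one possible workaround is a UDF built on {{java.time}}, which resolves the zone per row. Below is a minimal, Spark-free sketch of the per-row conversion such a UDF could apply; the object name, sample data and column pairing are illustrative, not Spark API:

```scala
import java.time.{Instant, ZoneId}

object RowLocalHour {
  // Mirrors the _time/_tz pairing described above: interpret an instant
  // in the zone carried by the row, then extract the local hour.
  // ZoneId.of throws for unknown zone ids, so real input would need
  // validation before this is wrapped in a UDF.
  def hourIn(epochMillis: Long, tz: String): Int =
    Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(tz)).getHour

  def main(args: Array[String]): Unit = {
    // Epoch 0 is 1970-01-01 00:00 UTC; the same instant is a different
    // local hour in each zone.
    val rows = Seq((0L, "UTC"), (0L, "Asia/Tokyo"))
    rows.foreach { case (t, tz) => println(s"$tz -> ${hourIn(t, tz)}") }
  }
}
```

The same instant yields hour 0 in UTC and hour 9 in Asia/Tokyo (UTC+9), which is the per-row behaviour the hypothetical {{hour($"_time", $"_tz")}} would need.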
was (Author: hangleton): (same text as the edited comment above)
> Support session local timezone
> ------------------------------
>
>                 Key: SPARK-18350
>                 URL: https://issues.apache.org/jira/browse/SPARK-18350
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Assignee: Takuya Ueshin
>              Labels: releasenotes
>             Fix For: 2.2.0
>
>         Attachments: sample.csv
>
>
> As of Spark 2.1, Spark SQL assumes the machine timezone for datetime
> manipulation, which is bad if users are not in the same timezones as the
> machines, or if different users have different timezones.
> We should introduce a session local timezone setting that is used for
> execution.
> An explicit non-goal is locale handling.
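For reference, the session local timezone this ticket delivered in Spark 2.2.0 is exposed through the {{spark.sql.session.timeZone}} configuration. A minimal usage sketch (the {{spark}} session variable is the usual shell/application handle; note this sets one zone per session, not per row):

{code:java}
// Datetime functions such as hour() then interpret timestamps in this
// zone instead of the JVM's machine timezone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
{code}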