[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122317
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

do we need a whole expression for this? can't we just reuse existing 
expressions? It's just simple arithmetic, isn't it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122396
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -1015,6 +1020,10 @@ object DateTimeUtils {
 guess
   }
 
+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): SQLTimestamp = {
+    convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))
--- End diff --

performance is going to suck here


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122503
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+    extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { tz =>
--- End diff --

we don't seem to be doing this type of validity check in general; otherwise 
we'd need to add a lot more checks here.



---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122657
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand(
 isView: Boolean)
   extends RunnableCommand {
 
+  if (isView) {
+    properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ =>
--- End diff --

does it even mean anything to set properties on views? we should either 
drop this check, or have a more general check.



---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143122895
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TimestampTableTimeZone.scala
 ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{StringType, TimestampType}
+
+/**
+ * Apply a correction to data loaded from, or saved to, Parquet, so that 
its timestamps can be read
+ * like TIMESTAMP WITHOUT TIMEZONE.  This gives correct behavior if you 
process data with
+ * machines in different timezones, or if you access the data from 
multiple SQL engines.
+ */
+private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession)
--- End diff --

i need to take a look at this again to make sure i understand what's 
happening. conceptually what you are doing is pretty simple and it doesn't seem 
right that it needs so many lines of code, but maybe i'm missing something.



---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143242256
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

I guess you could use ToUTCTimestamp / FromUTCTimestamp for this, but that 
would be more expensive since you'd be doing the conversion twice for each 
value.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243228
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -1015,6 +1020,10 @@ object DateTimeUtils {
 guess
   }
 
+  def convertTz(ts: SQLTimestamp, fromZone: String, toZone: String): SQLTimestamp = {
+    convertTz(ts, getTimeZone(fromZone), getTimeZone(toZone))
--- End diff --

I guess caching the value as done by the `FromUTCTimestamp` expression is 
the right way to go?
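A minimal plain-Java sketch of that caching idea (the class and method names here are made up, and the shift formula is a simplification that ignores the edge cases Spark's real `convertTz` handles around DST transitions):

```java
import java.util.TimeZone;

// Hypothetical sketch: cache the TimeZone lookup per expression instance
// so TimeZone.getTimeZone is not called again for every row.
class TzCorrection {
    private final String fromZone;
    private final String toZone;
    private transient TimeZone fromTz; // resolved once, then reused
    private transient TimeZone toTz;

    TzCorrection(String fromZone, String toZone) {
        this.fromZone = fromZone;
        this.toZone = toZone;
    }

    // ts is in microseconds; the shift makes the result read in toZone
    // the way ts reads in fromZone.
    long convert(long ts) {
        if (fromTz == null) fromTz = TimeZone.getTimeZone(fromZone);
        if (toTz == null) toTz = TimeZone.getTimeZone(toZone);
        long millis = ts / 1000L;
        long deltaMillis = fromTz.getOffset(millis) - toTz.getOffset(millis);
        return ts + deltaMillis * 1000L;
    }
}
```

The per-instance fields amortize the string-to-`TimeZone` resolution across all rows, which is the same trick `FromUTCTimestamp` uses.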


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243338
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

What I'm saying is that the analysis rule can just determine the delta, and 
then do a simple add/subtract.



---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143244400
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+    extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { tz =>
--- End diff --

The 
[spec](https://docs.google.com/document/d/1XmyVjr3eOJiNFjVeSnmjIU60Hq-XiZB03pgi3r1razM/edit#heading=h.jnyh4t7bhnlx)
 requests errors when using invalid time zones. I guess this would still fail 
with a different error in that case, so I can remove this if you're really 
against adding it.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143243769
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

Hmm, let me see if I can figure that out.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143245229
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -230,6 +230,13 @@ case class AlterTableSetPropertiesCommand(
 isView: Boolean)
   extends RunnableCommand {
 
+  if (isView) {
+    properties.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { _ =>
--- End diff --


[HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView)
 explicitly allows properties in view; I've never used them, though.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143246203
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/TimestampTableTimeZone.scala
 ---
@@ -0,0 +1,213 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.datasources
+
+import org.apache.spark.sql.{AnalysisException, SparkSession}
+import org.apache.spark.sql.catalyst.TableIdentifier
+import org.apache.spark.sql.catalyst.analysis.UnresolvedException
+import org.apache.spark.sql.catalyst.catalog.CatalogTable
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.types.{StringType, TimestampType}
+
+/**
+ * Apply a correction to data loaded from, or saved to, Parquet, so that 
its timestamps can be read
+ * like TIMESTAMP WITHOUT TIMEZONE.  This gives correct behavior if you 
process data with
+ * machines in different timezones, or if you access the data from 
multiple SQL engines.
+ */
+private[sql] case class TimestampTableTimeZone(sparkSession: SparkSession)
--- End diff --

I'm also trying to see if this can be simplified; I guess the main thing is 
Imran's comment about not being able to use `transformUp`. I need to take a 
look at whether that's really the case.

This rule also doesn't seem to handle `InsertIntoHiveTable`, which lives in 
the hive module and so can't be handled here. We'll probably need a new rule in 
the hive module based on this one.



---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread zivanfi
Github user zivanfi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143257840
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

Unfortunately the offset depends on the actual date, so a timezone 
conversion cannot be simplified to a single delta.

Daylight saving time starts and ends on different days in different 
timezones, while some timezones don't have DST changes at all. Additionally, 
timezone rules have changed over time and keep changing. Both the basic 
timezone offset and the DST rules themselves could change (and have changed) 
over time.
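This is easy to demonstrate with `java.util.TimeZone`: the UTC offset of a zone is a function of the instant, not a constant (the values below assume the standard tz database rules for America/Los_Angeles):

```java
import java.util.TimeZone;

public class OffsetDependsOnDate {
    public static void main(String[] args) {
        TimeZone la = TimeZone.getTimeZone("America/Los_Angeles");
        long jan1970 = 0L;                        // 1970-01-01T00:00:00Z, standard time
        long jul1970 = 181L * 24 * 3600 * 1000L;  // 1970-07-01T00:00:00Z, daylight time
        System.out.println(la.getOffset(jan1970)); // -28800000 (UTC-8)
        System.out.println(la.getOffset(jul1970)); // -25200000 (UTC-7)
        // Any fixed delta computed once at analysis time would be wrong
        // for roughly half the timestamps in the column.
    }
}
```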


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143272791
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ---
@@ -1213,6 +1213,71 @@ case class ToUTCTimestamp(left: Expression, right: 
Expression)
 }
 
 /**
+ * This modifies a timestamp to show how the display time changes going 
from one timezone to
+ * another, for the same instant in time.
+ *
+ * We intentionally do not provide an ExpressionDescription as this is not 
meant to be exposed to
+ * users; it's only used for internal conversions.
+ */
+private[spark] case class TimestampTimezoneCorrection(
--- End diff --

> Additionally, timezone rules have changed over time and keep changing. 

Ah, yes, that makes sense... it's also why my initial tests were failing at 
the DST boundaries. :-/


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-06 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143288711
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { 
tz =>
--- End diff --

Hmm. I tried a couple of things, and while it may be possible to remove 
some of these checks and replace them with a check in 
`DateTimeUtils.computeTimeZone`, that doesn't cover all cases. For example, you 
could use "ALTER TABLE" with an invalid time zone and that wouldn't trigger the 
check.

So given the spec I'm inclined to leave the checks as is, unless @zivanfi 
is open to making the spec more lax in that area. 
(`TimeZone.getTimeZone(invalidId)` actually returns the UTC time zone, as 
unexpected as that behavior may be, so things won't necessarily break without 
the checks.)
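That silent fallback is easy to reproduce, and one way to get the strict behavior the spec asks for is to validate the id against `TimeZone.getAvailableIDs` (the helper name here is made up for illustration; in practice Java returns the GMT zone, which is effectively UTC, for unknown ids):

```java
import java.util.Arrays;
import java.util.TimeZone;

public class TimeZoneValidation {
    // TimeZone.getTimeZone never throws for an unknown id; it silently
    // returns GMT, so a typo in the table property would go unnoticed
    // without an explicit check like this.
    static boolean isValidTimeZone(String id) {
        return Arrays.asList(TimeZone.getAvailableIDs()).contains(id);
    }

    public static void main(String[] args) {
        // Note the typo in the zone name:
        System.out.println(TimeZone.getTimeZone("America/Los_Angles").getID()); // GMT
        System.out.println(isValidTimeZone("America/Los_Angeles")); // true
        System.out.println(isValidTimeZone("America/Los_Angles"));  // false
    }
}
```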


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-10-09 Thread zivanfi
Github user zivanfi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r143462649
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala ---
@@ -266,6 +267,10 @@ final class DataFrameWriter[T] private[sql](ds: 
Dataset[T]) {
* @since 1.4.0
*/
   def insertInto(tableName: String): Unit = {
+extraOptions.get(TimestampTableTimeZone.TIMEZONE_PROPERTY).foreach { 
tz =>
--- End diff --

Although other table properties don't have similar checks, their effect is 
usually easy to see. The effect of this specific table property however is not 
immediately apparent: for new data it is only revealed in interoperability with 
other components, and for existing data it should not have any visible effect 
if set correctly. Therefore we decided it would be best to be very strict in 
checks, because otherwise a typo in the table property value could only be 
discovered after some data has already been written with irreversible errors. 
This was the reasoning behind this part of the spec.


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-12-20 Thread squito
Github user squito closed the pull request at:

https://github.com/apache/spark/pull/19250


---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-09-15 Thread squito
GitHub user squito opened a pull request:

https://github.com/apache/spark/pull/19250

[SPARK-12297] Table timezone correction for Timestamps

## What changes were proposed in this pull request?

When reading and writing data, Spark will adjust timestamp data based on 
the delta between the current session time zone and the table time zone 
(specified either by a persistent table property, or an option to the 
DataFrameReader / Writer).  This is particularly important for Parquet data, so 
that it can be treated equivalently by other SQL engines (e.g. Impala and Hive). 
 Furthermore, this is useful if the same data is processed by multiple clusters 
in different time zones, and "timestamp without time zone" semantics are 
desired.
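The intended round trip can be sketched in plain Java (an illustration, not the patch's code; the shift formula ignores corner cases near DST transitions): the writer shifts from its session zone to the table zone, a reader shifts back to its own session zone, and both sessions then render the same wall-clock time.

```java
import java.util.TimeZone;

public class TableTimeZoneRoundTrip {
    // Shift ts (millis) so it reads in `to` the way it reads in `from`.
    static long shift(long ts, TimeZone from, TimeZone to) {
        return ts + from.getOffset(ts) - to.getOffset(ts);
    }

    public static void main(String[] args) {
        TimeZone tableTz  = TimeZone.getTimeZone("UTC");
        TimeZone writerTz = TimeZone.getTimeZone("America/Los_Angeles");
        TimeZone readerTz = TimeZone.getTimeZone("Asia/Tokyo");

        long original = 1500000000000L;                     // writer's timestamp
        long stored   = shift(original, writerTz, tableTz); // write path
        long readBack = shift(stored, tableTz, readerTz);   // read path

        // Both sessions see the same wall-clock reading:
        long writerWallClock = original + writerTz.getOffset(original);
        long readerWallClock = readBack + readerTz.getOffset(readBack);
        System.out.println(writerWallClock == readerWallClock); // true
    }
}
```

Without the table-zone pivot, a reader in a different session zone would render a different wall-clock time than the writer saw.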

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/squito/spark timestamp_all_formats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19250.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19250


commit 54f87c2c1e0ab0645fa5497553cf031f13e98c3b
Author: Imran Rashid 
Date:   2017-08-28T19:52:15Z

SPARK-12297.  Table timezones.

commit 53b9fbe0c6128ec11afdb46d3239c693129f6952
Author: Imran Rashid 
Date:   2017-09-14T20:18:46Z

All dataformats support timezone correction.  Move rules & tests to a
more appropriate location.  Ensure rule works without hive support.
Extra checks on when table timezones are set.




---




[GitHub] spark pull request #19250: [SPARK-12297] Table timezone correction for Times...

2017-09-15 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19250#discussion_r139234142
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -92,7 +92,7 @@ case class CreateHiveTableAsSelectCommand(
   }
 
   override def argString: String = {
-s"[Database:${tableDesc.database}}, " +
+s"[Database:${tableDesc.database}, " +
--- End diff --

totally unrelated typo fix, but didn't seem worth an independent pr


---
