[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-05-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21169


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-05-02 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r185528731
  
--- Diff: sql/core/src/test/resources/sql-tests/results/datetime.sql.out ---
@@ -82,9 +82,138 @@ struct
 1  2
 2  3
 
+
 -- !query 9
 select weekday('2007-02-03'), weekday('2009-07-30'), 
weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15')
--- !query 3 schema
+-- !query 9 schema
 struct
--- !query 3 output
+-- !query 9 output
 5  3   5   NULL4
+
+
+-- !query 10
+select from_utc_timestamp('2015-07-24 00:00:00', 'PST')
+-- !query 10 schema
+struct
+-- !query 10 output
+2015-07-23 17:00:00
+
+
+-- !query 11
+select from_utc_timestamp('2015-01-24 00:00:00', 'PST')
+-- !query 11 schema
+struct
+-- !query 11 output
+2015-01-23 16:00:00
+
+
+-- !query 12
+select from_utc_timestamp(null, 'PST')
+-- !query 12 schema
+struct
+-- !query 12 output
+NULL
+
+
+-- !query 13
+select from_utc_timestamp('2015-07-24 00:00:00', null)
+-- !query 13 schema
+struct
+-- !query 13 output
+NULL
+
+
+-- !query 14
+select from_utc_timestamp(null, null)
+-- !query 14 schema
+struct
+-- !query 14 output
+NULL
+
+
+-- !query 15
+select from_utc_timestamp(cast(0 as timestamp), 'PST')
+-- !query 15 schema
+struct
+-- !query 15 output
+1969-12-31 08:00:00
--- End diff --

I think it's doable, but we should make `from_utc_timestamp` accept integral types directly.
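
For illustration only (direct integral support is hypothetical and not part of this PR; today the value has to go through an explicit cast):

    // Current workaround: cast the epoch value to timestamp first.
    spark.sql("select from_utc_timestamp(cast(0 as timestamp), 'PST')").show()
    // If integral inputs were accepted directly, the same intent could read:
    // spark.sql("select from_utc_timestamp(0, 'PST')").show()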


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-30 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r185057781
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1208,6 +1208,13 @@ object SQLConf {
   .stringConf
   .createWithDefault("")
 
+  val REJECT_TIMEZONE_IN_STRING = 
buildConf("spark.sql.function.rejectTimezoneInString")
+.internal()
+.doc("If true, `to_utc_timestamp` and `from_utc_timestamp` return null 
if the input string " +
+  "contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
+.booleanConf
+.createWithDefault(true)
+
--- End diff --

> existing workloads may depend on the previous behavior and think that's corrected

Then I think we need test cases (maybe in DateFunctionsSuite, since those tests would also exercise the type coercion) where `REJECT_TIMEZONE_IN_STRING` is false.
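
A minimal sketch of what such a test could look like (assuming it lives in DateFunctionsSuite, so `sql` and `withSQLConf` are available, and using the conf key proposed in this PR; the expectation is illustrative):

    test("from_utc_timestamp with timezone-qualified string when rejection is off") {
      withSQLConf("spark.sql.function.rejectTimezoneInString" -> "false") {
        // Legacy behavior: the embedded timezone is still honored, so the result is a
        // shifted timestamp rather than null. The exact value depends on the session
        // timezone, so only assert non-null here.
        val row = sql("select from_utc_timestamp('2000-10-10 00:00:00+00:00', 'PST')").head
        assert(!row.isNullAt(0))
      }
    }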


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-30 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r185067161
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema.
+  - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type 
after promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
+  - Since Spark 2.4, creating a managed table with nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
+  - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - In version 2.3 and earlier, `to_utc_timestamp` and 
`from_utc_timestamp` respect the timezone in the input timestamp string, which 
breaks the assumption that the input timestamp is in a specific timezone, and 
returns weird 

[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184852914
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -296,10 +296,28 @@ object DateTimeUtils {
* `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m`
*/
   def stringToTimestamp(s: UTF8String): Option[SQLTimestamp] = {
-stringToTimestamp(s, defaultTimeZone())
+stringToTimestamp(s, defaultTimeZone(), rejectTzInString = false)
   }
 
   def stringToTimestamp(s: UTF8String, timeZone: TimeZone): 
Option[SQLTimestamp] = {
+stringToTimestamp(s, timeZone, rejectTzInString = false)
+  }
+
+  /**
+   * Converts a timestamp string to microseconds from the unix epoch, 
w.r.t. the given timezone.
--- End diff --

BTW, I usually avoid abbreviations in docs, though (w.r.t.).


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-27 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184731765
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1208,6 +1208,13 @@ object SQLConf {
   .stringConf
   .createWithDefault("")
 
+  val REJECT_TIMEZONE_IN_STRING = 
buildConf("spark.sql.function.rejectTimezoneInString")
+.internal()
+.doc("If true, `to_utc_timestamp` and `from_utc_timestamp` return null 
if the input string " +
+  "contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
+.booleanConf
+.createWithDefault(true)
+
--- End diff --

Existing workloads may depend on the previous behavior and assume it is correct. It's safer to provide an internal conf and tell users about it when they complain about the behavior change. It's an internal conf and is invisible to end users.
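
For reference, restoring the old behavior would then only require flipping the internal conf in the affected session (key as proposed in this PR):

    // Opt back into the pre-2.4 handling of timezone-qualified input strings.
    spark.conf.set("spark.sql.function.rejectTimezoneInString", "false")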


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-27 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184687634
  
--- Diff: sql/core/src/test/resources/sql-tests/results/datetime.sql.out ---
@@ -82,9 +82,138 @@ struct
 1  2
 2  3
 
+
 -- !query 9
 select weekday('2007-02-03'), weekday('2009-07-30'), 
weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15')
--- !query 3 schema
+-- !query 9 schema
 struct
--- !query 3 output
+-- !query 9 output
 5  3   5   NULL4
+
+
+-- !query 10
+select from_utc_timestamp('2015-07-24 00:00:00', 'PST')
+-- !query 10 schema
+struct
+-- !query 10 output
+2015-07-23 17:00:00
+
+
+-- !query 11
+select from_utc_timestamp('2015-01-24 00:00:00', 'PST')
+-- !query 11 schema
+struct
+-- !query 11 output
+2015-01-23 16:00:00
+
+
+-- !query 12
+select from_utc_timestamp(null, 'PST')
+-- !query 12 schema
+struct
+-- !query 12 output
+NULL
+
+
+-- !query 13
+select from_utc_timestamp('2015-07-24 00:00:00', null)
+-- !query 13 schema
+struct
+-- !query 13 output
+NULL
+
+
+-- !query 14
+select from_utc_timestamp(null, null)
+-- !query 14 schema
+struct
+-- !query 14 output
+NULL
+
+
+-- !query 15
+select from_utc_timestamp(cast(0 as timestamp), 'PST')
+-- !query 15 schema
+struct
+-- !query 15 output
+1969-12-31 08:00:00
--- End diff --

Since we're adding new SQLConf settings anyway, we could have a "timestamp.hive.compatibility" setting (or something like that) that is true by default and allows `select from_utc_timestamp(cast(0 as timestamp), 'PST')` to continue producing the above answer. However, when false, it would treat 0 as 1970-01-01T00:00:00 UTC, so the above would instead produce '1969-12-31 16:00:00' (which we both agree in the Jira is probably the more correct answer). Just a thought.
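
A sketch of what such a conf could look like, mirroring the `REJECT_TIMEZONE_IN_STRING` definition quoted above (the name and wording are purely illustrative, not part of this PR):

    val TIMESTAMP_HIVE_COMPATIBILITY =
      buildConf("spark.sql.function.timestampHiveCompatibility")
        .internal()
        .doc("If true, `from_utc_timestamp(cast(0 as timestamp), tz)` keeps the current " +
          "Hive-compatible answer; if false, 0 is treated as 1970-01-01T00:00:00 UTC.")
        .booleanConf
        .createWithDefault(true)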


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-27 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184685378
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1208,6 +1208,13 @@ object SQLConf {
   .stringConf
   .createWithDefault("")
 
+  val REJECT_TIMEZONE_IN_STRING = 
buildConf("spark.sql.function.rejectTimezoneInString")
+.internal()
+.doc("If true, `to_utc_timestamp` and `from_utc_timestamp` return null 
if the input string " +
+  "contains a timezone part, e.g. `2000-10-10 00:00:00+00:00`.")
+.booleanConf
+.createWithDefault(true)
+
--- End diff --

Why would we need this? Currently, if a user passes '2000-10-10 00:00:00+00:00' to `to/from_utc_timestamp`, they get the wrong answer. If they switch off this setting, they will continue to get the wrong answer rather than null. Are we accommodating the users who experienced this bug and are manually shifting the result?


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-27 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184608524
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema.
--- End diff --

Let us create a dedicated ticket and let the community improve them?


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184597402
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema.
--- End diff --

Something like `new StructType("empty", new StructType())`: the table has a column, and the column is a struct type with 0 fields. This schema is invalid to write out.

Anyway this is an existing comment and I just fixed its indentation.
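
For a concrete sketch of such a schema and the write that is now rejected (illustrative only):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // One column named "empty" whose type is a struct with zero fields.
    val schema = StructType(Seq(StructField("empty", StructType(Nil))))
    val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

    // Since Spark 2.4, writing this out throws an exception for any file format.
    // df.write.parquet("/tmp/nested-empty")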


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184596334
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
+  - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
+  - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema.
--- End diff --

what's a nested empty schema?



---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184590539
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1805,12 +1805,13 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
 
   - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader 
for ORC files by default. To do that, `spark.sql.orc.impl` and 
`spark.sql.orc.filterPushdown` change their default values to `native` and 
`true` respectively.
   - In PySpark, when Arrow optimization is enabled, previously `toPandas` 
just failed when Arrow optimization is unable to be used whereas 
`createDataFrame` from Pandas DataFrame allowed the fallback to 
non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas 
DataFrame allow the fallback by default, which can be switched off by 
`spark.sql.execution.arrow.fallback.enabled`.
- - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
- - Since Spark 2.4, expression IDs in UDF arguments do not appear in 
column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS 
colA#28)` but ``UDF:f(col0 AS `colA`)``.
- - Since Spark 2.4, writing a dataframe with an empty or nested empty 
schema using any file formats (parquet, orc, json, text, csv etc.) is not 
allowed. An exception is thrown when attempting to write dataframes with empty 
schema. 
- - Since Spark 2.4, Spark compares a DATE type with a TIMESTAMP type after 
promotes both sides to TIMESTAMP. To set `false` to 
`spark.sql.hive.compareDateTimestampInTimestamp` restores the previous 
behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, creating a managed table with nonempty location is not 
allowed. An exception is thrown when attempting to create a managed table with 
nonempty location. To set `true` to 
`spark.sql.allowCreatingManagedTableUsingNonemptyLocation` restores the 
previous behavior. This option will be removed in Spark 3.0.
- - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, no matter how the input arguments order. In prior Spark versions, 
the promotion could fail in some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
+  - Since Spark 2.4, writing an empty dataframe to a directory launches at 
least one write task, even if physically the dataframe has no partition. This 
introduces a small behavior change that for self-describing file formats like 
Parquet and Orc, Spark creates a metadata-only file in the target directory 
when writing a 0-partition dataframe, so that schema inference can still work 
if users read that directory later. The new behavior is more reasonable and 
more consistent regarding writing empty dataframe.
--- End diff --

fix indentation


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184589060
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -782,6 +782,22 @@ object TypeCoercion {
   // Skip nodes who's children have not been resolved yet.
   case e if !e.childrenResolved => e
 
+  // Special rules for `to/from_utc_timestamp`. 
`to/from_utc_timestamp` assumes its input is
+  // in UTC timezone, and if input is string, it should not contain 
timezone.
+  // TODO: We should move the type coercion logic to expressions 
instead of a central
+  // place to put all the rules.
+  case e: FromUTCTimestamp if e.left.dataType == StringType =>
--- End diff --

Catalyst assumes rules in the same batch are order-insensitive. Since this rule must run before the implicit type cast, we'd better put them in the same rule to guarantee the order.


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184569738
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -782,6 +782,22 @@ object TypeCoercion {
   // Skip nodes who's children have not been resolved yet.
   case e if !e.childrenResolved => e
 
+  // Special rules for `to/from_utc_timestamp`. 
`to/from_utc_timestamp` assumes its input is
+  // in UTC timezone, and if input is string, it should not contain 
timezone.
+  // TODO: We should move the type coercion logic to expressions 
instead of a central
+  // place to put all the rules.
+  case e: FromUTCTimestamp if e.left.dataType == StringType =>
--- End diff --

Should these checks go in their own rule that runs before ImplicitTypeCasts?


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184565093
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/datetime.sql ---
@@ -27,3 +27,8 @@ select current_date = current_date(), current_timestamp = 
current_timestamp(), a
 select a, b from ttf2 order by a, current_date;
 
 select weekday('2007-02-03'), weekday('2009-07-30'), 
weekday('2017-05-27'), weekday(null), weekday('1582-10-15 13:10:15');
+
--- End diff --

Does it matter if there are no success cases for from_utc_timestamp and 
to_utc_timestamp in here (that is, cases that don't return null)?


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread bersprockets
Github user bersprockets commented on a diff in the pull request:

https://github.com/apache/spark/pull/21169#discussion_r184561518
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -296,10 +296,27 @@ object DateTimeUtils {
* `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us]+[h]h:[m]m`
*/
   def stringToTimestamp(s: UTF8String): Option[SQLTimestamp] = {
-stringToTimestamp(s, defaultTimeZone())
+stringToTimestamp(s, defaultTimeZone(), forceTimezone = false)
   }
 
   def stringToTimestamp(s: UTF8String, timeZone: TimeZone): 
Option[SQLTimestamp] = {
+stringToTimestamp(s, timeZone, forceTimezone = false)
+  }
+
+  /**
+   * Converts a timestamp string to microseconds from the unix epoch, 
w.r.t. the given timezone.
+   * Returns None if the input string is not a valid timestamp format.
+   *
+   * @param s the input timestamp string.
+   * @param timeZone the timezone of the timestamp string, will be ignored 
if the timestamp string
+   * already contains timezone information and 
`forceTimezone` is false.
+   * @param forceTimezone if true, force to apply the given timezone to 
the timestamp string. If the
+   *  timestamp string already contains timezone, 
return None.
+   */
+  def stringToTimestamp(
+  s: UTF8String,
+  timeZone: TimeZone,
+  forceTimezone: Boolean): Option[SQLTimestamp] = {
--- End diff --

It seems more like `rejectTzInString` or `rejectStringTZ`.
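
For context, a quick sketch of how the new overload behaves under the suggested name (semantics per the doc comment above; return values are `Option[SQLTimestamp]`, i.e. microseconds since the epoch):

    import java.util.TimeZone
    import org.apache.spark.sql.catalyst.util.DateTimeUtils
    import org.apache.spark.unsafe.types.UTF8String

    val utc = TimeZone.getTimeZone("UTC")

    // No timezone in the string: parsed with respect to the given timezone, as before.
    DateTimeUtils.stringToTimestamp(
      UTF8String.fromString("2000-10-10 00:00:00"), utc, rejectTzInString = true)        // Some(...)

    // Timezone present in the string: rejected, i.e. None.
    DateTimeUtils.stringToTimestamp(
      UTF8String.fromString("2000-10-10 00:00:00+00:00"), utc, rejectTzInString = true)  // None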


---




[GitHub] spark pull request #21169: [SPARK-23715][SQL] the input of to/from_utc_times...

2018-04-26 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/21169

[SPARK-23715][SQL] the input of to/from_utc_timestamp can not have timezone

## What changes were proposed in this pull request?

`from_utc_timestamp` assumes its input is in the UTC timezone and shifts it to the specified timezone. When the timestamp contains a timezone (e.g. `2018-03-13T06:18:23+00:00`), Spark breaks that semantic and respects the timezone in the string. This is not what users expect, and the result differs from Hive/Scala. `to_utc_timestamp` has the same problem.

This PR fixes this by returning null if the input timestamp string contains a timezone.

TODO: add a config
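
To make the behavior change concrete (values follow the description above and the Jira, not a run of this patch):

    // Before: the embedded offset was honored, silently breaking the
    // "input is in UTC" assumption of from_utc_timestamp.
    spark.sql("select from_utc_timestamp('2018-03-13T06:18:23+00:00', 'PST')")

    // With this PR: timezone-qualified input strings return NULL, while strings
    // without a timezone part keep working as before.
    spark.sql("select from_utc_timestamp('2018-03-13 06:18:23', 'PST')")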

## How was this patch tested?

new tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark from_utc_timezone

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21169


commit 7c1dcc3f3c144fe2aa1296c84840ff27a5a250e1
Author: Wenchen Fan 
Date:   2018-04-26T16:01:38Z

SPARK-23715: the input of to/from_utc_timestamp can not have timezone




---
