[GitHub] spark pull request #21186: [SPARK-22279][SPARK-24112] Enable `convertMetasto...

2018-04-27 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/21186

[SPARK-22279][SPARK-24112] Enable `convertMetastoreOrc` and add 
`convertMetastoreTableProperty` conf

## What changes were proposed in this pull request?

We reverted `spark.sql.hive.convertMetastoreOrc` in 
https://github.com/apache/spark/pull/20536 because the conversion ignored 
table-specific compression settings, which it should not do. That issue is now 
resolved by 
[SPARK-23355](https://github.com/apache/spark/commit/8aa1d7b0ede5115297541d29eab4ce5f4fe905cb).

This PR aims to enable `convertMetastoreOrc` by default again, matching the 
existing Parquet behavior. To provide full backward compatibility, it also 
introduces an additional configuration, 
`spark.sql.hive.convertMetastoreTableProperty`, which restores the previous 
behavior of ignoring table properties.
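
For reference, a minimal sketch of how a user could opt back into the old 
behavior (assuming a Spark 2.4 `spark-shell`/SparkSession with Hive support; 
`spark.sql.hive.convertMetastoreTableProperty` is the conf name proposed in 
this PR):

```scala
// Minimal sketch, assuming a Spark 2.4 spark-shell/SparkSession with Hive support.

// Read ORC Hive metastore tables through Hive SerDe again, as Spark 2.3 did:
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

// Keep the Parquet/ORC conversion, but ignore table properties such as
// `parquet.compression` and `orc.compress`, as earlier releases did
// (conf name proposed in this PR):
spark.conf.set("spark.sql.hive.convertMetastoreTableProperty", "false")
```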

## How was this patch tested?

Pass the Jenkins.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-24112

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21186.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21186


commit 5383299738877b76c46d603635520e77dad52fd9
Author: Dongjoon Hyun 
Date:   2018-04-27T18:10:55Z

[SPARK-22279][SPARK-24112] Enable `convertMetastoreOrc` and add 
`convertMetastoreTableProperty` conf




---




[GitHub] spark pull request #21186: [SPARK-22279][SPARK-24112] Enable `convertMetasto...

2018-05-03 Thread cloud-fan
GitHub user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21186#discussion_r185980350
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with a nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` 
to `true` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, regardless of the order of the input arguments. In prior Spark 
versions, the promotion could fail for some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` 
respect the timezone in the input timestamp string, which breaks the assumption 
that the input timestamp is in a specific timezone. Therefore, these 2 functions 
can return unexpected results. In version 2.4 and later, this problem has been 
fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the 
input timestamp string contains a timezone. As an example, 
`from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behavior to keep their queries unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a temporary workaround.
+  - Since Spark 2.4, Spark uses its own ORC support by default instead of 
Hive SerDe for better performance during Hive metastore table access. Setting 
`spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
+  - Since Spark 2.4, Spark supports table properties while converting 
Parquet/ORC Hive tables. Setting `spark.sql.hive.convertMetastoreTableProperty` 
to `false` restores the previous behavior.
--- End diff --

please polish the migration guide w.r.t. 
https://issues.apache.org/jira/browse/SPARK-24175


---




[GitHub] spark pull request #21186: [SPARK-22279][SPARK-24112] Enable `convertMetasto...

2018-05-03 Thread dongjoon-hyun
GitHub user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21186#discussion_r185988150
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with a nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` 
to `true` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, regardless of the order of the input arguments. In prior Spark 
versions, the promotion could fail for some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` 
respect the timezone in the input timestamp string, which breaks the assumption 
that the input timestamp is in a specific timezone. Therefore, these 2 functions 
can return unexpected results. In version 2.4 and later, this problem has been 
fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the 
input timestamp string contains a timezone. As an example, 
`from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behavior to keep their queries unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a temporary workaround.
+  - Since Spark 2.4, Spark uses its own ORC support by default instead of 
Hive SerDe for better performance during Hive metastore table access. Setting 
`spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
+  - Since Spark 2.4, Spark supports table properties while converting 
Parquet/ORC Hive tables. Setting `spark.sql.hive.convertMetastoreTableProperty` 
to `false` restores the previous behavior.
--- End diff --

Sure!


---




[GitHub] spark pull request #21186: [SPARK-22279][SPARK-24112] Enable `convertMetasto...

2018-05-04 Thread dongjoon-hyun
GitHub user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21186#discussion_r186193847
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1812,6 +1812,9 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, creating a managed table with a nonempty location is 
not allowed. An exception is thrown when attempting to create a managed table 
with a nonempty location. Setting `spark.sql.allowCreatingManagedTableUsingNonemptyLocation` 
to `true` restores the previous behavior. This option will be removed in Spark 3.0.
   - Since Spark 2.4, the type coercion rules can automatically promote the 
argument types of the variadic SQL functions (e.g., IN/COALESCE) to the widest 
common type, regardless of the order of the input arguments. In prior Spark 
versions, the promotion could fail for some specific orders (e.g., TimestampType, 
IntegerType and StringType) and throw an exception.
   - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` 
respect the timezone in the input timestamp string, which breaks the assumption 
that the input timestamp is in a specific timezone. Therefore, these 2 functions 
can return unexpected results. In version 2.4 and later, this problem has been 
fixed. `to_utc_timestamp` and `from_utc_timestamp` will return null if the 
input timestamp string contains a timezone. As an example, 
`from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` will return 
`2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, 
`from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local 
timezone of GMT+8, will return `2000-10-10 09:00:00` in Spark 2.3 but `null` in 
2.4. For people who don't care about this problem and want to retain the 
previous behavior to keep their queries unchanged, you can set 
`spark.sql.function.rejectTimezoneInString` to false. This option will be 
removed in Spark 3.0 and should only be used as a temporary workaround.
+  - In version 2.3 and earlier, Spark converts Parquet Hive tables by default 
but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. 
The same happens for ORC Hive table properties like 
`TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. 
Since Spark 2.4, Spark supports Parquet/ORC-specific table properties while 
converting Parquet/ORC Hive tables. As an example, 
`CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` 
would generate Snappy-compressed Parquet files during insertion in Spark 2.3, 
while in Spark 2.4 the result would be uncompressed Parquet files. Setting 
`spark.sql.hive.convertMetastoreTableProperty` to `false` restores the previous 
behavior.
+  - Since Spark 2.0, Spark converts Parquet Hive tables by default for better 
performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too, 
which means Spark uses its own ORC support by default instead of Hive SerDe. As 
an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive 
SerDe in Spark 2.3, while in Spark 2.4 it would be converted into Spark's ORC 
data source table and ORC vectorization would be applied. Setting 
`spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
+
--- End diff --

@cloud-fan and @gatorsmile, I updated the entries according to the guideline in 
[SPARK-24175](https://issues.apache.org/jira/browse/SPARK-24175).
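
For illustration, a hedged sketch of the Parquet example in the updated entry 
above (run in a `spark-shell` with Hive support; the table name `t` is 
illustrative, and `TBLPROPERTIES` is written in the quoted `'key'='value'` form):

```scala
// Sketch of the migration-guide example above; table name `t` is illustrative.
spark.sql(
  """CREATE TABLE t(id INT) STORED AS PARQUET
    |TBLPROPERTIES ('parquet.compression'='NONE')""".stripMargin)
spark.sql("INSERT INTO t VALUES (1)")
// Spark 2.3: the written files are Snappy-compressed Parquet (the property is ignored).
// Spark 2.4: the files are uncompressed, honoring parquet.compression = NONE.
// Setting spark.sql.hive.convertMetastoreTableProperty=false restores the 2.3 behavior.
```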


---
