spark git commit: [SPARK-23456][SPARK-21783] Turn on `native` ORC impl and PPD by default

lixiao Tue, 20 Feb 2018 09:15:32 -0800

Repository: spark
Updated Branches:
  refs/heads/master 189f56f3d -> 83c008762



[SPARK-23456][SPARK-21783] Turn on `native` ORC impl and PPD by default

## What changes were proposed in this pull request?

Apache Spark 2.3 introduced `native` ORC support with vectorization and many
fixes. However, it shipped as a non-default option. This PR enables the `native`
ORC implementation and predicate pushdown (PPD) by default for Apache Spark 2.4.
We will improve and stabilize the ORC data source before Apache Spark 2.4 is
released, and eventually Apache Spark will drop the old Hive-based ORC code.
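
For readers who need the pre-2.4 behavior, the two configuration keys changed by this commit can be set explicitly. The following is a minimal sketch and is not part of the patch; the object name, method name, application name, and local master are hypothetical and chosen only for illustration, and it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, used only to illustrate the two settings.
object OrcDefaultsSketch {

  // Builds a session that restores the defaults used before Spark 2.4:
  // the Hive-based ORC implementation with filter pushdown disabled.
  def legacyOrcSession(): SparkSession =
    SparkSession.builder()
      .appName("orc-legacy-defaults") // illustrative name
      .master("local[*]")             // illustrative master
      .config("spark.sql.orc.impl", "hive")
      .config("spark.sql.orc.filterPushdown", "false")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = legacyOrcSession()

    // The same keys can also be changed on an existing session,
    // here switching back to the values this commit makes the default.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.filterPushdown", "true")

    println(spark.conf.get("spark.sql.orc.impl"))
    spark.stop()
  }
}
```

Setting the keys per session keeps the choice local to one job, which may be preferable to a cluster-wide override while the `native` reader is being evaluated.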

## How was this patch tested?

Passes Jenkins with the existing tests.

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #20634 from dongjoon-hyun/SPARK-23456.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/83c00876
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/83c00876
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/83c00876

Branch: refs/heads/master
Commit: 83c008762af444eef73d835eb6f506ecf5aebc17
Parents: 189f56f
Author: Dongjoon Hyun <dongj...@apache.org>
Authored: Tue Feb 20 09:14:56 2018 -0800
Committer: gatorsmile <gatorsm...@gmail.com>
Committed: Tue Feb 20 09:14:56 2018 -0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md                                  | 6 +++++-
 .../src/main/scala/org/apache/spark/sql/internal/SQLConf.scala | 6 +++---
 2 files changed, 8 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/83c00876/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 91e4367..c37c338 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1018,7 +1018,7 @@ the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also
   <tr>
     <td><code>spark.sql.orc.impl</code></td>
     <td><code>hive</code></td>
-    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1.</td>
+    <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
   </tr>
   <tr>
     <td><code>spark.sql.orc.enableVectorizedReader</code></td>
@@ -1797,6 +1797,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
 
 # Migration Guide
 
+## Upgrading From Spark SQL 2.3 to 2.4
+
+  - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
+
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.

http://git-wip-us.apache.org/repos/asf/spark/blob/83c00876/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
index e75e1d6..ce3f946 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
@@ -399,11 +399,11 @@ object SQLConf {
 
   val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
     .doc("When native, use the native version of ORC support instead of the ORC library in Hive " +
-      "1.2.1. It is 'hive' by default.")
+      "1.2.1. It is 'hive' by default prior to Spark 2.4.")
     .internal()
     .stringConf
     .checkValues(Set("hive", "native"))
-    .createWithDefault("hive")
+    .createWithDefault("native")
 
   val ORC_VECTORIZED_READER_ENABLED = buildConf("spark.sql.orc.enableVectorizedReader")
     .doc("Enables vectorized orc decoding.")
@@ -426,7 +426,7 @@ object SQLConf {
   val ORC_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.orc.filterPushdown")
     .doc("When true, enable filter pushdown for ORC files.")
     .booleanConf
-    .createWithDefault(false)
+    .createWithDefault(true)
 
   val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath")
     .doc("When true, check all the partition paths under the table\'s root directory " +


