Repository: spark
Updated Branches:
  refs/heads/master a8031183a -> d96d7b557


[DOC] [SQL] Adds Hive metastore Parquet table conversion section

This PR adds a section about Hive metastore Parquet table conversion. It 
documents:

1. Schema reconciliation rules introduced in #5214 (see [this comment] [1] in 
#5188)
2. Metadata refreshing requirement introduced in #5339

[1]: https://github.com/apache/spark/pull/5188#issuecomment-86531248

Author: Cheng Lian <l...@databricks.com>

Closes #5348 from liancheng/sql-doc-parquet-conversion and squashes the 
following commits:

42ae0d0 [Cheng Lian] Adds Python `refreshTable` snippet
4c9847d [Cheng Lian] Resorts to SQL for Python metadata refreshing snippet
756e660 [Cheng Lian] Adds Python snippet for metadata refreshing
50675db [Cheng Lian] Addes Hive metastore Parquet table conversion section


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d96d7b55
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d96d7b55
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d96d7b55

Branch: refs/heads/master
Commit: d96d7b55746cf034e3935ec4b22614a99e48c498
Parents: a803118
Author: Cheng Lian <l...@databricks.com>
Authored: Tue Jun 23 14:19:21 2015 -0700
Committer: Cheng Lian <l...@databricks.com>
Committed: Tue Jun 23 14:19:21 2015 -0700

----------------------------------------------------------------------
 docs/sql-programming-guide.md | 94 +++++++++++++++++++++++++++++++++++---
 1 file changed, 88 insertions(+), 6 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/d96d7b55/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 26c036f..9107c9b 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -22,7 +22,7 @@ The DataFrame API is available in 
[Scala](api/scala/index.html#org.apache.spark.
 All of the examples on this page use sample data included in the Spark 
distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` 
shell.
 
 
-## Starting Point: `SQLContext`
+## Starting Point: SQLContext
 
 <div class="codetabs">
 <div data-lang="scala"  markdown="1">
@@ -1036,6 +1036,15 @@ for (teenName in collect(teenNames)) {
 
 </div>
 
+<div data-lang="python"  markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.sql("REFRESH TABLE my_table")
+{% endhighlight %}
+
+</div>
+
 <div data-lang="sql"  markdown="1">
 
 {% highlight sql %}
@@ -1054,7 +1063,7 @@ SELECT * FROM parquetTable
 
 </div>
 
-### Partition discovery
+### Partition Discovery
 
 Table partitioning is a common optimization approach used in systems like 
Hive.  In a partitioned
 table, data are usually stored in different directories, with partitioning 
column values encoded in
@@ -1108,7 +1117,7 @@ can be configured by 
`spark.sql.sources.partitionColumnTypeInference.enabled`, w
 `true`. When type inference is disabled, string type will be used for the 
partitioning columns.
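
For illustration, a minimal Scala sketch of reading such a partitioned layout; the table path and the `gender`/`country` partition columns are hypothetical:

{% highlight scala %}
// Hypothetical partitioned layout on disk:
//   path/to/table/gender=male/country=US/data.parquet
//   path/to/table/gender=female/country=CN/data.parquet

// sqlContext is an existing SQLContext
val df = sqlContext.read.parquet("path/to/table")

// gender and country are discovered from the directory names and appear in
// the schema alongside the columns stored inside the Parquet files
df.printSchema()

// Optionally disable type inference so partitioning columns stay strings
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
{% endhighlight %}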
 
 
-### Schema merging
+### Schema Merging
 
 Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. 
 Users can start with
 a simple schema, and gradually add more columns to the schema as needed.  In 
this way, users may end
@@ -1208,6 +1217,79 @@ printSchema(df3)
 
 </div>
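
A minimal Scala sketch of the schema evolution described above, assuming schema merging is enabled when the files are read back; the paths and column names are illustrative:

{% highlight scala %}
// sqlContext is an existing SQLContext
import sqlContext.implicits._

// Write a simple DataFrame into one partition directory
val df1 = sc.parallelize(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

// Write a second DataFrame with a different but compatible schema
// into another partition directory
val df2 = sc.parallelize(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// With schema merging enabled, reading the whole table reconciles the two
// file schemas: the result contains single, double, triple and the
// partitioning column key
val df3 = sqlContext.read.parquet("data/test_table")
df3.printSchema()
{% endhighlight %}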
 
+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will 
try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is 
controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by 
default.
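
A minimal Scala sketch of toggling this flag, assuming an existing `HiveContext` named `sqlContext`:

{% highlight scala %}
// sqlContext is an existing HiveContext
// Fall back to the Hive SerDe for metastore Parquet tables
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

// Re-enable Spark SQL's built-in Parquet support (the default)
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
{% endhighlight %}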
+
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of 
table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is 
significant
+
+For this reason, we must reconcile the Hive metastore schema with the Parquet 
schema when converting a
+Hive metastore Parquet table to a Spark SQL Parquet table.  The reconciliation 
rules are:
+
+1. Fields that have the same name in both schemas must have the same data type 
regardless of
+   nullability.  The reconciled field should have the data type of the Parquet 
side, so that
+   nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive 
metastore schema.
+
+   - Any fields that only appear in the Parquet schema are dropped in the 
reconciled schema.
+   - Any fields that only appear in the Hive metastore schema are added as 
nullable fields in the
+     reconciled schema.
+
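
To make the rules above concrete, a hypothetical illustration in Scala; both schemas are invented for this example:

{% highlight scala %}
import org.apache.spark.sql.types._

// Hypothetical Hive metastore schema: every column is considered nullable
val metastoreSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

// Hypothetical Parquet file schema: nullability is significant, and it carries
// an extra field that is not declared in the metastore
val parquetSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("extra", StringType, nullable = true)))

// Applying the rules above, the reconciled schema contains exactly the
// metastore fields id and name, each taking the data type (and nullability)
// of the Parquet side; the Parquet-only field extra is dropped
{% endhighlight %}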
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance.  When Hive metastore 
Parquet table
+conversion is enabled, metadata of those converted tables is also cached.  If 
these tables are
+updated by Hive or other external tools, you need to refresh them manually to 
ensure consistent
+metadata.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+
+{% highlight scala %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java"  markdown="1">
+
+{% highlight java %}
+// sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python"  markdown="1">
+
+{% highlight python %}
+# sqlContext is an existing HiveContext
+sqlContext.refreshTable("my_table")
+{% endhighlight %}
+
+</div>
+
+<div data-lang="sql"  markdown="1">
+
+{% highlight sql %}
+REFRESH TABLE my_table;
+{% endhighlight %}
+
+</div>
+
+</div>
+
 ### Configuration
 
 Configuration of Parquet can be done using the `setConf` method on 
`SQLContext` or by running
@@ -1445,8 +1527,8 @@ This command builds a new assembly jar that includes 
Hive. Note that this Hive a
 on all of the worker nodes, as they will need access to the Hive serialization 
and deserialization libraries
 (SerDes) in order to access data stored in Hive.
 
-Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. 
Please note when running 
-the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars 
under the `lib_managed/jars` directory 
+Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`. 
Please note when running
+the query on a YARN cluster (`yarn-cluster` mode), the `datanucleus` jars 
under the `lib_managed/jars` directory
 and `hive-site.xml` under `conf/` directory need to be available on the driver 
and all executors launched by the
 YARN cluster. The convenient way to do this is adding them through the 
`--jars` option and `--files` option of the
 `spark-submit` command.
@@ -1889,7 +1971,7 @@ options.
 #### DataFrame data reader/writer interface
 
 Based on user feedback, we created a new, more fluid API for reading data in 
(`SQLContext.read`)
-and writing data out (`DataFrame.write`), 
+and writing data out (`DataFrame.write`),
 and deprecated the old APIs (e.g. `SQLContext.parquetFile`, 
`SQLContext.jsonFile`).
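
A brief Scala sketch contrasting the deprecated and the new style; the file paths are illustrative:

{% highlight scala %}
// Old, now-deprecated API
val usersOld = sqlContext.parquetFile("examples/src/main/resources/users.parquet")

// New reader/writer interface
val users = sqlContext.read.parquet("examples/src/main/resources/users.parquet")
users.select("name").write.json("names.json")
{% endhighlight %}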
 
 See the API docs for `SQLContext.read` (

