spark git commit: [SPARK-19016][SQL][DOC] Document scalable partition handling

lian Fri, 30 Dec 2016 14:51:30 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-2.1 47ab4afed -> 20ae11722



[SPARK-19016][SQL][DOC] Document scalable partition handling

This PR documents the scalable partition handling feature in the body of the 
programming guide.

Before this PR, we only mentioned it in the migration guide. It's not super clear
that external datasource tables require an extra `MSCK REPAIR TABLE` command to
have per-partition information persisted since 2.1.

N/A.

Author: Cheng Lian <l...@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.

(cherry picked from commit 871f6114ac0075a1b45eda8701113fa20d647de9)
Signed-off-by: Cheng Lian <l...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/20ae1172
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/20ae1172
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/20ae1172

Branch: refs/heads/branch-2.1
Commit: 20ae11722d82cf3cdaa8c4023e37c1416664917d
Parents: 47ab4af
Author: Cheng Lian <l...@databricks.com>
Authored: Fri Dec 30 14:46:30 2016 -0800
Committer: Cheng Lian <l...@databricks.com>
Committed: Fri Dec 30 14:50:56 2016 -0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/20ae1172/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index d57f22e..58de0e1 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -515,7 +515,7 @@ new data.
 ### Saving to Persistent Tables
 
 `DataFrames` can also be saved as persistent tables into Hive metastore using the `saveAsTable`
-command. Notice existing Hive deployment is not necessary to use this feature. Spark will create a
+command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
 default local Hive metastore (using Derby) for you. Unlike the `createOrReplaceTempView` command,
 `saveAsTable` will materialize the contents of the DataFrame and create a pointer to the data in the
 Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
@@ -526,6 +526,18 @@ By default `saveAsTable` will create a "managed table", meaning that the locatio
 be controlled by the metastore. Managed tables will also have their data deleted automatically
 when a table is dropped.
 
+Currently, `saveAsTable` does not expose an API supporting the creation of an "external table" from a `DataFrame`.
+However, this functionality can be achieved by providing a `path` option to the `DataFrameWriter` with `path` as the key
+and location of the external table as its value (a string) when saving the table with `saveAsTable`. When an External table
+is dropped only its metadata is removed.
+
+Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
+
+- Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
+- Hive DDLs such as `ALTER TABLE PARTITION ... SET LOCATION` are now available for tables created with the Datasource API.
+
+Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
+
 ## Parquet Files
 
 [Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
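
For readers of this patch, the external-table workflow it documents (a `path` option on the `DataFrameWriter`, followed by `MSCK REPAIR TABLE`) might look roughly like the sketch below in PySpark. The helper name `save_external_table` and its parameters are hypothetical and not part of the commit; `spark` is assumed to be a `SparkSession` and `df` a `DataFrame`. The repair step is issued only for partitioned tables, following the note added by the patch.

```python
def save_external_table(spark, df, table_name, path, partition_cols=()):
    """Save `df` as an external table at `path`, then sync its partitions.

    Hypothetical helper for illustration only. `spark` is assumed to be a
    SparkSession and `df` a DataFrame; nothing here is defined by the patch.
    """
    writer = df.write
    if partition_cols:
        # Lay the data out in one directory per partition value.
        writer = writer.partitionBy(*partition_cols)

    # Supplying a `path` option is what makes the table external: dropping
    # the table later removes only its metadata, not the files at `path`.
    writer.option("path", path).saveAsTable(table_name)

    if partition_cols:
        # Per the documented note, partition information for external
        # datasource tables is not gathered by default, so ask the
        # metastore to pick up the partitions found under `path`.
        spark.sql("MSCK REPAIR TABLE {}".format(table_name))
```

With a real session this could be called as `save_external_table(spark, df, "events", "/data/events", ["year"])`, where the table name, path, and partition column are placeholders.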

