Repository: spark
Updated Branches:
  refs/heads/master fdd466bed -> dc86a227e


[SPARK-9148] [SPARK-10252] [SQL] Update SQL Programming Guide

Author: Michael Armbrust <mich...@databricks.com>

Closes #8441 from marmbrus/documentation.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc86a227
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc86a227
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc86a227

Branch: refs/heads/master
Commit: dc86a227e4fc8a9d8c3e8c68da8dff9298447fd0
Parents: fdd466b
Author: Michael Armbrust <mich...@databricks.com>
Authored: Thu Aug 27 11:45:15 2015 -0700
Committer: Michael Armbrust <mich...@databricks.com>
Committed: Thu Aug 27 11:45:15 2015 -0700

----------------------------------------------------------------------
 docs/sql-programming-guide.md | 92 ++++++++++++++++++++++++++++++--------
 1 file changed, 73 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/dc86a227/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index e64190b..99fec6c 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -11,7 +11,7 @@ title: Spark SQL and DataFrames
 
 Spark SQL is a Spark module for structured data processing. It provides a
 programming abstraction called DataFrames and can also act as a distributed SQL
 query engine.
 
-For how to enable Hive support, please refer to the [Hive 
Tables](#hive-tables) section.
+Spark SQL can also be used to read data from an existing Hive installation.  
For more on how to configure this feature, please refer to the [Hive 
Tables](#hive-tables) section.
 
 # DataFrames
 
@@ -213,6 +213,11 @@ df.groupBy("age").count().show()
 // 30   1
 {% endhighlight %}
 
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more.  The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
+
+
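As a minimal sketch, assuming the `df` with `name` and `age` columns from the example above and an active `SQLContext`, the functions library can be mixed freely with plain column expressions:

{% highlight scala %}
// upper, count and avg come from the org.apache.spark.sql.functions object.
import org.apache.spark.sql.functions._

// String manipulation combined with an ordinary column expression.
df.select(upper(df("name")), df("age") + 1).show()

// Aggregate functions from the same library can be used with groupBy.
df.groupBy("age").agg(count(df("name")), avg(df("age"))).show()
{% endhighlight %}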
 </div>
 
 <div data-lang="java" markdown="1">
@@ -263,6 +268,10 @@ df.groupBy("age").count().show();
 // 30   1
 {% endhighlight %}
 
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/java/org/apache/spark/sql/DataFrame.html).
+
+In addition to simple column references and expressions, DataFrames also have 
a rich library of functions including string manipulation, date arithmetic, 
common math operations and more.  The complete list is available in the 
[DataFrame Function Reference](api/java/org/apache/spark/sql/functions.html).
+
 </div>
 
 <div data-lang="python"  markdown="1">
@@ -320,6 +329,10 @@ df.groupBy("age").count().show()
 
 {% endhighlight %}
 
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
+
+In addition to simple column references and expressions, DataFrames also have 
a rich library of functions including string manipulation, date arithmetic, 
common math operations and more.  The complete list is available in the 
[DataFrame Function 
Reference](api/python/pyspark.sql.html#module-pyspark.sql.functions).
+
 </div>
 
 <div data-lang="r"  markdown="1">
@@ -370,10 +383,13 @@ showDF(count(groupBy(df, "age")))
 
 {% endhighlight %}
 
-</div>
+For a complete list of the types of operations that can be performed on a DataFrame, refer to the [API Documentation](api/R/index.html).
+
+In addition to simple column references and expressions, DataFrames also have 
a rich library of functions including string manipulation, date arithmetic, 
common math operations and more.  The complete list is available in the 
[DataFrame Function Reference](api/R/index.html).
 
 </div>
 
+</div>
 
 ## Running SQL Queries Programmatically
 
@@ -870,12 +886,11 @@ saveDF(select(df, "name", "age"), "namesAndAges.parquet", 
"parquet")
 
 Save operations can optionally take a `SaveMode`, which specifies how to handle existing data if
 present.  It is important to realize that these save modes do not utilize any locking and are not
-atomic.  Thus, it is not safe to have multiple writers attempting to write to 
the same location.
-Additionally, when performing a `Overwrite`, the data will be deleted before 
writing out the
+atomic.  Additionally, when performing an `Overwrite`, the data will be deleted before writing out the
 new data.
 
 <table class="table">
-<tr><th>Scala/Java</th><th>Python</th><th>Meaning</th></tr>
+<tr><th>Scala/Java</th><th>Any Language</th><th>Meaning</th></tr>
 <tr>
   <td><code>SaveMode.ErrorIfExists</code> (default)</td>
   <td><code>"error"</code> (default)</td>
@@ -1671,12 +1686,12 @@ results <- collect(sql(sqlContext, "FROM src SELECT 
key, value"))
 ### Interacting with Different Versions of Hive Metastore
 
 One of the most important pieces of Spark SQL's Hive support is interaction 
with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 
1.4.0, a single binary build of Spark SQL can be used to query different 
versions of Hive metastores, using the configuration described below.
+which enables Spark SQL to access metadata of Hive tables. Starting from Spark 
1.4.0, a single binary 
+build of Spark SQL can be used to query different versions of Hive metastores, 
using the configuration described below.
+Note that, independent of the version of Hive being used to talk to the metastore, Spark SQL internally
+compiles against Hive 1.2.1 and uses those classes for internal execution (serdes, UDFs, UDAFs, etc.).
 
-Internally, Spark SQL uses two Hive clients, one for executing native Hive 
commands like `SET`
-and `DESCRIBE`, the other dedicated for communicating with Hive metastore. The 
former uses Hive
-jars of version 0.13.1, which are bundled with Spark 1.4.0. The latter uses 
Hive jars of the
-version specified by users. An isolated classloader is used here to avoid 
dependency conflicts.
+The following options can be used to configure the version of Hive that is 
used to retrieve metadata:
 
 <table class="table">
   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
@@ -1685,7 +1700,7 @@ version specified by users. An isolated classloader is 
used here to avoid depend
     <td><code>0.13.1</code></td>
     <td>
       Version of the Hive metastore. Available
-      options are <code>0.12.0</code> and <code>0.13.1</code>. Support for 
more versions is coming in the future.
+      options are <code>0.12.0</code> through <code>1.2.1</code>.
     </td>
   </tr>
   <tr>
@@ -1696,12 +1711,16 @@ version specified by users. An isolated classloader is 
used here to avoid depend
       property can be one of three options:
       <ol>
         <li><code>builtin</code></li>
-        Use Hive 0.13.1, which is bundled with the Spark assembly jar when 
<code>-Phive</code> is
+        Use Hive 1.2.1, which is bundled with the Spark assembly jar when 
<code>-Phive</code> is
         enabled. When this option is chosen, 
<code>spark.sql.hive.metastore.version</code> must be
-        either <code>0.13.1</code> or not defined.
+        either <code>1.2.1</code> or not defined.
         <li><code>maven</code></li>
-        Use Hive jars of specified version downloaded from Maven repositories.
-        <li>A classpath in the standard format for both Hive and Hadoop.</li>
+        Use Hive jars of the specified version downloaded from Maven repositories.  This configuration
+        is not generally recommended for production deployments.
+        <li>A classpath in the standard format for the JVM.  This classpath must include all of Hive
+        and its dependencies, including the correct version of Hadoop.  These jars only need to be
+        present on the driver, but if you are running in yarn cluster mode then you must ensure
+        they are packaged with your application.</li>
       </ol>
     </td>
   </tr>
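As a sketch of how these properties might be set, assuming a connection to a 0.13.1 metastore with jars resolved from Maven (they are ordinary Spark configuration settings, so `spark-defaults.conf` or `--conf` on spark-submit work equally well):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Both properties should be in place before the HiveContext is created.
val conf = new SparkConf()
  .setAppName("metastore-example")   // hypothetical application name
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)
{% endhighlight %}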
@@ -2017,6 +2036,28 @@ options.
 
 # Migration Guide
 
+## Upgrading From Spark SQL 1.4 to 1.5
+
+ - Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
+   code generation for expression evaluation.  These features can both be disabled by setting
+   `spark.sql.tungsten.enabled` to `false`, as shown in the sketch after this list.
+ - Parquet schema merging is no longer enabled by default.  It can be 
re-enabled by setting 
+   `spark.sql.parquet.mergeSchema` to `true`.
+ - Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
+   access nested values.  For example `df['table.column.nestedField']`.  However, this means that if
+   your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
+ - In-memory columnar storage partition pruning is on by default. It can be 
disabled by setting
+   `spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
+ - Unlimited precision decimal columns are no longer supported, instead Spark 
SQL enforces a maximum
+   precision of 38.  When inferring schema from `BigDecimal` objects, a 
precision of (38, 18) is now
+   used. When no precision is specified in DDL then the default remains 
`Decimal(10, 0)`.
+ - Timestamps are now stored at a precision of 1us, rather than 1ns.
+ - In the `sql` dialect, floating point numbers are now parsed as decimal.  
HiveQL parsing remains
+   unchanged.
+ - The canonical names of SQL/DataFrame functions are now lower case (e.g., sum vs SUM).
+ - It has been determined that using the DirectOutputCommitter when speculation is enabled is unsafe, so
+   this output committer will not be used when speculation is on, independent of configuration.
+
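The sketch below, assuming an existing `sqlContext`, shows how the flags mentioned above might be flipped back to their 1.4 defaults:

{% highlight scala %}
// Restore 1.4-era behavior on an existing SQLContext; values are strings.
sqlContext.setConf("spark.sql.tungsten.enabled", "false")     // disable Tungsten and codegen
sqlContext.setConf("spark.sql.parquet.mergeSchema", "true")   // re-enable Parquet schema merging
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.partitionPruning", "false")
{% endhighlight %}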
 ## Upgrading from Spark SQL 1.3 to 1.4
 
 #### DataFrame data reader/writer interface
@@ -2038,7 +2079,8 @@ See the API docs for `SQLContext.read` (
 
 #### DataFrame.groupBy retains grouping columns
 
-Based on user feedback, we changed the default behavior of 
`DataFrame.groupBy().agg()` to retain the grouping columns in the resulting 
`DataFrame`. To keep the behavior in 1.3, set `spark.sql.retainGroupColumns` to 
`false`.
+Based on user feedback, we changed the default behavior of 
`DataFrame.groupBy().agg()` to retain the
+grouping columns in the resulting `DataFrame`. To keep the behavior in 1.3, 
set `spark.sql.retainGroupColumns` to `false`.
 
 <div class="codetabs">
 <div data-lang="scala"  markdown="1">
@@ -2175,7 +2217,7 @@ Python UDF registration is unchanged.
 When using DataTypes in Python you will need to construct them (i.e. 
`StringType()`) instead of
 referencing a singleton.
 
-## Migration Guide for Shark User
+## Migration Guide for Shark Users
 
 ### Scheduling
 To set a [Fair Scheduler](job-scheduling.html#fair-scheduler-pools) pool for a 
JDBC client session,
@@ -2251,6 +2293,7 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * User defined functions (UDF)
 * User defined aggregation functions (UDAF)
 * User defined serialization formats (SerDes)
+* Window functions
 * Joins
   * `JOIN`
   * `{LEFT|RIGHT|FULL} OUTER JOIN`
@@ -2261,7 +2304,7 @@ Spark SQL supports the vast majority of Hive features, 
such as:
   * `SELECT col FROM ( SELECT a + b AS col from t1) t2`
 * Sampling
 * Explain
-* Partitioned tables
+* Partitioned tables including dynamic partition insertion
 * View
 * All Hive DDL Functions, including:
   * `CREATE TABLE`
@@ -2323,8 +2366,9 @@ releases of Spark SQL.
   Hive can optionally merge the small files into fewer large files to avoid 
overflowing the HDFS
   metadata. Spark SQL does not support that.
 
+# Reference
 
-# Data Types
+## Data Types
 
 Spark SQL and DataFrames support the following data types:
 
@@ -2937,3 +2981,13 @@ from pyspark.sql.types import *
 
 </div>
 
+## NaN Semantics
+
+There is special handling for not-a-number (NaN) values when dealing with `float` or `double` types that
+does not exactly match standard floating point semantics.
+Specifically:
+
+ - NaN = NaN returns true.
+ - In aggregations all NaN values are grouped together.
+ - NaN is treated as a normal value in join keys.
+ - NaN values go last when sorted in ascending order, larger than any other numeric value.
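A small sketch of these semantics, assuming a `sqlContext` with `import sqlContext.implicits._` in scope:

{% highlight scala %}
// Two NaN values and one ordinary double.
val df = Seq(1.0, Double.NaN, Double.NaN).toDF("d")

// In aggregations, all NaN values land in a single group.
df.groupBy("d").count().show()

// NaN = NaN returns true, so this filter keeps the NaN rows.
df.filter(df("d") === Double.NaN).show()

// NaN sorts last in ascending order, after every other numeric value.
df.sort("d").show()
{% endhighlight %}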

