Repository: spark-website
Updated Branches:
  refs/heads/asf-site 175d31a25 -> 7cd1fdf23


More comprehensive new features


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/7cd1fdf2
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/7cd1fdf2
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/7cd1fdf2

Branch: refs/heads/asf-site
Commit: 7cd1fdf235b270b2aa38f8bb68d2e451ff618e2e
Parents: 175d31a
Author: Reynold Xin <r...@databricks.com>
Authored: Tue Jul 26 15:29:07 2016 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Tue Jul 26 15:29:07 2016 -0700

----------------------------------------------------------------------
 .../_posts/2016-07-27-spark-release-2-0-0.md    | 40 +++++++++-----
 site/releases/spark-release-2-0-0.html          | 58 +++++++++++++-------
 2 files changed, 66 insertions(+), 32 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark-website/blob/7cd1fdf2/releases/_posts/2016-07-27-spark-release-2-0-0.md
----------------------------------------------------------------------
diff --git a/releases/_posts/2016-07-27-spark-release-2-0-0.md 
b/releases/_posts/2016-07-27-spark-release-2-0-0.md
index 9969ce8..8d35967 100644
--- a/releases/_posts/2016-07-27-spark-release-2-0-0.md
+++ b/releases/_posts/2016-07-27-spark-release-2-0-0.md
@@ -34,38 +34,46 @@ One of the largest changes in Spark 2.0 is the new updated 
APIs:
  - SparkSession: new entry point that replaces the old SQLContext and 
HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept 
for backward compatibility.
  - A new, streamlined configuration API for SparkSession
  - Simpler, more performant accumulator API (see the sketch after this list)
+ - A new, improved Aggregator API for typed aggregation in Datasets (see the Aggregator sketch below)
 
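For illustration (not part of the patch), a minimal sketch of the new SparkSession entry point, its configuration API, and the new accumulator API; the app name, config value, and data are made up:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the single entry point that replaces SQLContext/HiveContext.
val spark = SparkSession.builder()
  .appName("ReleaseNotesExample")               // hypothetical app name
  .config("spark.sql.shuffle.partitions", "8")  // streamlined config API
  .getOrCreate()

// The new accumulator API: typed accumulators created from the SparkContext.
val errorCount = spark.sparkContext.longAccumulator("errors")
spark.sparkContext.parallelize(1 to 100).foreach { i =>
  if (i % 10 == 0) errorCount.add(1)
}
println(errorCount.value)  // 10
```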
 
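And a sketch of the new typed Aggregator API; the `Sale` case class and its fields are hypothetical:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Sale(item: String, amount: Double)  // hypothetical input type

// Aggregator[IN, BUF, OUT]: a typed aggregation over a Dataset[Sale].
object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0                               // initial buffer value
  def reduce(buf: Double, s: Sale): Double = buf + s.amount
  def merge(b1: Double, b2: Double): Double = b1 + b2  // combine partial sums
  def finish(buf: Double): Double = buf                // final result
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage on a Dataset[Sale]: sales.select(SumAmount.toColumn)
```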
 #### SQL
 
Spark 2.0 substantially improved SQL functionality, with SQL:2003 support. Spark SQL can now run all 99 TPC-DS queries. Most prominently, we have improved:
 
+ - A native SQL parser that supports both ANSI SQL and HiveQL
+ - Native DDL command implementations
  - Subquery support (see the examples after this list), including
- - Uncorrelated Scalar Subqueries
- - Correlated Scalar Subqueries
- - NOT IN predicate Subqueries (in WHERE/HAVING clauses)
- - IN predicate subqueries (in WHERE/HAVING clauses)
- - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
+   - Uncorrelated Scalar Subqueries
+   - Correlated Scalar Subqueries
+   - NOT IN predicate Subqueries (in WHERE/HAVING clauses)
+   - IN predicate subqueries (in WHERE/HAVING clauses)
+   - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
  - View canonicalization support
 
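As an editorial illustration of the subquery forms listed above (not part of the patch), using hypothetical tables `t1(a, k)` and `t2(b, k)` and the `spark` session from the earlier sketch:

```scala
// Uncorrelated scalar subquery in the SELECT list.
spark.sql("SELECT a, (SELECT MAX(b) FROM t2) AS max_b FROM t1")

// Correlated scalar subquery: the inner query refers to the outer row.
spark.sql("SELECT a, (SELECT MAX(b) FROM t2 WHERE t2.k = t1.k) FROM t1")

// IN / NOT IN predicate subqueries in WHERE.
spark.sql("SELECT * FROM t1 WHERE a IN (SELECT b FROM t2)")
spark.sql("SELECT * FROM t1 WHERE a NOT IN (SELECT b FROM t2)")

// (NOT) EXISTS predicate subqueries in WHERE.
spark.sql("SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.k = t1.k)")
```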
In addition, when building without Hive support, Spark SQL should have almost all of the functionality available when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
 
 
-#### Performance
+#### New Features
+
+ - Native CSV data source, based on Databricks' [spark-csv 
module](https://github.com/databricks/spark-csv)
+ - Off-heap memory management for both caching and runtime execution
+ - Hive-style bucketing support
+ - Approximate summary statistics using sketches, including approximate quantiles, Bloom filters, and count-min sketches (see the examples after this list)
+
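For illustration (not part of the patch), sketches of the CSV source, the sketch-based statistics, and bucketing; paths, column names, and parameter values are made up:

```scala
// Native CSV data source (path and schema options are hypothetical).
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/people.csv")

// Approximate summary statistics backed by sketches.
val medianAge = df.stat.approxQuantile("age", Array(0.5), 0.01) // approx. median
val bloom     = df.stat.bloomFilter("name", 1000000, 0.03)      // Bloom filter
val cms       = df.stat.countMinSketch("age", 0.001, 0.99, 42)  // count-min sketch

// Hive-style bucketing when writing out a managed table.
df.write.bucketBy(8, "age").sortBy("name").saveAsTable("people_bucketed")
```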
+
+#### Performance and Runtime
 
  - Substantial (2-10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation (see the `explain()` sketch below)
  - Improved Parquet scan throughput through vectorization
  - Improved ORC performance
  - Many improvements in the Catalyst query optimizer for common workloads
  - Improved window function performance via native implementations for all 
window functions
+ - Automatic file coalescing for native data sources
 
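One way to see whole-stage code generation at work (editorial sketch, not part of the patch): operators compiled into a single generated function are prefixed with `*` in the physical plan. The plan shown in the comments is abridged and will vary with version and configuration:

```scala
spark.range(1000).selectExpr("sum(id)").explain()
// == Physical Plan == (abridged, illustrative)
// *HashAggregate(keys=[], functions=[sum(id#0L)])
// +- Exchange SinglePartition
//    +- *HashAggregate(keys=[], functions=[partial_sum(id#0L)])
//       +- *Range (0, 1000, splits=8)
```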
 
 ### MLlib
-The DataFrame-based API is now the primary API. The RDD-based API is entering 
maintenance mode. See the MLlib guide for details.
-
-#### API changes
-The largest API change is in linear algebra.  The DataFrame-based API 
(spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather 
than in spark.mllib.linalg.  This removes the last dependencies of spark.ml.* 
on spark.mllib.*.  (SPARK-13944)
-See the MLlib migration guide for a full list of API changes.
+The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.
 
 #### New features
 
@@ -99,9 +107,14 @@ Spark 2.0 ships the initial experimental release for 
Structured Streaming, a hig
 For the DStream API, the most prominent update is the new experimental support 
for Kafka 0.10.
 
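A minimal sketch of the new Kafka 0.10 direct stream (not part of the patch; requires the spark-streaming-kafka-0-10 artifact, and the broker address, group id, and topic name are made up):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(spark.sparkContext, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"              // hypothetical group id
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)
stream.map(_.value).print()

ssc.start()
ssc.awaitTermination()
```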
 
-### Operational and Packaging Improvements
+### Dependency and Packaging Improvements
+
+There are a variety of changes to Spark's dependencies and packaging process:
 
-There are a variety of improvements to Spark's operations and packaging 
process. The most prominent change is that Spark 2.0 no longer requires a fat 
assembly jar for production deployment.
+ - Spark 2.0 no longer requires a fat assembly jar for production deployment.
+ - The Akka dependency has been removed; as a result, user applications can program against any version of Akka.
+ - The Kryo version has been bumped to 3.0.
+ - The default build now uses Scala 2.11 rather than Scala 2.10 (see the build snippet below).
 
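For example (editorial sketch, not part of the patch), an sbt build against the Scala 2.11 artifacts; the version numbers are illustrative:

```scala
// build.sbt: Spark 2.0 artifacts are published for Scala 2.11 by default.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the application jar.
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
)
```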
 
 ### Removals, Behavior Changes and Deprecations
@@ -134,6 +147,7 @@ The following changes might require updating existing 
applications that depend o
- Java RDD’s flatMap and mapPartitions functions used to require functions returning a Java Iterable. They have been updated to require functions returning a Java Iterator, so the functions do not need to materialize all the data.
- Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
- When writing Parquet files, the summary files are not written by default. To re-enable them, users must set “parquet.enable.summary-metadata” to true.
+- The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes (import change sketched after this list).
 
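For DataFrame-based pipelines, the linear algebra change above usually amounts to an import change (editorial sketch, not part of the patch):

```scala
// Before (RDD-based API types):
//   import org.apache.spark.mllib.linalg.{Vector, Vectors}
// After, for the DataFrame-based API (spark.ml):
import org.apache.spark.ml.linalg.{Vector, Vectors}

val v: Vector = Vectors.dense(1.0, 2.0, 3.0)  // construction is unchanged
```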
 
 For a more complete list, please see 
[SPARK-11806](https://issues.apache.org/jira/browse/SPARK-11806) for 
deprecations and removals.

http://git-wip-us.apache.org/repos/asf/spark-website/blob/7cd1fdf2/site/releases/spark-release-2-0-0.html
----------------------------------------------------------------------
diff --git a/site/releases/spark-release-2-0-0.html 
b/site/releases/spark-release-2-0-0.html
index ffa8255..cf6f86b 100644
--- a/site/releases/spark-release-2-0-0.html
+++ b/site/releases/spark-release-2-0-0.html
@@ -195,18 +195,18 @@
   <li><a href="#core-and-spark-sql">Core and Spark SQL</a>    <ul>
       <li><a href="#programming-apis">Programming APIs</a></li>
       <li><a href="#sql">SQL</a></li>
-      <li><a href="#performance">Performance</a></li>
+      <li><a href="#new-features">New Features</a></li>
+      <li><a href="#performance-and-runtime">Performance and Runtime</a></li>
     </ul>
   </li>
   <li><a href="#mllib">MLlib</a>    <ul>
-      <li><a href="#api-changes">API changes</a></li>
-      <li><a href="#new-features">New features</a></li>
+      <li><a href="#new-features-1">New features</a></li>
       <li><a href="#speedscaling">Speed/scaling</a></li>
     </ul>
   </li>
   <li><a href="#sparkr">SparkR</a></li>
   <li><a href="#streaming">Streaming</a></li>
-  <li><a href="#operational-and-packaging-improvements">Operational and 
Packaging Improvements</a></li>
+  <li><a href="#dependency-and-packaging-improvements">Dependency and 
Packaging Improvements</a></li>
   <li><a href="#removals-behavior-changes-and-deprecations">Removals, Behavior 
Changes and Deprecations</a>    <ul>
       <li><a href="#removals">Removals</a></li>
       <li><a href="#behavior-changes">Behavior Changes</a></li>
@@ -232,6 +232,7 @@
   <li>SparkSession: new entry point that replaces the old SQLContext and 
HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept 
for backward compatibility.</li>
   <li>A new, streamlined configuration API for SparkSession</li>
   <li>Simpler, more performant accumulator API</li>
+  <li>A new, improved Aggregator API for typed aggregation in Datasets</li>
 </ul>
 
 <h4 id="sql">SQL</h4>
@@ -239,18 +240,32 @@
 <p>Spark 2.0 substantially improved SQL functionality, with SQL:2003 support. Spark SQL can now run all 99 TPC-DS queries. Most prominently, we have improved:</p>
 
 <ul>
-  <li>Subquery support, including</li>
-  <li>Uncorrelated Scalar Subqueries</li>
-  <li>Correlated Scalar Subqueries</li>
-  <li>NOT IN predicate Subqueries (in WHERE/HAVING clauses)</li>
-  <li>IN predicate subqueries (in WHERE/HAVING clauses)</li>
-  <li>(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)</li>
+  <li>A native SQL parser that supports both ANSI SQL and HiveQL</li>
+  <li>Native DDL command implementations</li>
+  <li>Subquery support, including
+    <ul>
+      <li>Uncorrelated Scalar Subqueries</li>
+      <li>Correlated Scalar Subqueries</li>
+      <li>NOT IN predicate Subqueries (in WHERE/HAVING clauses)</li>
+      <li>IN predicate subqueries (in WHERE/HAVING clauses)</li>
+      <li>(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)</li>
+    </ul>
+  </li>
   <li>View canonicalization support</li>
 </ul>
 
 <p>In addition, when building without Hive support, Spark SQL should have almost all of the functionality available when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.</p>
 
-<h4 id="performance">Performance</h4>
+<h4 id="new-features">New Features</h4>
+
+<ul>
+  <li>Native CSV data source, based on Databricks&#8217; <a href="https://github.com/databricks/spark-csv">spark-csv module</a></li>
+  <li>Off-heap memory management for both caching and runtime execution</li>
+  <li>Hive-style bucketing support</li>
+  <li>Approximate summary statistics using sketches, including approximate quantiles, Bloom filters, and count-min sketches.</li>
+</ul>
+
+<h4 id="performance-and-runtime">Performance and Runtime</h4>
 
 <ul>
  <li>Substantial (2-10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.</li>
@@ -258,16 +273,13 @@
   <li>Improved ORC performance</li>
   <li>Many improvements in the Catalyst query optimizer for common 
workloads</li>
   <li>Improved window function performance via native implementations for all 
window functions</li>
+  <li>Automatic file coalescing for native data sources</li>
 </ul>
 
 <h3 id="mllib">MLlib</h3>
-<p>The DataFrame-based API is now the primary API. The RDD-based API is 
entering maintenance mode. See the MLlib guide for details.</p>
+<p>The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.</p>
 
-<h4 id="api-changes">API changes</h4>
-<p>The largest API change is in linear algebra.  The DataFrame-based API 
(spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather 
than in spark.mllib.linalg.  This removes the last dependencies of spark.ml.* 
on spark.mllib.*.  (SPARK-13944)
-See the MLlib migration guide for a full list of API changes.</p>
-
-<h4 id="new-features">New features</h4>
+<h4 id="new-features-1">New features</h4>
 
 <ul>
   <li>ML persistence: The DataFrames-based API provides near-complete support 
for saving and loading ML models and Pipelines in Scala, Java, Python, and R.  
See this blog post for details.  (SPARK-6725, SPARK-11939, SPARK-14311)</li>
@@ -300,9 +312,16 @@ See the MLlib migration guide for a full list of API 
changes.</p>
 
 <p>For the DStream API, the most prominent update is the new experimental 
support for Kafka 0.10.</p>
 
-<h3 id="operational-and-packaging-improvements">Operational and Packaging 
Improvements</h3>
+<h3 id="dependency-and-packaging-improvements">Dependency and Packaging 
Improvements</h3>
+
+<p>There are a variety of changes to Spark&#8217;s dependencies and packaging process:</p>
 
-<p>There are a variety of improvements to Spark&#8217;s operations and 
packaging process. The most prominent change is that Spark 2.0 no longer 
requires a fat assembly jar for production deployment.</p>
+<ul>
+  <li>Spark 2.0 no longer requires a fat assembly jar for production 
deployment.</li>
+  <li>The Akka dependency has been removed; as a result, user applications can program against any version of Akka.</li>
+  <li>The Kryo version has been bumped to 3.0.</li>
+  <li>The default build now uses Scala 2.11 rather than Scala 2.10.</li>
+</ul>
 
 <h3 id="removals-behavior-changes-and-deprecations">Removals, Behavior Changes 
and Deprecations</h3>
 
@@ -337,6 +356,7 @@ See the MLlib migration guide for a full list of API 
changes.</p>
  <li>Java RDD’s flatMap and mapPartitions functions used to require functions returning a Java Iterable. They have been updated to require functions returning a Java Iterator, so the functions do not need to materialize all the data.</li>
  <li>Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.</li>
  <li>When writing Parquet files, the summary files are not written by default. To re-enable them, users must set “parquet.enable.summary-metadata” to true.</li>
+  <li>The DataFrame-based API (spark.ml) now depends upon local linear algebra 
in spark.ml.linalg, rather than in spark.mllib.linalg.  This removes the last 
dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib 
migration guide for a full list of API changes.</li>
 </ul>
 
 <p>For a more complete list, please see <a href="https://issues.apache.org/jira/browse/SPARK-11806">SPARK-11806</a> for deprecations and removals.</p>

