spark-website git commit: Add 2.0.0 release notes

rxin Tue, 26 Jul 2016 14:41:43 -0700

Repository: spark-website
Updated Branches:
  refs/heads/asf-site effcd547b -> 0323eb078



Add 2.0.0 release notes


Project: http://git-wip-us.apache.org/repos/asf/spark-website/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark-website/commit/0323eb07
Tree: http://git-wip-us.apache.org/repos/asf/spark-website/tree/0323eb07
Diff: http://git-wip-us.apache.org/repos/asf/spark-website/diff/0323eb07

Branch: refs/heads/asf-site
Commit: 0323eb0787282068ce34de0b953a5c3a5c24e84c
Parents: effcd54
Author: Reynold Xin <r...@databricks.com>
Authored: Tue Jul 26 14:41:18 2016 -0700
Committer: Reynold Xin <r...@databricks.com>
Committed: Tue Jul 26 14:41:18 2016 -0700

----------------------------------------------------------------------
 .../_posts/2016-07-27-spark-release-2-0-0.md    | 145 ++++++++
 site/releases/spark-release-2-0-0.html          | 372 +++++++++++++++++++
 2 files changed, 517 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark-website/blob/0323eb07/releases/_posts/2016-07-27-spark-release-2-0-0.md
----------------------------------------------------------------------
diff --git a/releases/_posts/2016-07-27-spark-release-2-0-0.md 
b/releases/_posts/2016-07-27-spark-release-2-0-0.md
new file mode 100644
index 0000000..8b26c04
--- /dev/null
+++ b/releases/_posts/2016-07-27-spark-release-2-0-0.md
@@ -0,0 +1,145 @@
+---
+layout: post
+title: Spark Release 2.0.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+  _edit_last: '4'
+  _wpas_done_all: '1'
+---
+
+Apache Spark 2.0.0 is the first release on the 2.x line. This release includes 
over 2500 patches from over 300 contributors. Some breaking changes have been 
made with respect to the 1.x line. To download Apache Spark 2.0.0, visit the 
[downloads](http://spark.apache.org/downloads.html) page.
+
+You can consult JIRA for the [detailed 
changes](https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12329449).
 We have curated a list of high level changes here, grouped by major modules.
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+
+### Core and Spark SQL
+
+#### Programming APIs
+
+One of the largest changes in Spark 2.0 is the set of new and updated APIs:
+
+ - Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset 
have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In 
Python and R, given the lack of type safety, DataFrame is the main programming 
interface.
+ - SparkSession: new entry point that replaces the old SQLContext and 
HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept 
for backward compatibility.
+ - A new, streamlined configuration API for SparkSession
+ - Simpler, more performant accumulator API
+
+
+#### SQL
+
+Spark 2.0 substantially improved SQL functionality with SQL:2003 support. 
Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:
+
+ - Subquery support, including
+   - Uncorrelated scalar subqueries
+   - Correlated scalar subqueries
+   - NOT IN predicate subqueries (in WHERE/HAVING clauses)
+   - IN predicate subqueries (in WHERE/HAVING clauses)
+   - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
+ - View canonicalization support
+
+In addition, when building without Hive support, Spark SQL should have almost 
all of the functionality available when building with Hive support, with the 
exception of Hive connectivity, Hive UDFs, and script transforms.
+
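As a rough illustration of the subquery forms listed above, the following sketch runs one query of each kind. It uses Python's built-in sqlite3 module rather than Spark SQL so that it runs anywhere; the `dept` and `emp` tables and their rows are made up for this example, and Spark's own syntax support and behavior may differ in details.

```python
import sqlite3

# Hypothetical sample data; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER, name TEXT);
    CREATE TABLE emp (name TEXT, dept_id INTEGER, salary INTEGER);
    INSERT INTO dept VALUES (1, 'eng'), (2, 'ops'), (3, 'hr');
    INSERT INTO emp VALUES ('ann', 1, 120), ('bob', 1, 100), ('cat', 2, 90);
""")


def first_column(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated scalar subquery: the inner query does not reference the outer row.
above_avg = first_column(
    "SELECT name FROM emp "
    "WHERE salary > (SELECT AVG(salary) FROM emp) ORDER BY name"
)

# Correlated scalar subquery: the inner query references the outer row (e.dept_id).
top_per_dept = first_column(
    "SELECT name FROM emp e "
    "WHERE salary = (SELECT MAX(salary) FROM emp WHERE dept_id = e.dept_id) "
    "ORDER BY name"
)

# IN predicate subquery in a WHERE clause.
staffed = first_column(
    "SELECT name FROM dept WHERE id IN (SELECT dept_id FROM emp) ORDER BY name"
)

# NOT EXISTS predicate subquery in a WHERE clause.
unstaffed = first_column(
    "SELECT name FROM dept d "
    "WHERE NOT EXISTS (SELECT 1 FROM emp WHERE dept_id = d.id) ORDER BY name"
)
```

The same query text should be a reasonable starting point for Spark SQL, but it has only been exercised here against SQLite.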
+
+#### Performance
+
+ - Substantial (2 - 10X) performance speedups for common operators in SQL and 
DataFrames via a new technique called whole stage code generation.
+ - Improved Parquet scan throughput through vectorization
+ - Improved ORC performance
+ - Many improvements in the Catalyst query optimizer for common workloads
+ - Improved window function performance via native implementations for all 
window functions
+
+
+### MLlib
+The DataFrame-based API is now the primary API. The RDD-based API is entering 
maintenance mode. See the MLlib guide for details.
+
+#### API changes
+The largest API change is in linear algebra.  The DataFrame-based API 
(spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather 
than in spark.mllib.linalg.  This removes the last dependencies of spark.ml.* 
on spark.mllib.*.  (SPARK-13944)
+See the MLlib migration guide for a full list of API changes.
+
+####  New features
+
+- ML persistence: The DataFrames-based API provides near-complete support for 
saving and loading ML models and Pipelines in Scala, Java, Python, and R.  See 
this blog post for details.  (SPARK-6725, SPARK-11939, SPARK-14311)
+- MLlib in R: SparkR now offers MLlib APIs for generalized linear models, 
naive Bayes, k-means clustering, and survival regression.  See this talk to 
learn more.
+- Python: PySpark now offers many more MLlib algorithms, including LDA, 
Gaussian Mixture Model, Generalized Linear Regression, and more.
+- Algorithms added to DataFrames-based API: Bisecting K-Means clustering, 
Gaussian Mixture Model, MaxAbsScaler feature transformer.
+
+This talk lists many of these new features.
+
+#### Speed/scaling
+Vectors and Matrices stored in DataFrames now use much more efficient 
serialization, reducing overhead in calling MLlib algorithms. (SPARK-14850)
+
+
+### SparkR
+
+The largest improvement to SparkR in Spark 2.0 is user-defined functions. 
There are three user-defined functions: dapply, gapply, and lapply. The first 
two, dapply and gapply, can be used for partition-based UDFs, e.g. 
partitioned model learning. The last, lapply, can be used for hyper-parameter 
tuning.
+
+In addition, there are a number of new features:
+
+- Improved algorithm coverage for machine learning in R, including naive 
Bayes, k-means clustering, and survival regression.
+- Generalized linear models support more families and link functions.
+- Save and load for all ML models.
+- More DataFrame functionality: Window functions API, reader, writer support 
for JDBC, CSV, SparkSession
+
+
+### Streaming
+
+Spark 2.0 ships the initial experimental release for Structured Streaming, a 
high level streaming API built on top of Spark SQL and the Catalyst optimizer. 
Structured Streaming enables users to program against streaming sources and 
sinks using the same DataFrame/Dataset API as in static data sources, 
leveraging the Catalyst optimizer to automatically incrementalize the query 
plans.
+
+For the DStream API, the most prominent update is the new experimental support 
for Kafka 0.10.
+
+
+### Removals, Deprecations and Breaking Changes
+
+#### Removals
+The following features have been removed in Spark 2.0:
+
+- Bagel
+- Support for Hadoop 2.1 and earlier
+- The ability to configure closure serializer
+- HTTPBroadcast
+- TTL-based metadata cleaning
+- Semi-private class org.apache.spark.Logging. We suggest you use slf4j 
directly.
+- SparkContext.metricsSystem
+- Block-oriented integration with Tachyon (subsumed by file system integration)
+- Methods deprecated in Spark 1.x
+- Methods on Python DataFrame that returned RDDs (map, flatMap, mapPartitions, 
etc). They are still available in dataframe.rdd field, e.g. dataframe.rdd.map.
+- Less frequently used streaming connectors, including Twitter, Akka, MQTT, 
ZeroMQ
+- Hash-based shuffle manager
+- History serving functionality from standalone Master
+- For Java and Scala, DataFrame no longer exists as a class. As a result, data 
sources would need to be updated.
+
+#### Deprecations
+The following features have been deprecated in Spark 2.0, and might be removed 
in future versions of Spark 2.x:
+
+- Fine-grained mode in Apache Mesos
+- Support for Java 7
+- Support for Python 2.6
+
+#### Breaking Changes
+The following changes might require updating existing applications that depend 
on the old behavior or API.
+
+- The default build is now using Scala 2.11 rather than Scala 2.10.
+- In SQL, floating literals are now parsed as decimal data type rather than 
double data type.
+- Kryo version is bumped to 3.0.
+- Java RDD’s flatMap and mapPartitions functions used to require functions 
returning Java Iterable. They have been updated to require functions returning 
Java Iterator so the functions do not need to materialize all the data.
+- Java RDD’s countByKey and countApproxDistinctByKey now return a map from 
K to java.lang.Long, rather than to java.lang.Object.
+- When writing Parquet files, the summary files are not written by default. To 
re-enable them, users must set “parquet.enable.summary-metadata” to true.
+
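To see why the change in how floating literals are parsed can affect existing queries, here is a small sketch that uses Python's decimal module as a stand-in. It only contrasts binary doubles with exact decimals; Spark SQL's decimal type has its own precision and scale rules, which are not modelled here.

```python
from decimal import Decimal

# Treating the literals 0.1 and 0.2 as binary doubles (the old interpretation).
as_double = 0.1 + 0.2
double_matches = as_double == 0.3

# Treating the same literals as exact decimals (the new interpretation).
as_decimal = Decimal("0.1") + Decimal("0.2")
decimal_matches = as_decimal == Decimal("0.3")

print(double_matches, decimal_matches)
```

Comparisons and arithmetic that happened to depend on double rounding may therefore give different results once the literals are decimals.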
+
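The motivation for the Iterable-to-Iterator change in the Java RDD functions can be sketched in plain Python, with a list standing in for a materialized Iterable and a generator standing in for an Iterator. The function names and the `log` argument are hypothetical and exist only to count how many elements were actually produced; this is not Spark code.

```python
from itertools import islice


def expand_eager(n, log):
    """Build the full result up front, like a function returning a collection."""
    out = []
    for i in range(n):
        log.append(i)  # record that element i was computed
        out.append(i * i)
    return out


def expand_lazy(n, log):
    """Yield results one at a time, like a function returning an iterator."""
    for i in range(n):
        log.append(i)  # record that element i was computed
        yield i * i


eager_log, lazy_log = [], []

# Both callers only want the first three results.
first_eager = expand_eager(1000, eager_log)[:3]
first_lazy = list(islice(expand_lazy(1000, lazy_log), 3))
```

In this sketch the eager version computes all 1000 elements before the caller sees any of them, while the lazy version computes only the three that are consumed.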
+For a more complete list, please see 
[SPARK-11806](https://issues.apache.org/jira/browse/SPARK-11806) for 
deprecations and removals.
+
+
+### Known Issues
+
+- Lead and Lag's behaviors have been changed from respecting nulls (the 1.6 
behavior) to ignoring nulls. This behavioral change will be fixed in 2.0.1 
(SPARK-16721).
+- Lead and Lag functions using constant input values do not return the 
default value when the offset row does not exist (SPARK-16633).
+
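For readers unfamiliar with the semantics the two issues above refer to, the following plain-Python sketch shows lag and lead over a single ordered partition with nulls respected and a default returned when the offset row does not exist. The helper names and signatures are assumptions for illustration, a non-negative offset is assumed, and this is not Spark's implementation.

```python
def lag(values, offset=1, default=None):
    """Value `offset` rows before each row; `default` if no such row exists."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead(values, offset=1, default=None):
    """Value `offset` rows after each row; `default` if no such row exists."""
    n = len(values)
    return [
        values[i + offset] if i + offset < n else default
        for i in range(n)
    ]


column = [10, None, 30]  # None plays the role of a SQL NULL

lagged = lag(column, 1, 0)
led = lead(column, 1, 0)
```

Here a null in the offset row is passed through as a null (respecting nulls), and the default is used only when the offset falls outside the partition.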
+
+### Credits
+Last but not least, this release would not have been possible without the 
following contributors: Aaron Tokhy, Abhinav Gupta, Abou Haydar Elias, Adam 
Budde, Adam Roberts, Ahmed Kamal, Ahmed Mahran, Alex Bozarth, Alexander Ulanov, 
Allen, Anatoliy Plastinin, Andrew, Andrew Ash, Andrew Or, Andrew Ray, Anthony 
Truchet, Antonio Murgia, Arun Allamsetty, Azeem Jiva, Ben McCann, BenFradet, 
Bertrand Bossy, Bill Chambers, Bjorn Jonsson, Bo Meng, Brandon Bradley, Brian 
O'Neill, BrianLondon, Bryan Cutler, Burak Köse, Burak Yavuz, Carson Wang, 
Cazen, Charles Allen, Cheng Hao, Cheng Lian, Claes Redestad, CodingCat, DB 
Tsai, DLucky, Daniel Jalova, Daoyuan Wang, Darek Blasiak, David Tolpin, Davies 
Liu, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Erastov, Dominik 
Jastrzębski, Dongjoon Hyun, Earthson Lu, Egor Pakhomov, Ehsan M.Kermani, Ergin 
Seyfe, Eric Liang, Ernest, Felix Cheung, Feynman Liang, Fokko Driesprong, 
Franklyn D'souza, François Garillot, Gabriele Nizzoli, Gary King, 
GayathriMurali, Gio Borje, Grace, Grzegorz Chilkiewicz, Guillaume Poulin, 
Gábor Lipták, Hemant Bhanawat, Herman van Hovell, Herman van Hövell tot 
Westerflier, Hiroshi 
Inoue, Holden Karau, Hossein, Huaxin Gao, Imran Rashid, Imran Younus, Ioana 
Delaney, Iulian Dragos, Jacek Laskowski, Jacek Lewandowski, Jakob Odersky, 
James Lohse, James Thomas, Jason Lee, Jason Moore, Jason White, Jean-Baptiste 
Onofré, Jeff L, Jeff Zhang, Jeremy Derr, JeremyNixon, Jo Voordeckers, Joan, 
Jon Maurer, Joseph K. Bradley, Josh Howes, Josh Rosen, Joshi, Juarez Bochi, 
Julien Baley, Junyang, Junyang Qian, Jurriaan Pruis, Kai Jiang, KaiXinXiaoLei, 
Kay Ousterhout, Kazuaki Ishizaki, Kevin Yu, Koert Kuipers, Kousuke Saruta, Koyo 
Yoshida, Krishna Kalyan, Lewuathe, Liang-Chi Hsieh, Lianhui Wang, Lin Zhao, 
Lining Sun, Liu Xiang, Liwei Lin, Luc Bourlier, Luciano Resende, Lukasz, Maciej 
Brynski, Malte, Marcelo Vanzin, Marcin Tustin, Mark Grover, Martin Menestret, 
Masayoshi TSUZUKI, Matei Zaharia, Matthew Wise, Michael Allman, 
Michael Armbrust, Michael Gummelt, Michel Lemay, Mike Dusenberry, Mortada Mehyar, 
Nakul Jindal, Nam Pham, Narine Kokhlikyan, NarineK, Neelesh Srinivas Salian, 
Nezih Yigitbasi, Nicholas Chammas, Nicholas Tietz, Nick Pentreath, Nilanjan 
Raychaudhuri, Nirman Narang, Nishkam Ravi, Nong, Nong Li, Oleg Danilov, Oliver 
Pierson, Oscar D. Lara Yejas, Parth Brahmbhatt, Patrick Wendell, Pete Robbins, 
Peter Ableda, Prajwal Tuladhar, Prashant Sharma, Pravin Gadakh, QiangCai, Qifan 
Pu, Raafat Akkad, Rahul Tanwani, Rajesh Balamohan, Rekha Joshi, Reynold Xin, 
Richard W. Eggert II, Robert Dodier, Robert Kruszewski, Robin East, Ruifeng 
Zheng, Ryan Blue, Sameer Agarwal, Sandeep Singh, Sanket, Sasaki Toru, Sean 
Owen, Sean Zhong, Sebastien Rainville, Sebastián Ramírez, Sela, Sergiusz 
Urbaniak, Shally Sangal, Sheamus K. Parkes, Shivaram Venkataraman, Shixiong 
Zhu, Shuai Lin, Shubhanshu Mishra, Sital Kedia, Stavros Kontopoulos, Stephan 
Kessler, Steve Loughran, Subhobrata Dey, Subroto Sanyal, Sumedh Mungee, 
 Sun Rui, Sunitha Kambhampati, Takahashi Hiroshi, Takeshi YAMAMURO, Takuya 
Kuwahara, Takuya UESHIN, Tathagata Das, Tejas Patil, Terence Yim, Thomas 
Graves, Timothy Chen, Timothy Hunter, Tom Graves, Tom Magrino, Tommy YU, Travis 
Crawford, Tristan Reid, Victor Chima, Villu Ruusmann, Wayne Song, WeichenXu, 
Weiqing Yang, Wenchen Fan, Wesley Tang, Wilson Wu, Wojciech Jurczyk, Xiangrui 
Meng, Xin Ren, Xin Wu, Xinh Huynh, Xiu Guo, Xusen Yin, Yadong Qi, Yanbo Liang, 
Yash Datta, Yin Huai, Yonathan Randolph, Yong Gang Cao, Yong Tang, Yu ISHIKAWA, 
Yucai Yu, Yuhao Yang, Yury Liavitski, Zhang, Liye, Zheng RuiFeng, Zheng Tan, 
aokolnychyi, bomeng, catapan, cody koeninger, dding3, depend, echo2mei, 
felixcheung, frreiss, fwang1, gatorsmile, guoxu1231, huangzhaowei, hushan, 
hyukjinkwon, jayadevanmurali, jeanlyn, jerryshao, jliwork, junhao, 
kaklakariada, krishnakalyan3, lfzCarlosC, lgieron, mark800, mathieu longtin, 
mcheah, meiyoula, movelikeriver, mwws, nfraison, oraviv, peng.zhang, 
petermaxlee, pierre-borckmans, poolis, prabs, proflin, pshearer, rotems, 
sachin aggarwal, sandy, 
scwf, seddonm1, sethah, sharkd, shijinkui, sureshthalamati, tedyu, thomastechs, 
tmnd1991, vijaykiran, wangfei, wangyang, wm...@hotmail.com, wujian, xin Wu, 
yzhou2001, zero323, zhonghaihua, zhuol, zlpmichelle, Örjan Lundberg, Yang Bo.

http://git-wip-us.apache.org/repos/asf/spark-website/blob/0323eb07/site/releases/spark-release-2-0-0.html
----------------------------------------------------------------------
diff --git a/site/releases/spark-release-2-0-0.html 
b/site/releases/spark-release-2-0-0.html
new file mode 100644
index 0000000..e260a0f
--- /dev/null
+++ b/site/releases/spark-release-2-0-0.html
@@ -0,0 +1,372 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1.0">
+
+  <title>
+     Spark Release 2.0.0 | Apache Spark
+    
+  </title>
+
+  
+
+  
+
+  <!-- Bootstrap core CSS -->
+  <link href="/css/cerulean.min.css" rel="stylesheet">
+  <link href="/css/custom.css" rel="stylesheet">
+
+  <!-- Code highlighter CSS -->
+  <link href="/css/pygments-default.css" rel="stylesheet">
+
+  <script type="text/javascript">
+  <!-- Google Analytics initialization -->
+  var _gaq = _gaq || [];
+  _gaq.push(['_setAccount', 'UA-32518208-2']);
+  _gaq.push(['_trackPageview']);
+  (function() {
+    var ga = document.createElement('script'); ga.type = 'text/javascript'; 
ga.async = true;
+    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 
'http://www') + '.google-analytics.com/ga.js';
+    var s = document.getElementsByTagName('script')[0]; 
s.parentNode.insertBefore(ga, s);
+  })();
+
+  <!-- Adds slight delay to links to allow async reporting -->
+  function trackOutboundLink(link, category, action) {
+    try {
+      _gaq.push(['_trackEvent', category , action]);
+    } catch(err){}
+
+    setTimeout(function() {
+      document.location.href = link.href;
+    }, 100);
+  }
+  </script>
+
+  <!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media 
queries -->
+  <!--[if lt IE 9]>
+  <script 
src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
+  <script 
src="https://oss.maxcdn.com/libs/respond.js/1.3.0/respond.min.js"></script>
+  <![endif]-->
+</head>
+
+<body>
+
+<script src="https://code.jquery.com/jquery.js"></script>
+<script 
src="//netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"></script>
+<script src="/js/lang-tabs.js"></script>
+<script src="/js/downloads.js"></script>
+
+<div class="container" style="max-width: 1200px;">
+
+<div class="masthead">
+  
+    <p class="lead">
+      <a href="/">
+      <img src="/images/spark-logo-trademark.png"
+        style="height:100px; width:auto; vertical-align: bottom; margin-top: 
20px;"></a><span class="tagline">
+          Lightning-fast cluster computing
+      </span>
+    </p>
+  
+</div>
+
+<nav class="navbar navbar-default" role="navigation">
+  <!-- Brand and toggle get grouped for better mobile display -->
+  <div class="navbar-header">
+    <button type="button" class="navbar-toggle" data-toggle="collapse"
+            data-target="#navbar-collapse-1">
+      <span class="sr-only">Toggle navigation</span>
+      <span class="icon-bar"></span>
+      <span class="icon-bar"></span>
+      <span class="icon-bar"></span>
+    </button>
+  </div>
+
+  <!-- Collect the nav links, forms, and other content for toggling -->
+  <div class="collapse navbar-collapse" id="navbar-collapse-1">
+    <ul class="nav navbar-nav">
+      <li><a href="/downloads.html">Download</a></li>
+      <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">
+          Libraries <b class="caret"></b>
+        </a>
+        <ul class="dropdown-menu">
+          <li><a href="/sql/">SQL and DataFrames</a></li>
+          <li><a href="/streaming/">Spark Streaming</a></li>
+          <li><a href="/mllib/">MLlib (machine learning)</a></li>
+          <li><a href="/graphx/">GraphX (graph)</a></li>
+          <li class="divider"></li>
+          <li><a href="http://spark-packages.org">Third-Party Packages</a></li>
+        </ul>
+      </li>
+      <li class="dropdown">
+        <a href="#" class="dropdown-toggle" data-toggle="dropdown">
+          Documentation <b class="caret"></b>
+        </a>
+        <ul class="dropdown-menu">
+          <li><a href="/docs/latest/">Latest Release (Spark 1.6.2)</a></li>
+          <li><a href="/documentation.html">Other Resources</a></li>
+        </ul>
+      </li>
+      <li><a href="/examples.html">Examples</a></li>
+      <li class="dropdown">
+        <a href="/community.html" class="dropdown-toggle" 
data-toggle="dropdown">
+          Community <b class="caret"></b>
+        </a>
+        <ul class="dropdown-menu">
+          <li><a href="/community.html">Mailing Lists</a></li>
+          <li><a href="/community.html#events">Events and Meetups</a></li>
+          <li><a href="/community.html#history">Project History</a></li>
+          <li><a 
href="https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark">Powered
 By</a></li>
+          <li><a 
href="https://cwiki.apache.org/confluence/display/SPARK/Committers">Project 
Committers</a></li>
+          <li><a href="https://issues.apache.org/jira/browse/SPARK">Issue 
Tracker</a></li>
+        </ul>
+      </li>
+      <li><a href="/faq.html">FAQ</a></li>
+    </ul>
+    <ul class="nav navbar-nav navbar-right">
+      <li class="dropdown">
+        <a href="http://www.apache.org/" class="dropdown-toggle" 
data-toggle="dropdown">
+          Apache Software Foundation <b class="caret"></b></a>
+        <ul class="dropdown-menu">
+          <li><a href="http://www.apache.org/">Apache Homepage</a></li>
+          <li><a href="http://www.apache.org/licenses/">License</a></li>
+          <li><a 
href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li>
+          <li><a 
href="http://www.apache.org/foundation/thanks.html">Thanks</a></li>
+          <li><a href="http://www.apache.org/security/">Security</a></li>
+        </ul>
+      </li>
+    </ul>
+  </div>
+  <!-- /.navbar-collapse -->
+</nav>
+
+
+<div class="row">
+  <div class="col-md-3 col-md-push-9">
+    <div class="news" style="margin-bottom: 20px;">
+      <h5>Latest News</h5>
+      <ul class="list-unstyled">
+        
+          <li><a href="/news/spark-1-6-2-released.html">Spark 1.6.2 
released</a>
+          <span class="small">(Jun 25, 2016)</span></li>
+        
+          <li><a href="/news/submit-talks-to-spark-summit-eu-2016.html">Call 
for Presentations for Spark Summit EU is Open</a>
+          <span class="small">(Jun 16, 2016)</span></li>
+        
+          <li><a href="/news/spark-2.0.0-preview.html">Preview release of 
Spark 2.0</a>
+          <span class="small">(May 26, 2016)</span></li>
+        
+          <li><a href="/news/spark-summit-june-2016-agenda-posted.html">Spark 
Summit (June 6, 2016, San Francisco) agenda posted</a>
+          <span class="small">(Apr 17, 2016)</span></li>
+        
+      </ul>
+      <p class="small" style="text-align: right;"><a 
href="/news/index.html">Archive</a></p>
+    </div>
+    <div class="hidden-xs hidden-sm">
+      <a href="/downloads.html" class="btn btn-success btn-lg btn-block" 
style="margin-bottom: 30px;">
+        Download Spark
+      </a>
+      <p style="font-size: 16px; font-weight: 500; color: #555;">
+        Built-in Libraries:
+      </p>
+      <ul class="list-none">
+        <li><a href="/sql/">SQL and DataFrames</a></li>
+        <li><a href="/streaming/">Spark Streaming</a></li>
+        <li><a href="/mllib/">MLlib (machine learning)</a></li>
+        <li><a href="/graphx/">GraphX (graph)</a></li>
+      </ul>
+      <a href="http://spark-packages.org">Third-Party Packages</a>
+    </div>
+  </div>
+
+  <div class="col-md-9 col-md-pull-3">
+    <h2>Spark Release 2.0.0</h2>
+
+
+<p>Apache Spark 2.0.0 is the first release on the 2.x line. This release 
includes over 2500 patches from over 300 contributors. Some breaking changes 
have been made with respect to the 1.x line. To download Apache Spark 2.0.0, 
visit the <a href="http://spark.apache.org/downloads.html">downloads</a> 
page.</p>
+
+<p>You can consult JIRA for the <a 
href="https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&amp;version=12329449">detailed
 changes</a>. We have curated a list of high level changes here, grouped by 
major modules.</p>
+
+<ul id="markdown-toc">
+  <li><a href="#core-and-spark-sql">Core and Spark SQL</a>    <ul>
+      <li><a href="#programming-apis">Programming APIs</a></li>
+      <li><a href="#sql">SQL</a></li>
+      <li><a href="#performance">Performance</a></li>
+    </ul>
+  </li>
+  <li><a href="#mllib">MLlib</a>    <ul>
+      <li><a href="#api-changes">API changes</a></li>
+      <li><a href="#new-features">New features</a></li>
+      <li><a href="#speedscaling">Speed/scaling</a></li>
+    </ul>
+  </li>
+  <li><a href="#sparkr">SparkR</a></li>
+  <li><a href="#streaming">Streaming</a></li>
+  <li><a href="#removals-deprecations-and-breaking-changes">Removals, 
Deprecations and Breaking Changes</a>    <ul>
+      <li><a href="#removals">Removals</a></li>
+      <li><a href="#deprecations">Deprecations</a></li>
+      <li><a href="#breaking-changes">Breaking Changes</a></li>
+    </ul>
+  </li>
+  <li><a href="#known-issues">Known Issues</a></li>
+  <li><a href="#credits">Credits</a></li>
+</ul>
+
+<h3 id="core-and-spark-sql">Core and Spark SQL</h3>
+
+<h4 id="programming-apis">Programming APIs</h4>
+
+<p>One of the largest changes in Spark 2.0 is the set of new and updated APIs:</p>
+
+<ul>
+  <li>Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset 
have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In 
Python and R, given the lack of type safety, DataFrame is the main programming 
interface.</li>
+  <li>SparkSession: new entry point that replaces the old SQLContext and 
HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept 
for backward compatibility.</li>
+  <li>A new, streamlined configuration API for SparkSession</li>
+  <li>Simpler, more performant accumulator API</li>
+</ul>
+
+<h4 id="sql">SQL</h4>
+
+<p>Spark 2.0 substantially improved SQL functionality with SQL:2003 support. 
Spark SQL can now run all 99 TPC-DS queries. More prominently, we have 
improved:</p>
+
+<ul>
+  <li>Subquery support, including
+    <ul>
+      <li>Uncorrelated scalar subqueries</li>
+      <li>Correlated scalar subqueries</li>
+      <li>NOT IN predicate subqueries (in WHERE/HAVING clauses)</li>
+      <li>IN predicate subqueries (in WHERE/HAVING clauses)</li>
+      <li>(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)</li>
+    </ul>
+  </li>
+  <li>View canonicalization support</li>
+</ul>
+
+<p>In addition, when building without Hive support, Spark SQL should have 
almost all of the functionality available when building with Hive support, with 
the exception of Hive connectivity, Hive UDFs, and script transforms.</p>
+
+<h4 id="performance">Performance</h4>
+
+<ul>
+  <li>Substantial (2 - 10X) performance speedups for common operators in SQL 
and DataFrames via a new technique called whole stage code generation.</li>
+  <li>Improved Parquet scan throughput through vectorization</li>
+  <li>Improved ORC performance</li>
+  <li>Many improvements in the Catalyst query optimizer for common 
workloads</li>
+  <li>Improved window function performance via native implementations for all 
window functions</li>
+</ul>
+
+<h3 id="mllib">MLlib</h3>
+<p>The DataFrame-based API is now the primary API. The RDD-based API is 
entering maintenance mode. See the MLlib guide for details.</p>
+
+<h4 id="api-changes">API changes</h4>
+<p>The largest API change is in linear algebra.  The DataFrame-based API 
(spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather 
than in spark.mllib.linalg.  This removes the last dependencies of spark.ml.* 
on spark.mllib.*.  (SPARK-13944)
+See the MLlib migration guide for a full list of API changes.</p>
+
+<h4 id="new-features">New features</h4>
+
+<ul>
+  <li>ML persistence: The DataFrames-based API provides near-complete support 
for saving and loading ML models and Pipelines in Scala, Java, Python, and R.  
See this blog post for details.  (SPARK-6725, SPARK-11939, SPARK-14311)</li>
+  <li>MLlib in R: SparkR now offers MLlib APIs for generalized linear models, 
naive Bayes, k-means clustering, and survival regression.  See this talk to 
learn more.</li>
+  <li>Python: PySpark now offers many more MLlib algorithms, including LDA, 
Gaussian Mixture Model, Generalized Linear Regression, and more.</li>
+  <li>Algorithms added to DataFrames-based API: Bisecting K-Means clustering, 
Gaussian Mixture Model, MaxAbsScaler feature transformer.</li>
+</ul>
+
+<p>This talk lists many of these new features.</p>
+
+<h4 id="speedscaling">Speed/scaling</h4>
+<p>Vectors and Matrices stored in DataFrames now use much more efficient 
serialization, reducing overhead in calling MLlib algorithms. (SPARK-14850)</p>
+
+<h3 id="sparkr">SparkR</h3>
+
+<p>The largest improvement to SparkR in Spark 2.0 is user-defined functions. 
There are three user-defined functions: dapply, gapply, and lapply. The first 
two, dapply and gapply, can be used for partition-based UDFs, e.g. 
partitioned model learning. The last, lapply, can be used for hyper-parameter 
tuning.</p>
+
+<p>In addition, there are a number of new features:</p>
+
+<ul>
+  <li>Improved algorithm coverage for machine learning in R, including naive 
Bayes, k-means clustering, and survival regression.</li>
+  <li>Generalized linear models support more families and link functions.</li>
+  <li>Save and load for all ML models.</li>
+  <li>More DataFrame functionality: Window functions API, reader, writer 
support for JDBC, CSV, SparkSession</li>
+</ul>
+
+<h3 id="streaming">Streaming</h3>
+
+<p>Spark 2.0 ships the initial experimental release for Structured Streaming, 
a high level streaming API built on top of Spark SQL and the Catalyst 
optimizer. Structured Streaming enables users to program against streaming 
sources and sinks using the same DataFrame/Dataset API as in static data 
sources, leveraging the Catalyst optimizer to automatically incrementalize the 
query plans.</p>
+
+<p>For the DStream API, the most prominent update is the new experimental 
support for Kafka 0.10.</p>
+
+<h3 id="removals-deprecations-and-breaking-changes">Removals, Deprecations and 
Breaking Changes</h3>
+
+<h4 id="removals">Removals</h4>
+<p>The following features have been removed in Spark 2.0:</p>
+
+<ul>
+  <li>Bagel</li>
+  <li>Support for Hadoop 2.1 and earlier</li>
+  <li>The ability to configure closure serializer</li>
+  <li>HTTPBroadcast</li>
+  <li>TTL-based metadata cleaning</li>
+  <li>Semi-private class org.apache.spark.Logging. We suggest you use slf4j 
directly.</li>
+  <li>SparkContext.metricsSystem</li>
+  <li>Block-oriented integration with Tachyon (subsumed by file system 
integration)</li>
+  <li>Methods deprecated in Spark 1.x</li>
+  <li>Methods on Python DataFrame that returned RDDs (map, flatMap, 
mapPartitions, etc). They are still available in dataframe.rdd field, e.g. 
dataframe.rdd.map.</li>
+  <li>Less frequently used streaming connectors, including Twitter, Akka, 
MQTT, ZeroMQ</li>
+  <li>Hash-based shuffle manager</li>
+  <li>History serving functionality from standalone Master</li>
+  <li>For Java and Scala, DataFrame no longer exists as a class. As a result, 
data sources would need to be updated.</li>
+</ul>
+
+<h4 id="deprecations">Deprecations</h4>
+<p>The following features have been deprecated in Spark 2.0, and might be 
removed in future versions of Spark 2.x:</p>
+
+<ul>
+  <li>Fine-grained mode in Apache Mesos</li>
+  <li>Support for Java 7</li>
+  <li>Support for Python 2.6</li>
+</ul>
+
+<h4 id="breaking-changes">Breaking Changes</h4>
+<p>The following changes might require updating existing applications that 
depend on the old behavior or API.</p>
+
+<ul>
+  <li>The default build is now using Scala 2.11 rather than Scala 2.10.</li>
+  <li>In SQL, floating literals are now parsed as decimal data type rather 
than double data type.</li>
+  <li>Kryo version is bumped to 3.0.</li>
+  <li>Java RDD’s flatMap and mapPartitions functions used to require 
functions returning Java Iterable. They have been updated to require functions 
returning Java Iterator so the functions do not need to materialize all the 
data.</li>
+  <li>Java RDD’s countByKey and countApproxDistinctByKey now return a map 
from K to java.lang.Long, rather than to java.lang.Object.</li>
+  <li>When writing Parquet files, the summary files are not written by 
default. To re-enable them, users must set “parquet.enable.summary-metadata” 
to true.</li>
+</ul>
+
+<p>For a more complete list, please see <a 
href="https://issues.apache.org/jira/browse/SPARK-11806">SPARK-11806</a> for 
deprecations and removals.</p>
+
+<h3 id="known-issues">Known Issues</h3>
+
+<ul>
+  <li>Lead and Lag&#8217;s behaviors have been changed from respecting nulls 
(the 1.6 behavior) to ignoring nulls. This behavioral change will be fixed in 
2.0.1 (SPARK-16721).</li>
+  <li>Lead and Lag functions using constant input values do not return the 
default value when the offset row does not exist (SPARK-16633).</li>
+</ul>
+
+<h3 id="credits">Credits</h3>
+<p>Last but not least, this release would not have been possible without the 
following contributors: Aaron Tokhy, Abhinav Gupta, Abou Haydar Elias, Adam 
Budde, Adam Roberts, Ahmed Kamal, Ahmed Mahran, Alex Bozarth, Alexander Ulanov, 
Allen, Anatoliy Plastinin, Andrew, Andrew Ash, Andrew Or, Andrew Ray, Anthony 
Truchet, Antonio Murgia, Arun Allamsetty, Azeem Jiva, Ben McCann, BenFradet, 
Bertrand Bossy, Bill Chambers, Bjorn Jonsson, Bo Meng, Brandon Bradley, Brian 
O&#8217;Neill, BrianLondon, Bryan Cutler, Burak Köse, Burak Yavuz, Carson 
Wang, Cazen, Charles Allen, Cheng Hao, Cheng Lian, Claes Redestad, CodingCat, 
DB Tsai, DLucky, Daniel Jalova, Daoyuan Wang, Darek Blasiak, David Tolpin, 
Davies Liu, Devaraj K, Dhruve Ashar, Dilip Biswal, Dmitry Erastov, Dominik 
Jastrzębski, Dongjoon Hyun, Earthson Lu, Egor Pakhomov, Ehsan M.Kermani, Ergin 
Seyfe, Eric Liang, Ernest, Felix Cheung, Feynman Liang, Fokko Driesprong, 
Franklyn D&#8217;souza, François Garillot, Gabriele Nizzoli, Gary King, 
GayathriMurali, Gio Borje, Grace, Grzegorz Chilkiewicz, Guillaume Poulin, 
Gábor Lipták, Hemant Bhanawat, Herman van Hovell, Herman van Hövell tot 
Westerflier, Hiroshi Inoue, Holden Karau, Hossein, Huaxin Gao, Imran Rashid, 
Imran Younus, Ioana Delaney, Iulian Dragos, Jacek Laskowski, Jacek Lewandowski, 
Jakob Odersky, James Lohse, James Thomas, Jason Lee, Jason Moore, Jason White, 
Jean-Baptiste Onofré, Jeff L, Jeff Zhang, Jeremy Derr, JeremyNixon, Jo 
Voordeckers, Joan, Jon Maurer, Joseph K. Bradley, Josh Howes, Josh Rosen, 
Joshi, Juarez Bochi, Julien Baley, Junyang, Junyang Qian, Jurriaan Pruis, Kai 
Jiang, KaiXinXiaoLei, Kay Ousterhout, Kazuaki Ishizaki, Kevin Yu, Koert 
Kuipers, Kousuke Saruta, Koyo Yoshida, Krishna Kalyan, Lewuathe, Liang-Chi 
Hsieh, Lianhui Wang, Lin Zhao, Lining Sun, Liu Xiang, Liwei Lin, Luc Bourlier, 
Luciano Resende, Lukasz, Maciej Brynski, Malte, Marcelo Vanzin, Marcin Tustin, 
Mark Grover, Martin Menestret, Masayoshi TSUZUKI, Matei Zaharia, Matthew Wise, 
Michael Allman, Michael Armbrust, Michael Gummelt, Michel Lemay, Mike Dusenberry, 
Mortada Mehyar, Nakul Jindal, Nam Pham, Narine Kokhlikyan, NarineK, Neelesh 
Srinivas Salian, Nezih Yigitbasi, Nicholas Chammas, Nicholas Tietz, Nick 
Pentreath, Nilanjan Raychaudhuri, Nirman Narang, Nishkam Ravi, Nong, Nong Li, 
Oleg Danilov, Oliver Pierson, Oscar D. Lara Yejas, Parth Brahmbhatt, Patrick 
Wendell, Pete Robbins, Peter Ableda, Prajwal Tuladhar, Prashant Sharma, Pravin 
Gadakh, QiangCai, Qifan Pu, Raafat Akkad, Rahul Tanwani, Rajesh Balamohan, 
Rekha Joshi, Reynold Xin, Richard W. Eggert II, Robert Dodier, Robert 
Kruszewski, Robin East, Ruifeng Zheng, Ryan Blue, Sameer Agarwal, Sandeep 
Singh, Sanket, Sasaki Toru, Sean Owen, Sean Zhong, Sebastien Rainville, 
Sebastián Ramírez, Sela, Sergiusz Urbaniak, Shally Sangal, Sheamus K. Parkes, 
Shivaram Venkataraman, Shixiong Zhu, Shuai Lin, Shubhanshu Mishra, Sital Kedia, 
Stavros Kontopoulos, Stephan Kessler, Steve Loughran, Subhobrata Dey, Subroto 
Sanyal, 
 Sumedh Mungee, Sun Rui, Sunitha Kambhampati, Takahashi Hiroshi, Takeshi 
YAMAMURO, Takuya Kuwahara, Takuya UESHIN, Tathagata Das, Tejas Patil, Terence 
Yim, Thomas Graves, Timothy Chen, Timothy Hunter, Tom Graves, Tom Magrino, 
Tommy YU, Travis Crawford, Tristan Reid, Victor Chima, Villu Ruusmann, Wayne 
Song, WeichenXu, Weiqing Yang, Wenchen Fan, Wesley Tang, Wilson Wu, Wojciech 
Jurczyk, Xiangrui Meng, Xin Ren, Xin Wu, Xinh Huynh, Xiu Guo, Xusen Yin, Yadong 
Qi, Yanbo Liang, Yash Datta, Yin Huai, Yonathan Randolph, Yong Gang Cao, Yong 
Tang, Yu ISHIKAWA, Yucai Yu, Yuhao Yang, Yury Liavitski, Zhang, Liye, Zheng 
RuiFeng, Zheng Tan, aokolnychyi, bomeng, catapan, cody koeninger, dding3, 
depend, echo2mei, felixcheung, frreiss, fwang1, gatorsmile, guoxu1231, 
huangzhaowei, hushan, hyukjinkwon, jayadevanmurali, jeanlyn, jerryshao, 
jliwork, junhao, kaklakariada, krishnakalyan3, lfzCarlosC, lgieron, mark800, 
mathieu longtin, mcheah, meiyoula, movelikeriver, mwws, nfraison, oraviv, 
peng.zhang, petermaxlee, pierre-borckmans, poolis, prabs, proflin, pshearer, 
rotems, sachin 
aggarwal, sandy, scwf, seddonm1, sethah, sharkd, shijinkui, sureshthalamati, 
tedyu, thomastechs, tmnd1991, vijaykiran, wangfei, wangyang, wm...@hotmail.com, 
wujian, xin Wu, yzhou2001, zero323, zhonghaihua, zhuol, zlpmichelle, Örjan 
Lundberg, Yang Bo.</p>
+
+
+<p>
+<br/>
+<a href="/news/">Spark News Archive</a>
+</p>
+
+  </div>
+</div>
+
+
+
+<footer class="small">
+  <hr>
+  Apache Spark, Spark, Apache, and the Spark logo are <a 
href="https://www.apache.org/foundation/marks/">trademarks</a> of
+  <a href="http://www.apache.org">The Apache Software Foundation</a>.
+</footer>
+
+</div>
+
+</body>
+</html>

