[SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide

## What changes were proposed in this pull request?

Made the DataFrame-based API primary:
* Spark doc menu bar and other places now link to ml-guide.html, not 
mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top 
redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based 
API
  * **Reviewers: please check this carefully**
* (minor) Titles for the DataFrame-based API no longer include a "- spark.ml" suffix; 
titles for the RDD-based API now have a "- RDD-based API" suffix
* Moved the migration guide from mllib-guide to ml-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, 
with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: 
overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into 
ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some 
intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to a new 
section within the page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <jos...@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5ffd5d38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5ffd5d38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5ffd5d38

Branch: refs/heads/master
Commit: 5ffd5d3838da40ad408a6f40071fe6f4dcacf2a1
Parents: 71ad945
Author: Joseph K. Bradley <jos...@databricks.com>
Authored: Fri Jul 15 13:38:23 2016 -0700
Committer: Joseph K. Bradley <jos...@databricks.com>
Committed: Fri Jul 15 13:38:23 2016 -0700

----------------------------------------------------------------------
 docs/_data/menu-ml.yaml                 |   6 +-
 docs/_includes/nav-left-wrapper-ml.html |   4 +-
 docs/_layouts/global.html               |   2 +-
 docs/index.md                           |   4 +-
 docs/ml-advanced.md                     |   4 +-
 docs/ml-ann.md                          |   4 +-
 docs/ml-classification-regression.md    |  60 ++--
 docs/ml-clustering.md                   |   8 +-
 docs/ml-collaborative-filtering.md      |   4 +-
 docs/ml-decision-tree.md                |   4 +-
 docs/ml-ensembles.md                    |   4 +-
 docs/ml-features.md                     |   4 +-
 docs/ml-guide.md                        | 461 ++++++++++-----------------
 docs/ml-linear-methods.md               |   4 +-
 docs/ml-migration-guides.md             | 159 +++++++++
 docs/ml-pipeline.md                     | 245 ++++++++++++++
 docs/ml-survival-regression.md          |   4 +-
 docs/ml-tuning.md                       | 121 +++++++
 docs/mllib-classification-regression.md |   4 +-
 docs/mllib-clustering.md                |   4 +-
 docs/mllib-collaborative-filtering.md   |   4 +-
 docs/mllib-data-types.md                |   4 +-
 docs/mllib-decision-tree.md             |   4 +-
 docs/mllib-dimensionality-reduction.md  |   4 +-
 docs/mllib-ensembles.md                 |   4 +-
 docs/mllib-evaluation-metrics.md        |   4 +-
 docs/mllib-feature-extraction.md        |   4 +-
 docs/mllib-frequent-pattern-mining.md   |   4 +-
 docs/mllib-guide.md                     | 219 +------------
 docs/mllib-isotonic-regression.md       |   4 +-
 docs/mllib-linear-methods.md            |   4 +-
 docs/mllib-migration-guides.md          | 158 +--------
 docs/mllib-naive-bayes.md               |   4 +-
 docs/mllib-optimization.md              |   4 +-
 docs/mllib-pmml-model-export.md         |   4 +-
 docs/mllib-statistics.md                |   4 +-
 docs/programming-guide.md               |   2 +-
 docs/streaming-programming-guide.md     |   4 +-
 python/pyspark/ml/__init__.py           |   4 +-
 python/pyspark/ml/tests.py              |   2 +-
 python/pyspark/mllib/__init__.py        |   5 +-
 41 files changed, 814 insertions(+), 746 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/_data/menu-ml.yaml
----------------------------------------------------------------------
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index 3fd3ee2..0c6b9b2 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -1,5 +1,5 @@
-- text: "Overview: estimators, transformers and pipelines"
-  url: ml-guide.html
+- text: Pipelines
+  url: ml-pipeline.html
 - text: Extracting, transforming and selecting features
   url: ml-features.html
 - text: Classification and Regression
@@ -8,5 +8,7 @@
   url: ml-clustering.html
 - text: Collaborative filtering
   url: ml-collaborative-filtering.html
+- text: Model selection and tuning
+  url: ml-tuning.html
 - text: Advanced topics
   url: ml-advanced.html

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/_includes/nav-left-wrapper-ml.html
----------------------------------------------------------------------
diff --git a/docs/_includes/nav-left-wrapper-ml.html 
b/docs/_includes/nav-left-wrapper-ml.html
index e2d7eda..00ac6cc 100644
--- a/docs/_includes/nav-left-wrapper-ml.html
+++ b/docs/_includes/nav-left-wrapper-ml.html
@@ -1,8 +1,8 @@
 <div class="left-menu-wrapper">
     <div class="left-menu">
-        <h3><a href="ml-guide.html">spark.ml package</a></h3>
+        <h3><a href="ml-guide.html">MLlib: Main Guide</a></h3>
         {% include nav-left.html nav=include.nav-ml %}
-        <h3><a href="mllib-guide.html">spark.mllib package</a></h3>
+        <h3><a href="mllib-guide.html">MLlib: RDD-based API Guide</a></h3>
         {% include nav-left.html nav=include.nav-mllib %}
     </div>
 </div>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/_layouts/global.html
----------------------------------------------------------------------
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 2d0c3fd..d3bf082 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -74,7 +74,7 @@
                                 <li><a 
href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a 
href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
                                 <li><a 
href="structured-streaming-programming-guide.html">Structured Streaming</a></li>
-                                <li><a href="mllib-guide.html">MLlib (Machine 
Learning)</a></li>
+                                <li><a href="ml-guide.html">MLlib (Machine 
Learning)</a></li>
                                 <li><a 
href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
                                 <li><a href="sparkr.html">SparkR (R on 
Spark)</a></li>
                             </ul>

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/index.md
----------------------------------------------------------------------
diff --git a/docs/index.md b/docs/index.md
index 7157afc..0cb8803 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -8,7 +8,7 @@ description: Apache Spark SPARK_VERSION_SHORT documentation 
homepage
 Apache Spark is a fast and general-purpose cluster computing system.
 It provides high-level APIs in Java, Scala, Python and R,
 and an optimized engine that supports general execution graphs.
-It also supports a rich set of higher-level tools including [Spark 
SQL](sql-programming-guide.html) for SQL and structured data processing, 
[MLlib](mllib-guide.html) for machine learning, 
[GraphX](graphx-programming-guide.html) for graph processing, and [Spark 
Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Spark 
SQL](sql-programming-guide.html) for SQL and structured data processing, 
[MLlib](ml-guide.html) for machine learning, 
[GraphX](graphx-programming-guide.html) for graph processing, and [Spark 
Streaming](streaming-programming-guide.html).
 
 # Downloading
 
@@ -87,7 +87,7 @@ options for deployment:
 * Modules built on Spark:
   * [Spark Streaming](streaming-programming-guide.html): processing real-time 
data streams
   * [Spark SQL, Datasets, and DataFrames](sql-programming-guide.html): support 
for structured data and relational queries
-  * [MLlib](mllib-guide.html): built-in machine learning library
+  * [MLlib](ml-guide.html): built-in machine learning library
   * [GraphX](graphx-programming-guide.html): Spark's new API for graph 
processing
 
 **API Docs:**

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-advanced.md
----------------------------------------------------------------------
diff --git a/docs/ml-advanced.md b/docs/ml-advanced.md
index 1c5f844..f5804fd 100644
--- a/docs/ml-advanced.md
+++ b/docs/ml-advanced.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Advanced topics - spark.ml
-displayTitle: Advanced topics - spark.ml
+title: Advanced topics
+displayTitle: Advanced topics
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-ann.md
----------------------------------------------------------------------
diff --git a/docs/ml-ann.md b/docs/ml-ann.md
index c2d9bd2..7c460c4 100644
--- a/docs/ml-ann.md
+++ b/docs/ml-ann.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Multilayer perceptron classifier - spark.ml
-displayTitle: Multilayer perceptron classifier - spark.ml
+title: Multilayer perceptron classifier
+displayTitle: Multilayer perceptron classifier
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-classification-regression.md 
b/docs/ml-classification-regression.md
index 3d6106b..7c2437e 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Classification and regression - spark.ml
-displayTitle: Classification and regression - spark.ml
+title: Classification and regression
+displayTitle: Classification and regression
 ---
 
 
@@ -22,37 +22,14 @@ displayTitle: Classification and regression - spark.ml
 \newcommand{\zero}{\mathbf{0}}
 \]`
 
+This page covers algorithms for classification and regression.  It also 
includes sections
+discussing specific classes of algorithms, such as linear methods, trees, and 
ensembles.
+
 **Table of Contents**
 
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
-In `spark.ml`, we implement popular linear methods such as logistic
-regression and linear least squares with $L_1$ or $L_2$ regularization.
-Refer to [the linear methods in mllib](mllib-linear-methods.html) for
-details about implementation and tuning.  We also include a DataFrame API for 
[Elastic
-net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
-of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
-and variable selection via the elastic
-net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
-Mathematically, it is defined as a convex combination of the $L_1$ and
-the $L_2$ regularization terms:
-`\[
-\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( 
\frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
-\]`
-By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
-regularization as special cases. For example, if a [linear
-regression](https://en.wikipedia.org/wiki/Linear_regression) model is
-trained with the elastic net parameter $\alpha$ set to $1$, it is
-equivalent to a
-[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
-On the other hand, if $\alpha$ is set to $0$, the trained model reduces
-to a [ridge
-regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
-We implement Pipelines API for both linear regression and logistic
-regression with elastic net regularization.
-
-
 # Classification
 
 ## Logistic regression
@@ -760,7 +737,34 @@ Refer to the [`IsotonicRegression` Python 
docs](api/python/pyspark.ml.html#pyspa
 </div>
 </div>
 
+# Linear methods
+
+We implement popular linear methods such as logistic
+regression and linear least squares with $L_1$ or $L_2$ regularization.
+Refer to [the linear methods guide for the RDD-based 
API](mllib-linear-methods.html) for
+details about implementation and tuning; this information is still relevant.
 
+We also include a DataFrame API for [Elastic
+net](http://en.wikipedia.org/wiki/Elastic_net_regularization), a hybrid
+of $L_1$ and $L_2$ regularization proposed in [Zou et al, Regularization
+and variable selection via the elastic
+net](http://users.stat.umn.edu/~zouxx019/Papers/elasticnet.pdf).
+Mathematically, it is defined as a convex combination of the $L_1$ and
+the $L_2$ regularization terms:
+`\[
+\alpha \left( \lambda \|\wv\|_1 \right) + (1-\alpha) \left( 
\frac{\lambda}{2}\|\wv\|_2^2 \right) , \alpha \in [0, 1], \lambda \geq 0
+\]`
+By setting $\alpha$ properly, elastic net contains both $L_1$ and $L_2$
+regularization as special cases. For example, if a [linear
+regression](https://en.wikipedia.org/wiki/Linear_regression) model is
+trained with the elastic net parameter $\alpha$ set to $1$, it is
+equivalent to a
+[Lasso](http://en.wikipedia.org/wiki/Least_squares#Lasso_method) model.
+On the other hand, if $\alpha$ is set to $0$, the trained model reduces
+to a [ridge
+regression](http://en.wikipedia.org/wiki/Tikhonov_regularization) model.
+We implement the Pipelines API for both linear regression and logistic
+regression with elastic net regularization.
 
 # Decision trees
 

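As context for the elastic net text moved above: the penalty is a convex combination of the $L_1$ and $L_2$ terms, which can be checked numerically. A minimal, framework-free sketch (pure Python, illustrative names; not Spark code):

```python
def elastic_net_penalty(w, alpha, lam):
    """alpha * (lam * ||w||_1) + (1 - alpha) * (lam/2 * ||w||_2^2),
    with alpha in [0, 1] and lam >= 0, as in the moved section."""
    l1 = sum(abs(x) for x in w)        # ||w||_1
    l2_sq = sum(x * x for x in w)      # ||w||_2^2
    return alpha * (lam * l1) + (1 - alpha) * (lam / 2) * l2_sq

w = [3.0, -4.0]                        # ||w||_1 = 7, ||w||_2^2 = 25
print(elastic_net_penalty(w, alpha=1.0, lam=0.1))  # pure L1 (Lasso): 0.7
print(elastic_net_penalty(w, alpha=0.0, lam=0.1))  # pure L2 (ridge): 1.25
```

The two extreme settings of `alpha` recover the Lasso and ridge special cases discussed in the guide.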
http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-clustering.md
----------------------------------------------------------------------
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 8656eb4..8a0a61c 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -1,10 +1,12 @@
 ---
 layout: global
-title: Clustering - spark.ml
-displayTitle: Clustering - spark.ml
+title: Clustering
+displayTitle: Clustering
 ---
 
-In this section, we introduce the pipeline API for [clustering in 
mllib](mllib-clustering.html).
+This page describes clustering algorithms in MLlib.
+The [guide for clustering in the RDD-based API](mllib-clustering.html) also 
has relevant information
+about these algorithms.
 
 **Table of Contents**
 

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/ml-collaborative-filtering.md 
b/docs/ml-collaborative-filtering.md
index 8bd75f3..1d02d69 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - spark.ml
-displayTitle: Collaborative Filtering - spark.ml
+title: Collaborative Filtering
+displayTitle: Collaborative Filtering
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/ml-decision-tree.md b/docs/ml-decision-tree.md
index a721d55..5e1eeb9 100644
--- a/docs/ml-decision-tree.md
+++ b/docs/ml-decision-tree.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision trees - spark.ml
-displayTitle: Decision trees - spark.ml
+title: Decision trees
+displayTitle: Decision trees
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/ml-ensembles.md b/docs/ml-ensembles.md
index 303773e..97f1bdc 100644
--- a/docs/ml-ensembles.md
+++ b/docs/ml-ensembles.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Tree ensemble methods - spark.ml
-displayTitle: Tree ensemble methods - spark.ml
+title: Tree ensemble methods
+displayTitle: Tree ensemble methods
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 88fd291..e7d7ddf 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Extracting, transforming and selecting features - spark.ml
-displayTitle: Extracting, transforming and selecting features - spark.ml
+title: Extracting, transforming and selecting features
+displayTitle: Extracting, transforming and selecting features
 ---
 
 This section covers algorithms for working with features, roughly divided into 
these groups:

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-guide.md
----------------------------------------------------------------------
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index dae86d8..5abec63 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -1,323 +1,214 @@
 ---
 layout: global
-title: "Overview: estimators, transformers and pipelines - spark.ml"
-displayTitle: "Overview: estimators, transformers and pipelines - spark.ml"
+title: "MLlib: Main Guide"
+displayTitle: "Machine Learning Library (MLlib) Guide"
 ---
 
+MLlib is Spark's machine learning (ML) library.
+Its goal is to make practical machine learning scalable and easy.
+At a high level, it provides tools such as:
 
-`\[
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\E}{\mathbb{E}}
-\newcommand{\x}{\mathbf{x}}
-\newcommand{\y}{\mathbf{y}}
-\newcommand{\wv}{\mathbf{w}}
-\newcommand{\av}{\mathbf{\alpha}}
-\newcommand{\bv}{\mathbf{b}}
-\newcommand{\N}{\mathbb{N}}
-\newcommand{\id}{\mathbf{I}}
-\newcommand{\ind}{\mathbf{1}}
-\newcommand{\0}{\mathbf{0}}
-\newcommand{\unit}{\mathbf{e}}
-\newcommand{\one}{\mathbf{1}}
-\newcommand{\zero}{\mathbf{0}}
-\]`
+* ML Algorithms: common learning algorithms such as classification, 
regression, clustering, and collaborative filtering
+* Featurization: feature extraction, transformation, dimensionality reduction, 
and selection
+* Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
+* Persistence: saving and loading algorithms, models, and Pipelines
+* Utilities: linear algebra, statistics, data handling, etc.
 
+# Announcement: DataFrame-based API is primary API
 
-The `spark.ml` package aims to provide a uniform set of high-level APIs built 
on top of
-[DataFrames](sql-programming-guide.html#dataframes) that help users create and 
tune practical
-machine learning pipelines.
-See the [algorithm guides](#algorithm-guides) section below for guides on 
sub-packages of
-`spark.ml`, including feature transformers unique to the Pipelines API, 
ensembles, and more.
+**The MLlib RDD-based API is now in maintenance mode.**
 
-**Table of contents**
+As of Spark 2.0, the 
[RDD](programming-guide.html#resilient-distributed-datasets-rdds)-based APIs in 
the `spark.mllib` package have entered maintenance mode.
+The primary Machine Learning API for Spark is now the 
[DataFrame](sql-programming-guide.html)-based API in the `spark.ml` package.
 
-* This will become a table of contents (this text will be scraped).
-{:toc}
+*What are the implications?*
 
+* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
+* MLlib will not add new features to the RDD-based API.
+* In the Spark 2.x releases, MLlib will add features to the DataFrames-based 
API to reach feature parity with the RDD-based API.
+* After reaching feature parity (roughly estimated for Spark 2.2), the 
RDD-based API will be deprecated.
+* The RDD-based API is expected to be removed in Spark 3.0.
 
-# Main concepts in Pipelines
+*Why is MLlib switching to the DataFrame-based API?*
 
-Spark ML standardizes APIs for machine learning algorithms to make it easier 
to combine multiple
-algorithms into a single pipeline, or workflow.
-This section covers the key concepts introduced by the Spark ML API, where the 
pipeline concept is
-mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+* DataFrames provide a more user-friendly API than RDDs.  The many benefits of 
DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and 
Catalyst optimizations, and uniform APIs across languages.
+* The DataFrame-based API for MLlib provides a uniform API across ML 
algorithms and across multiple languages.
+* DataFrames facilitate practical ML Pipelines, particularly feature 
transformations.  See the [Pipelines guide](ml-pipeline.html) for details.
 
-* **[`DataFrame`](ml-guide.html#dataframe)**: Spark ML uses `DataFrame` from 
Spark SQL as an ML
-  dataset, which can hold a variety of data types.
-  E.g., a `DataFrame` could have different columns storing text, feature 
vectors, true labels, and predictions.
+# Dependencies
 
-* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an 
algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms a `DataFrame` with 
features into a `DataFrame` with predictions.
+MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), 
which depends on
+[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical 
processing.
+If native libraries[^1] are not available at runtime, you will see a warning 
message and a pure JVM
+implementation will be used instead.
 
-* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm 
which can be fit on a `DataFrame` to produce a `Transformer`.
-E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and 
produces a model.
+Due to licensing issues with runtime proprietary binaries, we do not include 
`netlib-java`'s native
+proxies by default.
+To configure `netlib-java` / Breeze to use system optimised binaries, include
+`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as 
a dependency of your
+project and read the [netlib-java](https://github.com/fommil/netlib-java) 
documentation for your
+platform's additional installation instructions.
 
-* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple 
`Transformer`s and `Estimator`s together to specify an ML workflow.
+To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 
1.4 or newer.
 
-* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and 
`Estimator`s now share a common API for specifying parameters.
+[^1]: To learn more about the benefits and background of system optimised 
natives, you may wish to
+    watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in 
Scala](http://fommil.github.io/scalax14/#/).
 
-## DataFrame
+# Migration guide
 
-Machine learning can be applied to a wide variety of data types, such as 
vectors, text, images, and structured data.
-Spark ML adopts the `DataFrame` from Spark SQL in order to support a variety 
of data types.
+MLlib is under active development.
+The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
+and the migration guide below will explain all changes between releases.
 
-`DataFrame` supports many basic and structured types; see the [Spark SQL 
datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) 
for a list of supported types.
-In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML 
[`Vector`](mllib-data-types.html#local-vector) types.
+## From 1.6 to 2.0
 
-A `DataFrame` can be created either implicitly or explicitly from a regular 
`RDD`.  See the code examples below and the [Spark SQL programming 
guide](sql-programming-guide.html) for examples.
+### Breaking changes
 
-Columns in a `DataFrame` are named.  The code examples below use names such as 
"text," "features," and "label."
+There were several breaking changes in Spark 2.0, which are outlined below.
 
-## Pipeline components
+**Linear algebra classes for DataFrame-based APIs**
 
-### Transformers
+Spark's linear algebra dependencies were moved to a new project, `mllib-local` 
+(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). 
+As part of this change, the linear algebra classes were copied to a new 
package, `spark.ml.linalg`. 
+The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` 
classes, 
+leading to a few breaking changes, predominantly in various model classes 
+(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a 
full list).
 
-A `Transformer` is an abstraction that includes feature transformers and 
learned models.
-Technically, a `Transformer` implements a method `transform()`, which converts 
one `DataFrame` into
-another, generally by appending one or more columns.
-For example:
+**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the 
previous package `spark.mllib.linalg`.
 
-* A feature transformer might take a `DataFrame`, read a column (e.g., text), 
map it into a new
-  column (e.g., feature vectors), and output a new `DataFrame` with the mapped 
column appended.
-* A learning model might take a `DataFrame`, read the column containing 
feature vectors, predict the
-  label for each feature vector, and output a new `DataFrame` with predicted 
labels appended as a
-  column.
+_Converting vectors and matrices_
 
-### Estimators
+While most pipeline components support backward compatibility for loading, 
+some existing `DataFrames` and pipelines from Spark versions prior to 2.0 that 
contain vector or matrix 
+columns may need to be migrated to the new `spark.ml` vector and matrix 
types. 
+Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to 
`spark.ml.linalg` types
+(and vice versa) can be found in `spark.mllib.util.MLUtils`.
 
-An `Estimator` abstracts the concept of a learning algorithm or any algorithm 
that fits or trains on
-data.
-Technically, an `Estimator` implements a method `fit()`, which accepts a 
`DataFrame` and produces a
-`Model`, which is a `Transformer`.
-For example, a learning algorithm such as `LogisticRegression` is an 
`Estimator`, and calling
-`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a 
`Transformer`.
-
-### Properties of pipeline components
-
-`Transformer.transform()`s and `Estimator.fit()`s are both stateless.  In the 
future, stateful algorithms may be supported via alternative concepts.
-
-Each instance of a `Transformer` or `Estimator` has a unique ID, which is 
useful in specifying parameters (discussed below).
-
-## Pipeline
-
-In machine learning, it is common to run a sequence of algorithms to process 
and learn from data.
-E.g., a simple text document processing workflow might include several stages:
-
-* Split each document's text into words.
-* Convert each document's words into a numerical feature vector.
-* Learn a prediction model using the feature vectors and labels.
-
-Spark ML represents such a workflow as a `Pipeline`, which consists of a 
sequence of
-`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific 
order.
-We will use this simple workflow as a running example in this section.
-
-### How it works
-
-A `Pipeline` is specified as a sequence of stages, and each stage is either a 
`Transformer` or an `Estimator`.
-These stages are run in order, and the input `DataFrame` is transformed as it 
passes through each stage.
-For `Transformer` stages, the `transform()` method is called on the 
`DataFrame`.
-For `Estimator` stages, the `fit()` method is called to produce a 
`Transformer` (which becomes part of the `PipelineModel`, or fitted 
`Pipeline`), and that `Transformer`'s `transform()` method is called on the 
`DataFrame`.
-
-We illustrate this for the simple text document workflow.  The figure below is 
for the *training time* usage of a `Pipeline`.
-
-<p style="text-align: center;">
-  <img
-    src="img/ml-Pipeline.png"
-    title="Spark ML Pipeline Example"
-    alt="Spark ML Pipeline Example"
-    width="80%"
-  />
-</p>
-
-Above, the top row represents a `Pipeline` with three stages.
-The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the 
third (`LogisticRegression`) is an `Estimator` (red).
-The bottom row represents data flowing through the pipeline, where cylinders 
indicate `DataFrame`s.
-The `Pipeline.fit()` method is called on the original `DataFrame`, which has 
raw text documents and labels.
-The `Tokenizer.transform()` method splits the raw text documents into words, 
adding a new column with words to the `DataFrame`.
-The `HashingTF.transform()` method converts the words column into feature 
vectors, adding a new column with those vectors to the `DataFrame`.
-Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls 
`LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
-If the `Pipeline` had more stages, it would call the 
`LogisticRegressionModel`'s `transform()`
-method on the `DataFrame` before passing the `DataFrame` to the next stage.
-
-A `Pipeline` is an `Estimator`.
-Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, 
which is a
-`Transformer`.
-This `PipelineModel` is used at *test time*; the figure below illustrates this 
usage.
-
-<p style="text-align: center;">
-  <img
-    src="img/ml-PipelineModel.png"
-    title="Spark ML PipelineModel Example"
-    alt="Spark ML PipelineModel Example"
-    width="80%"
-  />
-</p>
-
-In the figure above, the `PipelineModel` has the same number of stages as the 
original `Pipeline`, but all `Estimator`s in the original `Pipeline` have 
become `Transformer`s.
-When the `PipelineModel`'s `transform()` method is called on a test dataset, 
the data are passed
-through the fitted pipeline in order.
-Each stage's `transform()` method updates the dataset and passes it to the 
next stage.
-
-`Pipeline`s and `PipelineModel`s help to ensure that training and test data go 
through identical feature processing steps.
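The fit/transform semantics described in the text above (now moved to ml-pipeline.html) can be mimicked by a toy, framework-free sketch; the class names here are illustrative only and are not the Spark API:

```python
class Transformer:
    """A stage that maps one dataset to another (here: a list of dict rows)."""
    def transform(self, df):
        raise NotImplementedError

class Estimator:
    """A stage that is fit on a dataset and produces a Transformer."""
    def fit(self, df):
        raise NotImplementedError

class PipelineModel(Transformer):
    """Fitted pipeline: every stage is a Transformer at test time."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, df):
        for stage in self.stages:
            df = stage.transform(df)
        return df

class Pipeline(Estimator):
    """Runs stages in order: Transformers transform the data; Estimators
    are fit, and the fitted model transforms the data for the next stage."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)    # produces a Transformer
            fitted.append(stage)
            df = stage.transform(df)     # feed the next stage
        return PipelineModel(fitted)

# Toy stages standing in for Tokenizer / a learning algorithm:
class Tokenizer(Transformer):
    def transform(self, df):
        return [dict(row, words=row["text"].split()) for row in df]

class AvgLengthEstimator(Estimator):
    """'Learns' the mean token count seen at fit time."""
    def fit(self, df):
        avg = sum(len(r["words"]) for r in df) / len(df)
        class Model(Transformer):
            def transform(self, inner):
                return [dict(row, avg_train_len=avg) for row in inner]
        return Model()

model = Pipeline([Tokenizer(), AvgLengthEstimator()]).fit(
    [{"text": "a b"}, {"text": "c d e f"}])
result = model.transform([{"text": "x y z"}])
```

Because training and test data run through the same fitted stages, both see identical feature processing, which is the guarantee the moved text describes.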
-
-### Details
-
-*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array.  
The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in 
which each stage uses data produced by the previous stage.  It is possible to 
create non-linear `Pipeline`s as long as the data flow graph forms a Directed 
Acyclic Graph (DAG).  This graph is currently specified implicitly based on the 
input and output column names of each stage (generally specified as 
parameters).  If the `Pipeline` forms a DAG, then the stages must be specified 
in topological order.
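The topological-order requirement described above can be checked mechanically from the input and output column names of each stage. A hedged sketch (the pair-based stage representation is illustrative, not the Spark API):

```python
def respects_topological_order(initial_cols, stages):
    """Return True if every stage reads only columns already available,
    i.e. the implicit column-name DAG is listed in topological order.
    `stages` is a list of (input_cols, output_col) pairs."""
    available = set(initial_cols)
    for input_cols, output_col in stages:
        if not set(input_cols) <= available:
            return False               # stage reads a not-yet-produced column
        available.add(output_col)
    return True

linear = [(["text"], "words"), (["words"], "features"),
          (["features"], "prediction")]
out_of_order = [(["words"], "features"), (["text"], "words")]
```

Here `linear` passes the check while `out_of_order` fails, since its first stage reads a column produced only later.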
-
-*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied 
types, they cannot use
-compile-time type checking.
-`Pipeline`s and `PipelineModel`s instead do runtime checking before actually 
running the `Pipeline`.
-This type checking is done using the `DataFrame` *schema*, a description of 
the data types of columns in the `DataFrame`.
-
-*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances.  
E.g., the same instance
-`myHashingTF` should not be inserted into the `Pipeline` twice since 
`Pipeline` stages must have
-unique IDs.  However, different instances `myHashingTF1` and `myHashingTF2` 
(both of type `HashingTF`)
-can be put into the same `Pipeline` since different instances will be created 
with different IDs.
-
-## Parameters
-
-Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying 
parameters.
-
-A `Param` is a named parameter with self-contained documentation.
-A `ParamMap` is a set of (parameter, value) pairs.
-
-There are two main ways to pass parameters to an algorithm:
-
-1. Set parameters for an instance.  E.g., if `lr` is an instance of 
`LogisticRegression`, one could
-   call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
-   This API resembles the API used in `spark.mllib` package.
-2. Pass a `ParamMap` to `fit()` or `transform()`.  Any parameters in the 
`ParamMap` will override parameters previously specified via setter methods.
-
-Parameters belong to specific instances of `Estimator`s and `Transformer`s.
-For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, 
then we can build a `ParamMap` with both `maxIter` parameters specified: 
`ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
-This is useful if there are two algorithms with the `maxIter` parameter in a 
`Pipeline`.
-
-## Saving and Loading Pipelines
-
-Often times it is worth it to save a model or a pipeline to disk for later 
use. In Spark 1.6, a model import/export functionality was added to the 
Pipeline API. Most basic transformers are supported as well as some of the more 
basic ML models. Please refer to the algorithm's API documentation to see if 
saving and loading is supported.
-
-# Code examples
-
-This section gives code examples illustrating the functionality discussed 
above.
-For more info, please refer to the API documentation
-([Scala](api/scala/index.html#org.apache.spark.ml.package),
-[Java](api/java/org/apache/spark/ml/package-summary.html),
-and [Python](api/python/pyspark.ml.html)).
-Some Spark ML algorithms are wrappers for `spark.mllib` algorithms, and the
-[MLlib programming guide](mllib-guide.html) has details on specific algorithms.
-
-## Example: Estimator, Transformer, and Param
-
-This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+There are also utility methods available for converting single instances of 
+vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / 
`mllib.linalg.Matrix`
+for converting to `ml.linalg` types, and 
+`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` 
+for converting to `mllib.linalg` types.
 
 <div class="codetabs">
+<div data-lang="scala"  markdown="1">
 
-<div data-lang="scala">
-{% include_example 
scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
-</div>
+{% highlight scala %}
+import org.apache.spark.mllib.util.MLUtils
 
-<div data-lang="java">
-{% include_example 
java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
-</div>
+// convert DataFrame columns
+val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+// convert a single vector or matrix
+val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
+val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
+{% endhighlight %}
 
-<div data-lang="python">
-{% include_example python/ml/estimator_transformer_param_example.py %}
-</div>
-
-</div>
-
-## Example: Pipeline
-
-This example follows the simple text document `Pipeline` illustrated in the 
figures above.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
-</div>
-
-<div data-lang="java">
-{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java 
%}
-</div>
-
-<div data-lang="python">
-{% include_example python/ml/pipeline_example.py %}
-</div>
-
-</div>
-
-## Example: model selection via cross-validation
-
-An important task in ML is *model selection*, or using data to find the best 
model or parameters for a given task.  This is also called *tuning*.
-`Pipeline`s facilitate model selection by making it easy to tune an entire 
`Pipeline` at once, rather than tuning each element in the `Pipeline` 
separately.
-
-Currently, `spark.ml` supports model selection using the 
[`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator)
 class, which takes an `Estimator`, a set of `ParamMap`s, and an 
[`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator).
-`CrossValidator` begins by splitting the dataset into a set of *folds* which 
are used as separate training and test datasets; e.g., with `$k=3$` folds, 
`CrossValidator` will generate 3 (training, test) dataset pairs, each of which 
uses 2/3 of the data for training and 1/3 for testing.
-`CrossValidator` iterates through the set of `ParamMap`s. For each `ParamMap`, 
it trains the given `Estimator` and evaluates it using the given `Evaluator`.
-
-The `Evaluator` can be a 
[`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
-for regression problems, a 
[`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
-for binary data, or a 
[`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
-for multiclass problems. The default metric used to choose the best `ParamMap` 
can be overridden by the `setMetricName`
-method in each of these evaluators.
-
-The `ParamMap` which produces the best evaluation metric (averaged over the 
`$k$` folds) is selected as the best model.
-`CrossValidator` finally fits the `Estimator` using the best `ParamMap` and 
the entire dataset.
-
-The following example demonstrates using `CrossValidator` to select from a 
grid of parameters.
-To help construct the parameter grid, we use the 
[`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder)
 utility.
-
-Note that cross-validation over a grid of parameters is expensive.
-E.g., in the example below, the parameter grid has 3 values for 
`hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` 
uses 2 folds.  This multiplies out to `$(3 \times 2) \times 2 = 12$` different 
models being trained.
-In realistic settings, it can be common to try many more parameters and use 
more folds (`$k=3$` and `$k=10$` are common).
-In other words, using `CrossValidator` can be very expensive.
-However, it is also a well-established method for choosing parameters which is 
more statistically sound than heuristic hand-tuning.
-
-<div class="codetabs">
-
-<div data-lang="scala">
-{% include_example 
scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala
 %}
-</div>
-
-<div data-lang="java">
-{% include_example 
java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java
 %}
-</div>
-
-<div data-lang="python">
-
-{% include_example python/ml/cross_validator.py %}
-</div>
-
-</div>
-
-## Example: model selection via train validation split
-In addition to  `CrossValidator` Spark also offers `TrainValidationSplit` for 
hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once, as 
opposed to k times in
- the case of `CrossValidator`. It is therefore less expensive,
- but will not produce as reliable results when the training dataset is not 
sufficiently large.
-
-`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in 
the `estimatorParamMaps` parameter,
-and an `Evaluator`.
-It begins by splitting the dataset into two parts using the `trainRatio` 
parameter
-which are used as separate training and test datasets. For example with 
`$trainRatio=0.75$` (default),
-`TrainValidationSplit` will generate a training and test dataset pair where 
75% of the data is used for training and 25% for validation.
-Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the 
set of `ParamMap`s.
-For each combination of parameters, it trains the given `Estimator` and 
evaluates it using the given `Evaluator`.
-The `ParamMap` which produces the best evaluation metric is selected as the 
best option.
-`TrainValidationSplit` finally fits the `Estimator` using the best `ParamMap` 
and the entire dataset.
-
-<div class="codetabs">
-
-<div data-lang="scala" markdown="1">
-{% include_example 
scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala
 %}
+Refer to the [`MLUtils` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further 
detail.
 </div>
 
 <div data-lang="java" markdown="1">
-{% include_example 
java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java
 %}
-</div>
 
-<div data-lang="python">
-{% include_example python/ml/train_validation_split.py %}
-</div>
+{% highlight java %}
+import org.apache.spark.mllib.util.MLUtils;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+
+// convert DataFrame columns
+Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
+Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
+// convert a single vector or matrix
+org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
+org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
+{% endhighlight %}
+
+Refer to the [`MLUtils` Java 
docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
+</div>
+
+<div data-lang="python"  markdown="1">
+
+{% highlight python %}
+from pyspark.mllib.util import MLUtils
+
+# convert DataFrame columns
+convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
+convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
+# convert a single vector or matrix
+mlVec = mllibVec.asML()
+mlMat = mllibMat.asML()
+{% endhighlight %}
+
+Refer to the [`MLUtils` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further 
detail.
+</div>
+</div>
+
+**Deprecated methods removed**
+
+Several deprecated methods were removed in the `spark.mllib` and `spark.ml` 
packages:
+
+* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
+* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
+* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as 
`DeveloperApi`)
+* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these 
functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
+* `defaultStategy` in `mllib.tree.configuration.Strategy`
+* `build` in `mllib.tree.Node`
+* The libsvm loaders for multiclass classification and the load/save methods for labeled data in `mllib.util.MLUtils`
+
+A full list of breaking changes can be found at 
[SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+Deprecations in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+ In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been 
deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+ In `spark.ml.regression.RandomForestRegressionModel` and 
`spark.ml.classification.RandomForestClassificationModel`,
+ the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+ In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+ All functionality in overridden methods has been moved to the corresponding `transformSchema` methods.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+ In the `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+ We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression` instead.
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+ In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, 
`recall` and `fMeasure` have been deprecated in favor of `accuracy`.
+* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
+ In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` 
method has been deprecated in favor of `session`.
+* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been 
deprecated since it was not used by `ChiSqSelectorModel`.
+
+**Changes of behavior**
+
+Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
+
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+ `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification.
+ This will introduce the following behavior changes for 
`spark.mllib.classification.LogisticRegressionWithLBFGS`:
+    * The intercept will not be regularized when training a binary classification model with an L1/L2 `Updater`.
+    * When training without regularization, training with or without feature scaling will return the same solution at the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+ To provide a better and more consistent result with `spark.ml.classification.LogisticRegression`,
+ the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6.
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+ Fixed a bug in `PowerIterationClustering` which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+ `LDA` using the `EM` optimizer will keep the last checkpoint by default, if 
checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+ `Word2Vec` now respects sentence boundaries. Previously, it did not handle 
them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+ `HashingTF` now uses `MurmurHash3` as the default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+ The `expectedType` argument for PySpark `Param` was removed.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+ Some default `Param` values, which were mismatched between pipelines in Scala 
and Python, have been changed.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+ `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (it previously used custom sampling logic).
+ The output buckets will differ for the same input data and parameters.
+
+## Previous Spark versions
+
+Earlier migration guides are archived [on this page](ml-migration-guides.html).
 
-</div>
+---

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/ml-linear-methods.md b/docs/ml-linear-methods.md
index a875483..eb39173 100644
--- a/docs/ml-linear-methods.md
+++ b/docs/ml-linear-methods.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Linear methods - spark.ml
-displayTitle: Linear methods - spark.ml
+title: Linear methods
+displayTitle: Linear methods
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
new file mode 100644
index 0000000..82bf9d7
--- /dev/null
+++ b/docs/ml-migration-guides.md
@@ -0,0 +1,159 @@
+---
+layout: global
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
+description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+---
+
+The migration guide for the current Spark version is kept on the [MLlib Guide 
main page](ml-guide.html#migration-guide).
+
+## From 1.5 to 1.6
+
+There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, 
but there are
+deprecations and changes of behavior.
+
+Deprecations:
+
+* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
+ In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
+* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
+ In `spark.ml.classification.LogisticRegressionModel` and
+ `spark.ml.regression.LinearRegressionModel`, the `weights` field has been 
deprecated in favor of
+ the new name `coefficients`.  This helps disambiguate from instance (row) 
"weights" given to
+ algorithms.
+
+Changes of behavior:
+
+* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
+ `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed 
semantics in 1.6.
+ Previously, it was a threshold for absolute change in error. Now, it 
resembles the behavior of
+ `GradientDescent`'s `convergenceTol`: For large errors, it uses relative 
error (relative to the
+ previous error); for small errors (`< 0.01`), it uses absolute error.
+* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
+ `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to 
lowercase before
+ tokenizing. Now, it converts to lowercase by default, with an option not to. 
This matches the
+ behavior of the simpler `Tokenizer` transformer.
+
+## From 1.4 to 1.5
+
+In the `spark.mllib` package, there are no breaking API changes but several 
behavior changes:
+
+* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
+  `RegressionMetrics.explainedVariance` returns the average regression sum of 
squares.
+* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): 
`NaiveBayesModel.labels` are now sorted.
+* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): 
`GradientDescent` has a default
+  convergence tolerance of `1e-3`, and hence iterations might end earlier than in 1.4.
+
+In the `spark.ml` package, there exists one breaking API change and one 
behavior change:
+
+* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's 
varargs support is removed
+  from `Params.setDefault` due to a
+  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
+* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): 
`Evaluator.isLargerBetter` is
+  added to indicate metric ordering. Metrics like RMSE no longer flip signs as 
in 1.4.
+
+## From 1.3 to 1.4
+
+In the `spark.mllib` package, there were several breaking changes, but all in 
`DeveloperApi` or `Experimental` APIs:
+
+* Gradient-Boosted Trees
+    * *(Breaking change)* The signature of the 
[`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) 
method was changed.  This is only an issue for users who wrote their own 
losses for GBTs.
+    * *(Breaking change)* The `apply` and `copy` methods for the case class 
[`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy)
 have been changed because of a modification to the case class fields.  This 
could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+* *(Breaking change)* The return value of 
[`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has 
changed.  It now returns an abstract class `LDAModel` instead of the concrete 
class `DistributedLDAModel`.  The object of type `LDAModel` can still be cast 
to the appropriate concrete type, which depends on the optimization algorithm.
+
+In the `spark.ml` package, several major API changes occurred, including:
+
+* `Param` and other APIs for specifying parameters
+* `uid` unique IDs for Pipeline components
+* Reorganization of certain classes
+
+Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list 
all changes here.
+However, since 1.4 `spark.ml` is no longer an alpha component, we will provide 
details on any API
+changes for future releases.
+
+## From 1.2 to 1.3
+
+In the `spark.mllib` package, there were several breaking changes.  The first 
change (in `ALS`) is the only one in a component not marked as Alpha or 
Experimental.
+
+* *(Breaking change)* In 
[`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the 
extraneous method `solveLeastSquares` has been removed.  The `DeveloperApi` 
method `analyzeBlocks` was also removed.
+* *(Breaking change)* 
[`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel)
 remains an Alpha component. In it, the `variance` method has been replaced 
with the `std` method.  To compute the column variance values returned by the 
original `variance` method, simply square the standard deviation values 
returned by `std`.
+* *(Breaking change)* 
[`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD)
 remains an Experimental component.  In it, there were two changes:
+    * The constructor taking arguments was removed in favor of a builder 
pattern using the default constructor plus parameter setter methods.
+    * Variable `model` is no longer public.
+* *(Breaking change)* 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) 
remains an Experimental component.  In it and its associated classes, there 
were several changes:
+    * In `DecisionTree`, the deprecated class method `train` has been removed. 
 (The object/static `train` methods remain.)
+    * In `Strategy`, the `checkpointDir` parameter has been removed.  
Checkpointing is still supported, but the checkpoint directory must be set 
before calling tree and tree ensemble training.
+* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was 
a public API but is now private, declared `private[python]`.  This was never 
meant for external use.
+* In linear regression (including Lasso and ridge regression), the squared 
loss is now divided by 2.
+  So in order to produce the same result as in 1.2, the regularization 
parameter needs to be divided by 2 and the step size needs to be multiplied by 
2.
+
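+A brief sketch of why (using generic symbols, not Spark identifiers): if the 1.2 objective was
+`$L(w) = \sum_i (w^T x_i - y_i)^2 + \lambda R(w)$` and the 1.3 objective is
+`$L'(w) = \frac{1}{2} \sum_i (w^T x_i - y_i)^2 + \lambda' R(w)$`, then `$L' = L / 2$` exactly when
+`$\lambda' = \lambda / 2$`.  Since the gradient of `$L'$` is then half the gradient of `$L$`,
+doubling the step size reproduces the same gradient-descent updates as in 1.2.
+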
+In the `spark.ml` package, the main API changes are from Spark SQL.  We list 
the most important changes here:
+
+* The old 
[SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD)
 has been replaced with 
[DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a 
somewhat modified API.  All algorithms in `spark.ml` which used to use 
SchemaRDD now use DataFrame.
+* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` 
into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an 
instance of `SQLContext`.  These implicits have been moved, so we now call 
`import sqlContext.implicits._`.
+* Java APIs for SQL have also changed accordingly.  Please see the examples 
above and the [Spark SQL Programming Guide](sql-programming-guide.html) for 
details.
+
+Other changes were in `LogisticRegression`:
+
+* The `scoreCol` output column (with default value "score") was renamed to 
`probabilityCol` (with default value "probability").  The type was originally 
`Double` (for the probability of class 1.0), but it is now `Vector` (for the 
probability of each class, to support multiclass classification in the future).
+* In Spark 1.2, `LogisticRegressionModel` did not include an intercept.  In 
Spark 1.3, it includes an intercept; however, it will always be 0.0 since it 
uses the default settings for 
[spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS).
  The option to use an intercept will be added in the future.
+
+## From 1.1 to 1.2
+
+The only API changes in MLlib v1.2 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.2:
+
+1. *(Breaking change)* The Scala API for classification takes a named argument 
specifying the number
+of classes.  In MLlib v1.1, this argument was called `numClasses` in Python and
+`numClassesForClassification` in Scala.  In MLlib v1.2, the names are both set 
to `numClasses`.
+This `numClasses` parameter is specified either via
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Breaking change)* The API for
+[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has 
changed.
+This should generally not affect user code, unless the user manually 
constructs decision trees
+(instead of using the `trainClassifier` or `trainRegressor` methods).
+The tree `Node` now includes more information, including the probability of 
the predicted label
+(for classification).
+
+3. Printing methods' output has changed.  The `toString` (Scala/Java) and 
`__repr__` (Python) methods used to print the full model; they now print a 
summary.  For the full model, use `toDebugString`.
+
+Examples in the Spark distribution and examples in the
+[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated 
accordingly.
+
+## From 1.0 to 1.1
+
+The only API changes in MLlib v1.1 are in
+[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+which continues to be an experimental API in MLlib 1.1:
+
+1. *(Breaking change)* The meaning of tree depth has been changed by 1 in 
order to match
+the implementations of trees in
+[scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
+and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
+In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root 
node and 2 leaf nodes.
+In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root 
node and 2 leaf nodes.
+This depth is specified by the `maxDepth` parameter in
+[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+or via 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+static `trainClassifier` and `trainRegressor` methods.
+
+2. *(Non-breaking change)* We recommend using the newly added 
`trainClassifier` and `trainRegressor`
+methods to build a 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+rather than using the old parameter class `Strategy`.  These new training 
methods explicitly
+separate classification and regression, and they replace specialized parameter 
types with
+simple `String` types.
+
+Examples of the new, recommended `trainClassifier` and `trainRegressor` are 
given in the
+[Decision Trees Guide](mllib-decision-tree.html#examples).
+
+## From 0.9 to 1.0
+
+In MLlib v1.0, we support both dense and sparse input in a unified way, which 
introduces a few
+breaking changes.  If your data is sparse, please store it in a sparse format 
instead of dense to
+take advantage of sparsity in both storage and computation. Details are 
described below.
+

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-pipeline.md
----------------------------------------------------------------------
diff --git a/docs/ml-pipeline.md b/docs/ml-pipeline.md
new file mode 100644
index 0000000..adb057b
--- /dev/null
+++ b/docs/ml-pipeline.md
@@ -0,0 +1,245 @@
+---
+layout: global
+title: ML Pipelines
+displayTitle: ML Pipelines
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+In this section, we introduce the concept of ***ML Pipelines***.
+ML Pipelines provide a uniform set of high-level APIs built on top of
+[DataFrames](sql-programming-guide.html) that help users create and tune 
practical
+machine learning pipelines.
+
+**Table of Contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Main concepts in Pipelines
+
+MLlib standardizes APIs for machine learning algorithms to make it easier to 
combine multiple
+algorithms into a single pipeline, or workflow.
+This section covers the key concepts introduced by the Pipelines API, where 
the pipeline concept is
+mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
+
+* **[`DataFrame`](ml-guide.html#dataframe)**: This ML API uses `DataFrame` 
from Spark SQL as an ML
+  dataset, which can hold a variety of data types.
+  E.g., a `DataFrame` could have different columns storing text, feature 
vectors, true labels, and predictions.
+
+* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an 
algorithm which can transform one `DataFrame` into another `DataFrame`.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with 
features into a `DataFrame` with predictions.
+
+* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm 
which can be fit on a `DataFrame` to produce a `Transformer`.
+E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and 
produces a model.
+
+* **[`Pipeline`](ml-guide.html#pipeline)**: A `Pipeline` chains multiple 
`Transformer`s and `Estimator`s together to specify an ML workflow.
+
+* **[`Parameter`](ml-guide.html#parameters)**: All `Transformer`s and 
`Estimator`s now share a common API for specifying parameters.
+
+## DataFrame
+
+Machine learning can be applied to a wide variety of data types, such as 
vectors, text, images, and structured data.
+This API adopts the `DataFrame` from Spark SQL in order to support a variety 
of data types.
+
+`DataFrame` supports many basic and structured types; see the [Spark SQL 
datatype reference](sql-programming-guide.html#spark-sql-datatype-reference) 
for a list of supported types.
+In addition to the types listed in the Spark SQL guide, `DataFrame` can use ML 
[`Vector`](mllib-data-types.html#local-vector) types.
+
+A `DataFrame` can be created either implicitly or explicitly from a regular 
`RDD`.  See the code examples below and the [Spark SQL programming 
guide](sql-programming-guide.html) for examples.
+
+Columns in a `DataFrame` are named.  The code examples below use names such as 
"text," "features," and "label."
+
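+For example, assuming an active `SparkSession` named `spark` (an assumption for this sketch), a
+small `DataFrame` with named columns can be created from local data:
+
+{% highlight scala %}
+// Prepare a DataFrame from a list of (label, text) tuples.
+val training = spark.createDataFrame(Seq(
+  (1.0, "spark is great"),
+  (0.0, "who knows")
+)).toDF("label", "text")
+{% endhighlight %}
+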
+## Pipeline components
+
+### Transformers
+
+A `Transformer` is an abstraction that includes feature transformers and 
learned models.
+Technically, a `Transformer` implements a method `transform()`, which converts 
one `DataFrame` into
+another, generally by appending one or more columns.
+For example:
+
+* A feature transformer might take a `DataFrame`, read a column (e.g., text), 
map it into a new
+  column (e.g., feature vectors), and output a new `DataFrame` with the mapped 
column appended.
+* A learning model might take a `DataFrame`, read the column containing 
feature vectors, predict the
+  label for each feature vector, and output a new `DataFrame` with predicted 
labels appended as a
+  column.
+
+### Estimators
+
+An `Estimator` abstracts the concept of a learning algorithm or any algorithm 
that fits or trains on
+data.
+Technically, an `Estimator` implements a method `fit()`, which accepts a 
`DataFrame` and produces a
+`Model`, which is a `Transformer`.
+For example, a learning algorithm such as `LogisticRegression` is an 
`Estimator`, and calling
+`fit()` trains a `LogisticRegressionModel`, which is a `Model` and hence a 
`Transformer`.
+
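+As a minimal sketch (assuming a `DataFrame` named `training` with "label" and "features" columns):
+
+{% highlight scala %}
+import org.apache.spark.ml.classification.LogisticRegression
+
+val lr = new LogisticRegression()            // an Estimator
+val model = lr.fit(training)                 // fit() produces a Model, which is a Transformer
+val predictions = model.transform(training)  // transform() appends prediction columns
+{% endhighlight %}
+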
+### Properties of pipeline components
+
+`Transformer.transform()`s and `Estimator.fit()`s are both stateless.  In the 
future, stateful algorithms may be supported via alternative concepts.
+
+Each instance of a `Transformer` or `Estimator` has a unique ID, which is 
useful in specifying parameters (discussed below).
+
+## Pipeline
+
+In machine learning, it is common to run a sequence of algorithms to process 
and learn from data.
+E.g., a simple text document processing workflow might include several stages:
+
+* Split each document's text into words.
+* Convert each document's words into a numerical feature vector.
+* Learn a prediction model using the feature vectors and labels.
+
+MLlib represents such a workflow as a `Pipeline`, which consists of a sequence 
of
+`PipelineStage`s (`Transformer`s and `Estimator`s) to be run in a specific 
order.
+We will use this simple workflow as a running example in this section.
+
+### How it works
+
+A `Pipeline` is specified as a sequence of stages, and each stage is either a 
`Transformer` or an `Estimator`.
+These stages are run in order, and the input `DataFrame` is transformed as it 
passes through each stage.
+For `Transformer` stages, the `transform()` method is called on the 
`DataFrame`.
+For `Estimator` stages, the `fit()` method is called to produce a 
`Transformer` (which becomes part of the `PipelineModel`, or fitted 
`Pipeline`), and that `Transformer`'s `transform()` method is called on the 
`DataFrame`.
+
+We illustrate this for the simple text document workflow.  The figure below is 
for the *training time* usage of a `Pipeline`.
+
+<p style="text-align: center;">
+  <img
+    src="img/ml-Pipeline.png"
+    title="ML Pipeline Example"
+    alt="ML Pipeline Example"
+    width="80%"
+  />
+</p>
+
+Above, the top row represents a `Pipeline` with three stages.
+The first two (`Tokenizer` and `HashingTF`) are `Transformer`s (blue), and the 
third (`LogisticRegression`) is an `Estimator` (red).
+The bottom row represents data flowing through the pipeline, where cylinders 
indicate `DataFrame`s.
+The `Pipeline.fit()` method is called on the original `DataFrame`, which has 
raw text documents and labels.
+The `Tokenizer.transform()` method splits the raw text documents into words, 
adding a new column with words to the `DataFrame`.
+The `HashingTF.transform()` method converts the words column into feature 
vectors, adding a new column with those vectors to the `DataFrame`.
+Now, since `LogisticRegression` is an `Estimator`, the `Pipeline` first calls 
`LogisticRegression.fit()` to produce a `LogisticRegressionModel`.
+If the `Pipeline` had more stages, it would call the 
`LogisticRegressionModel`'s `transform()`
+method on the `DataFrame` before passing the `DataFrame` to the next stage.
+
+A `Pipeline` is an `Estimator`.
+Thus, after a `Pipeline`'s `fit()` method runs, it produces a `PipelineModel`, 
which is a
+`Transformer`.
+This `PipelineModel` is used at *test time*; the figure below illustrates this 
usage.
+
+<p style="text-align: center;">
+  <img
+    src="img/ml-PipelineModel.png"
+    title="ML PipelineModel Example"
+    alt="ML PipelineModel Example"
+    width="80%"
+  />
+</p>
+
+In the figure above, the `PipelineModel` has the same number of stages as the 
original `Pipeline`, but all `Estimator`s in the original `Pipeline` have 
become `Transformer`s.
+When the `PipelineModel`'s `transform()` method is called on a test dataset, 
the data are passed
+through the fitted pipeline in order.
+Each stage's `transform()` method updates the dataset and passes it to the 
next stage.
+
+`Pipeline`s and `PipelineModel`s help to ensure that training and test data go 
through identical feature processing steps.
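
The training-time and test-time control flow described above can be sketched in plain Python. This is a hypothetical, slightly simplified illustration (it transforms after every stage, including the last, and detects `Estimator` stages by the presence of a `fit` method); it is not the Spark implementation, and the demo stages are invented:

```python
class Pipeline:
    """Run stages in order; fit Estimator stages, collect fitted Transformers."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, df):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # Estimator stage
                stage = stage.fit(df)      # produces a Transformer (a Model)
            fitted.append(stage)
            df = stage.transform(df)       # pass the data to the next stage
        return PipelineModel(fitted)

class PipelineModel:
    """The fitted Pipeline: a Transformer over all fitted stages."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, df):
        for stage in self.stages:
            df = stage.transform(df)
        return df

# Tiny invented stages to exercise the sketch:
class AddOne:
    def transform(self, df):
        return [x + 1 for x in df]

class MeanCenter:
    def fit(self, df):
        m = sum(df) / len(df)
        class Centered:
            def transform(self, inner, m=m):
                return [x - m for x in inner]
        return Centered()

model = Pipeline([AddOne(), MeanCenter()]).fit([0.0, 2.0])  # training time
result = model.transform([0.0, 2.0])                        # test time
```

Because the same fitted stages run at test time, training and test data necessarily receive identical processing.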
+
+### Details
+
+*DAG `Pipeline`s*: A `Pipeline`'s stages are specified as an ordered array.  
The examples given here are all for linear `Pipeline`s, i.e., `Pipeline`s in 
which each stage uses data produced by the previous stage.  It is possible to 
create non-linear `Pipeline`s as long as the data flow graph forms a Directed 
Acyclic Graph (DAG).  This graph is currently specified implicitly based on the 
input and output column names of each stage (generally specified as 
parameters).  If the `Pipeline` forms a DAG, then the stages must be specified 
in topological order.
+
+*Runtime checking*: Since `Pipeline`s can operate on `DataFrame`s with varied 
types, they cannot use
+compile-time type checking.
+`Pipeline`s and `PipelineModel`s instead do runtime checking before actually 
running the `Pipeline`.
+This type checking is done using the `DataFrame` *schema*, a description of 
the data types of columns in the `DataFrame`.
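
A hypothetical sketch of this kind of check, with the schema modeled as a plain name-to-type mapping (an illustration of the idea only, not the Spark implementation):

```python
def validate_input(schema, required_cols):
    """Fail before the Pipeline runs if a required input column is missing
    or has the wrong type; mirrors the idea of DataFrame schema checking."""
    for name, expected in required_cols.items():
        if name not in schema:
            raise ValueError(f"missing input column: {name}")
        if schema[name] != expected:
            raise ValueError(
                f"column {name} has type {schema[name]}, expected {expected}")

schema = {"text": "string", "label": "double"}
validate_input(schema, {"text": "string"})  # passes: column present, type matches
```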
+
+*Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances.  
E.g., the same instance
+`myHashingTF` should not be inserted into the `Pipeline` twice since 
`Pipeline` stages must have
+unique IDs.  However, different instances `myHashingTF1` and `myHashingTF2` 
(both of type `HashingTF`)
+can be put into the same `Pipeline` since different instances will be created 
with different IDs.
+
+## Parameters
+
+MLlib `Estimator`s and `Transformer`s use a uniform API for specifying 
parameters.
+
+A `Param` is a named parameter with self-contained documentation.
+A `ParamMap` is a set of (parameter, value) pairs.
+
+There are two main ways to pass parameters to an algorithm:
+
+1. Set parameters for an instance.  E.g., if `lr` is an instance of `LogisticRegression`, one could
+   call `lr.setMaxIter(10)` to make `lr.fit()` use at most 10 iterations.
+   This API resembles the API used in the `spark.mllib` package.
+2. Pass a `ParamMap` to `fit()` or `transform()`.  Any parameters in the 
`ParamMap` will override parameters previously specified via setter methods.
+
+Parameters belong to specific instances of `Estimator`s and `Transformer`s.
+For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, 
then we can build a `ParamMap` with both `maxIter` parameters specified: 
`ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
+This is useful if there are two algorithms with the `maxIter` parameter in a 
`Pipeline`.
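
Both ways of passing parameters, and the rule that a `ParamMap` overrides setter values, can be sketched in plain Python (a hypothetical illustration; the class, default values, and return value of `fit()` are invented to mirror the text, not the Spark API):

```python
class LogisticRegressionSketch:
    def __init__(self):
        self.params = {"maxIter": 100, "regParam": 0.0}  # invented defaults

    def setMaxIter(self, value):       # way 1: setter on the instance
        self.params["maxIter"] = value
        return self

    def fit(self, df, param_map=None):
        # Way 2: a ParamMap passed to fit() overrides previously set values.
        effective = {**self.params, **(param_map or {})}
        return effective  # a real fit() would train and return a Model

lr = LogisticRegressionSketch()
lr.setMaxIter(10)
used = lr.fit(df=[], param_map={"maxIter": 30})  # ParamMap wins: maxIter is 30
```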
+
+## Saving and Loading Pipelines
+
+It is often worthwhile to save a model or a pipeline to disk for later use. In Spark 1.6, model import/export functionality was added to the Pipeline API. Most basic transformers are supported, as are some of the more basic ML models. Please refer to an algorithm's API documentation to see whether saving and loading are supported.
+
+# Code examples
+
+This section gives code examples illustrating the functionality discussed 
above.
+For more info, please refer to the API documentation
+([Scala](api/scala/index.html#org.apache.spark.ml.package),
+[Java](api/java/org/apache/spark/ml/package-summary.html),
+and [Python](api/python/pyspark.ml.html)).
+
+## Example: Estimator, Transformer, and Param
+
+This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example 
scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example 
java/org/apache/spark/examples/ml/JavaEstimatorTransformerParamExample.java %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/estimator_transformer_param_example.py %}
+</div>
+
+</div>
+
+## Example: Pipeline
+
+This example follows the simple text document `Pipeline` illustrated in the 
figures above.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
+</div>
+
+<div data-lang="java">
+{% include_example java/org/apache/spark/examples/ml/JavaPipelineExample.java 
%}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/pipeline_example.py %}
+</div>
+
+</div>
+
+## Model selection (hyperparameter tuning)
+
+A major benefit of using ML Pipelines is hyperparameter optimization.  See the [ML Tuning Guide](ml-tuning.html) for more information on automatic model selection.

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-survival-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-survival-regression.md b/docs/ml-survival-regression.md
index 856ceb2..efa3c21 100644
--- a/docs/ml-survival-regression.md
+++ b/docs/ml-survival-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Survival Regression - spark.ml
-displayTitle: Survival Regression - spark.ml
+title: Survival Regression
+displayTitle: Survival Regression
 ---
 
   > This section has been moved into the

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/ml-tuning.md
----------------------------------------------------------------------
diff --git a/docs/ml-tuning.md b/docs/ml-tuning.md
new file mode 100644
index 0000000..2ca90c7
--- /dev/null
+++ b/docs/ml-tuning.md
@@ -0,0 +1,121 @@
+---
+layout: global
+title: "ML Tuning"
+displayTitle: "ML Tuning: model selection and hyperparameter tuning"
+---
+
+`\[
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\E}{\mathbb{E}}
+\newcommand{\x}{\mathbf{x}}
+\newcommand{\y}{\mathbf{y}}
+\newcommand{\wv}{\mathbf{w}}
+\newcommand{\av}{\mathbf{\alpha}}
+\newcommand{\bv}{\mathbf{b}}
+\newcommand{\N}{\mathbb{N}}
+\newcommand{\id}{\mathbf{I}}
+\newcommand{\ind}{\mathbf{1}}
+\newcommand{\0}{\mathbf{0}}
+\newcommand{\unit}{\mathbf{e}}
+\newcommand{\one}{\mathbf{1}}
+\newcommand{\zero}{\mathbf{0}}
+\]`
+
+This section describes how to use MLlib's tooling for tuning ML algorithms and 
Pipelines.
+Built-in Cross-Validation and other tooling allow users to optimize 
hyperparameters in algorithms and Pipelines.
+
+**Table of contents**
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+# Model selection (a.k.a. hyperparameter tuning)
+
+An important task in ML is *model selection*, or using data to find the best 
model or parameters for a given task.  This is also called *tuning*.
+Tuning may be done for individual `Estimator`s such as `LogisticRegression`, 
or for entire `Pipeline`s which include multiple algorithms, featurization, and 
other steps.  Users can tune an entire `Pipeline` at once, rather than tuning 
each element in the `Pipeline` separately.
+
+MLlib supports model selection using tools such as 
[`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator)
 and 
[`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit).
+These tools require the following items:
+
+* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm 
or `Pipeline` to tune
+* Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter 
grid" to search over
+* 
[`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): 
metric to measure how well a fitted `Model` does on held-out test data
+
+At a high level, these model selection tools work as follows:
+
+* They split the input data into separate training and test datasets.
+* For each (training, test) pair, they iterate through the set of `ParamMap`s:
+  * For each `ParamMap`, they fit the `Estimator` using those parameters, get 
the fitted `Model`, and evaluate the `Model`'s performance using the 
`Evaluator`.
+* They select the `Model` produced by the best-performing set of parameters.
+
+The `Evaluator` can be a 
[`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
+for regression problems, a 
[`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
+for binary data, or a 
[`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
+for multiclass problems. The default metric used to choose the best `ParamMap` 
can be overridden by the `setMetricName`
+method in each of these evaluators.
+
+To help construct the parameter grid, users can use the 
[`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder)
 utility.
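
Conceptually, the parameter grid is just the cross product of the candidate values for each parameter. A plain-Python sketch of that expansion (an illustration only, not the `ParamGridBuilder` API; the parameter names are borrowed from the cross-validation example below):

```python
from itertools import product

def build_param_grid(grid):
    """Expand {param: [candidate values]} into one ParamMap per combination."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[n] for n in names))]

param_grid = build_param_grid({
    "hashingTF.numFeatures": [10, 100, 1000],
    "lr.regParam": [0.1, 0.01],
})
# 3 x 2 = 6 candidate ParamMaps
```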
+
+# Cross-Validation
+
+`CrossValidator` begins by splitting the dataset into a set of *folds* which 
are used as separate training and test datasets. E.g., with `$k=3$` folds, 
`CrossValidator` will generate 3 (training, test) dataset pairs, each of which 
uses 2/3 of the data for training and 1/3 for testing.  To evaluate a 
particular `ParamMap`, `CrossValidator` computes the average evaluation metric 
for the 3 `Model`s produced by fitting the `Estimator` on the 3 different 
(training, test) dataset pairs.
+
+After identifying the best `ParamMap`, `CrossValidator` finally re-fits the 
`Estimator` using the best `ParamMap` and the entire dataset.
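
The fold construction can be sketched as follows (a hypothetical single-machine illustration; Spark's `CrossValidator` performs the equivalent over a distributed dataset):

```python
def kfold_pairs(data, k):
    """Split data into k folds; each fold serves once as the test set,
    with the remaining folds combined as the training set."""
    folds = [data[i::k] for i in range(k)]  # simple round-robin assignment
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        pairs.append((train, test))
    return pairs

pairs = kfold_pairs(list(range(6)), k=3)
# 3 (training, test) pairs, each training on 2/3 of the data
```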
+
+## Example: model selection via cross-validation
+
+The following example demonstrates using `CrossValidator` to select from a 
grid of parameters.
+
+Note that cross-validation over a grid of parameters is expensive.
+E.g., in the example below, the parameter grid has 3 values for 
`hashingTF.numFeatures` and 2 values for `lr.regParam`, and `CrossValidator` 
uses 2 folds.  This multiplies out to `$(3 \times 2) \times 2 = 12$` different 
models being trained.
+In realistic settings, it can be common to try many more parameters and use 
more folds (`$k=3$` and `$k=10$` are common).
+In other words, using `CrossValidator` can be very expensive.
+However, it is also a well-established method for choosing parameters which is 
more statistically sound than heuristic hand-tuning.
+
+<div class="codetabs">
+
+<div data-lang="scala">
+{% include_example 
scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala
 %}
+</div>
+
+<div data-lang="java">
+{% include_example 
java/org/apache/spark/examples/ml/JavaModelSelectionViaCrossValidationExample.java
 %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/cross_validator.py %}
+</div>
+
+</div>
+
+# Train-Validation Split
+
+In addition to `CrossValidator`, Spark also offers `TrainValidationSplit` for hyperparameter tuning.
+`TrainValidationSplit` evaluates each combination of parameters only once, as opposed to k times in
+the case of `CrossValidator`. It is therefore less expensive,
+but it will not produce as reliable results when the training dataset is not sufficiently large.
+
+Unlike `CrossValidator`, `TrainValidationSplit` creates a single (training, 
test) dataset pair.
+It splits the dataset into these two parts using the `trainRatio` parameter. For example, with `$trainRatio=0.75$`,
+`TrainValidationSplit` will generate a training and test dataset pair where 
75% of the data is used for training and 25% for validation.
+
+Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` 
using the best `ParamMap` and the entire dataset.
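
The single split can be sketched as follows (a hypothetical illustration; in Spark the split is randomized over a distributed dataset rather than taken in row order):

```python
def train_validation_split(data, train_ratio=0.75):
    """Produce one (training, validation) pair: the first train_ratio
    fraction of the rows for training, the remainder for validation."""
    cut = int(len(data) * train_ratio)
    return data[:cut], data[cut:]

train, validation = train_validation_split(list(range(8)), train_ratio=0.75)
# 6 rows (75%) for training, 2 rows (25%) for validation
```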
+
+## Example: model selection via train-validation split
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example 
scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala
 %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example 
java/org/apache/spark/examples/ml/JavaModelSelectionViaTrainValidationSplitExample.java
 %}
+</div>
+
+<div data-lang="python">
+{% include_example python/ml/train_validation_split.py %}
+</div>
+
+</div>

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-classification-regression.md 
b/docs/mllib-classification-regression.md
index aaf8bd4..a7b90de 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Classification and Regression - spark.mllib
-displayTitle: Classification and Regression - spark.mllib
+title: Classification and Regression - RDD-based API
+displayTitle: Classification and Regression - RDD-based API
 ---
 
 The `spark.mllib` package supports various methods for 

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 073927c..d5f6ae3 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Clustering - spark.mllib
-displayTitle: Clustering - spark.mllib
+title: Clustering - RDD-based API
+displayTitle: Clustering - RDD-based API
 ---
 
 [Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an 
unsupervised learning problem whereby we aim to group subsets

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-collaborative-filtering.md 
b/docs/mllib-collaborative-filtering.md
index 5c33292..0f891a0 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - spark.mllib
-displayTitle: Collaborative Filtering - spark.mllib
+title: Collaborative Filtering - RDD-based API
+displayTitle: Collaborative Filtering - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index ef56aeb..7dd3c97 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Data Types - MLlib
-displayTitle: Data Types - MLlib
+title: Data Types - RDD-based API
+displayTitle: Data Types - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 11f5de1..0e753b8 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision Trees - spark.mllib
-displayTitle: Decision Trees - spark.mllib
+title: Decision Trees - RDD-based API
+displayTitle: Decision Trees - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md 
b/docs/mllib-dimensionality-reduction.md
index cceddce..539cbc1 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Dimensionality Reduction - spark.mllib
-displayTitle: Dimensionality Reduction - spark.mllib
+title: Dimensionality Reduction - RDD-based API
+displayTitle: Dimensionality Reduction - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 5543262..e1984b6 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Ensembles - spark.mllib
-displayTitle: Ensembles - spark.mllib
+title: Ensembles - RDD-based API
+displayTitle: Ensembles - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-evaluation-metrics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index c49bc4f..ac82f43 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Evaluation Metrics - spark.mllib
-displayTitle: Evaluation Metrics - spark.mllib
+title: Evaluation Metrics - RDD-based API
+displayTitle: Evaluation Metrics - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 67c033e..867be7f 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Feature Extraction and Transformation - spark.mllib
-displayTitle: Feature Extraction and Transformation - spark.mllib
+title: Feature Extraction and Transformation - RDD-based API
+displayTitle: Feature Extraction and Transformation - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md 
b/docs/mllib-frequent-pattern-mining.md
index a7b55dc..93e3f0b 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Frequent Pattern Mining - spark.mllib
-displayTitle: Frequent Pattern Mining - spark.mllib
+title: Frequent Pattern Mining - RDD-based API
+displayTitle: Frequent Pattern Mining - RDD-based API
 ---
 
 Mining frequent items, itemsets, subsequences, or other substructures is 
usually among the

