Repository: spark
Updated Branches:
  refs/heads/master 71ad945bb -> 5ffd5d383
http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 17fd3e1..30112c7 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -1,32 +1,12 @@
 ---
 layout: global
-title: MLlib
-displayTitle: Machine Learning Library (MLlib) Guide
-description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
+title: "MLlib: RDD-based API"
+displayTitle: "MLlib: RDD-based API"
 ---
 
-MLlib is Spark's machine learning (ML) library.
-Its goal is to make practical machine learning scalable and easy.
-It consists of common learning algorithms and utilities, including classification, regression,
-clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization
-primitives and higher-level pipeline APIs.
-
-It divides into two packages:
-
-* [`spark.mllib`](mllib-guide.html#data-types-algorithms-and-utilities) contains the original API
-  built on top of [RDDs](programming-guide.html#resilient-distributed-datasets-rdds).
-* [`spark.ml`](ml-guide.html) provides a higher-level API
-  built on top of [DataFrames](sql-programming-guide.html#dataframes) for constructing ML pipelines.
-
-Using `spark.ml` is recommended because the DataFrame-based API is more versatile and flexible.
-But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
-Users should be comfortable using `spark.mllib` features and can expect more features to come.
-Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
-e.g., feature extractors and transformers.
-
-We list major functionality from both below, with links to detailed guides.
-
-# spark.mllib: data types, algorithms, and utilities
+This page documents sections of the MLlib guide for the RDD-based API (the `spark.mllib` package).
+Please see the [MLlib Main Guide](ml-guide.html) for the DataFrame-based API (the `spark.ml` package),
+which is now the primary API for MLlib.
 
 * [Data types](mllib-data-types.html)
 * [Basic statistics](mllib-statistics.html)
@@ -65,192 +45,3 @@ We list major functionality from both below, with links to detailed guides.
 * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd)
 * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
 
-# spark.ml: high-level APIs for ML pipelines
-
-* [Overview: estimators, transformers and pipelines](ml-guide.html)
-* [Extracting, transforming and selecting features](ml-features.html)
-* [Classification and regression](ml-classification-regression.html)
-* [Clustering](ml-clustering.html)
-* [Collaborative filtering](ml-collaborative-filtering.html)
-* [Advanced topics](ml-advanced.html)
-
-Some techniques are not yet available in spark.ml, most notably dimensionality reduction.
-Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.
-
-# Dependencies
-
-MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
-[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
-If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
-implementation will be used instead.
-
-Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
-proxies by default.
-To configure `netlib-java` / Breeze to use system optimised binaries, include
-`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
-project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
-platform's additional installation instructions.
-
-To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
-
-[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
-      watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-
-# Migration guide
-
-MLlib is under active development.
-The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
-and the migration guide below explains all changes between releases.
-
-## From 1.6 to 2.0
-
-### Breaking changes
-
-There were several breaking changes in Spark 2.0, which are outlined below.
-
-**Linear algebra classes for DataFrame-based APIs**
-
-Spark's linear algebra dependencies were moved to a new project, `mllib-local`
-(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)).
-As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`.
-The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes,
-leading to a few breaking changes, predominantly in various model classes
-(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list).
-
-**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`.
-
-_Converting vectors and matrices_
-
-While most pipeline components support backward compatibility for loading,
-some existing `DataFrames` and pipelines from Spark versions prior to 2.0 that contain vector or matrix
-columns may need to be migrated to the new `spark.ml` vector and matrix types.
-Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types
-(and vice versa) can be found in `spark.mllib.util.MLUtils`.
-
-There are also utility methods available for converting single instances of
-vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / `mllib.linalg.Matrix`
-to convert to `ml.linalg` types, and
-`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML`
-to convert to `mllib.linalg` types.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-{% highlight scala %}
-import org.apache.spark.mllib.util.MLUtils
-
-// convert DataFrame columns
-val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-// convert a single vector or matrix
-val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
-val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
-{% endhighlight %}
-
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
-</div>
-
-<div data-lang="java" markdown="1">
-
-{% highlight java %}
-import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.sql.Dataset;
-import org.apache.spark.sql.Row;
-
-// convert DataFrame columns
-Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
-Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
-// convert a single vector or matrix
-org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
-org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
-{% endhighlight %}
-
-Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
-</div>
-
-<div data-lang="python" markdown="1">
-
-{% highlight python %}
-from pyspark.mllib.util import MLUtils
-
-# convert DataFrame columns
-convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-# convert a single vector or matrix
-mlVec = mllibVec.asML()
-mlMat = mllibMat.asML()
-{% endhighlight %}
-
-Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail.
-</div>
-</div>
-
-**Deprecated methods removed**
-
-Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages:
-
-* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
-* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
-* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`)
-* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
-* `defaultStategy` in `mllib.tree.configuration.Strategy`
-* `build` in `mllib.tree.Node`
-* LIBSVM loaders for multiclass classification and the load/save `labeledData` methods in `mllib.util.MLUtils`
-
-A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
-
-### Deprecations and changes of behavior
-
-**Deprecations**
-
-Deprecations in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
-  In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
-* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
-  In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
-  the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method.
-* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
-  In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
-  All functionality in overridden methods has been moved to the corresponding `transformSchema`.
-* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
-  In the `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
-  We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression` instead.
-* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
-  In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
-* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
-  In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`.
-* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was never used by `ChiSqSelectorModel`.
-
-**Changes of behavior**
-
-Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
-
-* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
-  `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification.
-  This introduces the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
-    * The intercept will not be regularized when training a binary classification model with an L1/L2 `Updater`.
-    * If regularization is disabled, training with or without feature scaling will converge to the same solution at the same rate.
-* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
-  In order to provide better and more consistent results with `spark.ml.classification.LogisticRegression`,
-  the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6.
-* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
-  Fixed a bug in `PowerIterationClustering` which will likely change its results.
-* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
-  `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
-* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
-  `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
-* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
-  `HashingTF` now uses `MurmurHash3` as its default hash algorithm in both `spark.ml` and `spark.mllib`.
-* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
-  The `expectedType` argument for PySpark `Param` was removed.
-* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
-  Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
-* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
-  `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously it used custom sampling logic).
-  The output buckets will differ for the same input data and params.
-
-## Previous Spark versions
-
-Earlier migration guides are archived [on this page](mllib-migration-guides.html).
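The conversion tabs in the diff above show only the `asML` direction. As a companion, here is a minimal PySpark sketch of the reverse path described in the removed prose; it assumes Spark 2.0's `Vectors.fromML` and `MLUtils.convertVectorColumnsFromML` helpers, and `mlVecDF` is a hypothetical DataFrame with `spark.ml` vector columns:

{% highlight python %}
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.util import MLUtils

# a spark.mllib (RDD-based API) vector
mllibVec = MLlibVectors.dense([1.0, 2.0, 3.0])

# convert to the new spark.ml type, then back to the spark.mllib type
mlVec = mllibVec.asML()
roundTripped = MLlibVectors.fromML(mlVec)

# DataFrame columns can also be converted back from spark.ml types:
# convertedDF = MLUtils.convertVectorColumnsFromML(mlVecDF)
{% endhighlight %}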
-

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-isotonic-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md
index 8ede440..d90905a 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Isotonic regression - spark.mllib
-displayTitle: Regression - spark.mllib
+title: Isotonic regression - RDD-based API
+displayTitle: Regression - RDD-based API
 ---
 
 ## Isotonic regression

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 17d781a..6fcd3ae 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Linear Methods - spark.mllib
-displayTitle: Linear Methods - spark.mllib
+title: Linear Methods - RDD-based API
+displayTitle: Linear Methods - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index 970c669..ea6f93f 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -1,159 +1,9 @@
 ---
 layout: global
-title: Old Migration Guides - spark.mllib
-displayTitle: Old Migration Guides - spark.mllib
-description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
+title: Old Migration Guides - MLlib
+displayTitle: Old Migration Guides - MLlib
 ---
 
-The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
-
-## From 1.5 to 1.6
-
-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
-
-Deprecations:
-
-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
-  In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
-  In `spark.ml.classification.LogisticRegressionModel` and
-  `spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
-  the new name `coefficients`. This helps disambiguate it from the instance (row) "weights" given to
-  algorithms.
-
-Changes of behavior:
-
-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
-  `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
-  Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
-  `GradientDescent`'s `convergenceTol`: for large errors, it uses relative error (relative to the
-  previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
-  `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
-  tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
-  behavior of the simpler `Tokenizer` transformer.
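To make the `RegexTokenizer` change above concrete, a minimal PySpark sketch, assuming the `toLowercase` param exposed by `pyspark.ml.feature.RegexTokenizer` (the column names here are hypothetical):

{% highlight python %}
from pyspark.ml.feature import RegexTokenizer

# Post-1.6 default: input strings are lowercased before tokenizing.
# Setting toLowercase=False restores the pre-1.6 behavior.
tokenizer = RegexTokenizer(inputCol="text", outputCol="words", toLowercase=False)
# tokenized = tokenizer.transform(df)  # df: a DataFrame with a "text" column
{% endhighlight %}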
-
-## From 1.4 to 1.5
-
-In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
-
-* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
-  `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
-* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
-  sorted.
-* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
-  convergence tolerance of `1e-3`, and hence iterations might end earlier than in 1.4.
-
-In the `spark.ml` package, there is one breaking API change and one behavior change:
-
-* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
-  from `Params.setDefault` due to a
-  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
-* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
-  added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
-
-## From 1.3 to 1.4
-
-In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
-
-* Gradient-Boosted Trees
-  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
-  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
-* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. An object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
-
-In the `spark.ml` package, several major API changes occurred, including:
-
-* `Param` and other APIs for specifying parameters
-* `uid` unique IDs for Pipeline components
-* Reorganization of certain classes
-
-Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
-However, since 1.4 `spark.ml` is no longer an alpha component, and we will provide details on any API
-changes for future releases.
-
-## From 1.2 to 1.3
-
-In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
-
-* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
-* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
-* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
-  * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
-  * The variable `model` is no longer public.
-* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
-  * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
-  * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
-* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. It was never meant for external use.
-* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
-  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2.
-
-In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
-
-* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
-* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
-* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
-
-Other changes were in `LogisticRegression`:
-
-* The `scoreCol` output column (with default value "score") was renamed to `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
-* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
-
-## From 1.1 to 1.2
-
-The only API changes in MLlib v1.2 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.2:
-
-1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
-   of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
-   `numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
-   This `numClasses` parameter is specified either via
-   [`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-   or via the [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
-   static `trainClassifier` and `trainRegressor` methods.
-
-2. *(Breaking change)* The API for
-   [`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
-   This should generally not affect user code, unless the user manually constructs decision trees
-   (instead of using the `trainClassifier` or `trainRegressor` methods).
-   The tree `Node` now includes more information, including the probability of the predicted label
-   (for classification).
-
-3. Printing methods' output has changed. The `toString` (Scala/Java) and `__repr__` (Python) methods used to print the full model; they now print a summary. For the full model, use `toDebugString`.
-
-Examples in the Spark distribution and examples in the
-[Decision Trees Guide](mllib-decision-tree.html#examples) have been updated accordingly.
-
-## From 1.0 to 1.1
-
-The only API changes in MLlib v1.1 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-which continues to be an experimental API in MLlib 1.1:
-
-1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
-   the implementations of trees in
-   [scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree)
-   and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
-   In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
-   In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
-   This depth is specified by the `maxDepth` parameter in
-   [`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
-   or via the [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
-   static `trainClassifier` and `trainRegressor` methods.
-
-2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
-   methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
-   rather than using the old parameter class `Strategy`. These new training methods explicitly
-   separate classification and regression, and they replace specialized parameter types with
-   simple `String` types.
-
-Examples of the new, recommended `trainClassifier` and `trainRegressor` are given in the
-[Decision Trees Guide](mllib-decision-tree.html#examples).
-
-## From 0.9 to 1.0
-
-In MLlib v1.0, we support both dense and sparse input in a unified way, which introduces a few
-breaking changes. If your data is sparse, please store it in a sparse format instead of dense to
-take advantage of sparsity in both storage and computation. Details are described below.
+The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
+Past migration guides are now stored at [ml-migration-guides.html](ml-migration-guides.html).
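For the `trainClassifier` entry point recommended in the removed 1.1-to-1.2 notes above, a minimal PySpark sketch, assuming an active `SparkContext` named `sc` and the sample LIBSVM file shipped with the Spark distribution:

{% highlight python %}
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.util import MLUtils

# Load training data in LIBSVM format.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

# numClasses is the argument name unified across Scala and Python in 1.2.
model = DecisionTree.trainClassifier(data, numClasses=2, categoricalFeaturesInfo={},
                                     impurity="gini", maxDepth=5, maxBins=32)

# Since 1.2, str(model) prints only a summary; toDebugString() prints the full tree.
print(model.toDebugString())
{% endhighlight %}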
http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-naive-bayes.md
----------------------------------------------------------------------
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index d0d594a..7471d18 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Naive Bayes - spark.mllib
-displayTitle: Naive Bayes - spark.mllib
+title: Naive Bayes - RDD-based API
+displayTitle: Naive Bayes - RDD-based API
 ---
 
 [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-optimization.md
----------------------------------------------------------------------
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index f90b66f..eefd7dc 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Optimization - spark.mllib
-displayTitle: Optimization - spark.mllib
+title: Optimization - RDD-based API
+displayTitle: Optimization - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-pmml-model-export.md
----------------------------------------------------------------------
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 7f2347d..d353090 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: PMML model export - spark.mllib
-displayTitle: PMML model export - spark.mllib
+title: PMML model export - RDD-based API
+displayTitle: PMML model export - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index 329855e..12797bd 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Basic Statistics - spark.mllib
-displayTitle: Basic Statistics - spark.mllib
+title: Basic Statistics - RDD-based API
+displayTitle: Basic Statistics - RDD-based API
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2bc4912..888c12f 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1571,7 +1571,7 @@ have changed from returning (key, list of values) pairs to (key, iterable of val
 </div>
 
 Migration guides are also available for [Spark Streaming](streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x),
-[MLlib](mllib-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
+[MLlib](ml-guide.html#migration-guide) and [GraphX](graphx-programming-guide.html#migrating-from-spark-091).
 
 # Where to Go from Here

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 2ee3b80..de82a06 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -15,7 +15,7 @@ like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex
 algorithms expressed with high-level functions like `map`, `reduce`, `join` and `window`.
 Finally, processed data can be pushed out to filesystems, databases, and
 live dashboards. In fact, you can apply Spark's
-[machine learning](mllib-guide.html) and
+[machine learning](ml-guide.html) and
 [graph processing](graphx-programming-guide.html) algorithms on data streams.
 
 <p style="text-align: center;">
@@ -1673,7 +1673,7 @@ See the [DataFrames and SQL](sql-programming-guide.html) guide to learn more abo
 ***
 
 ## MLlib Operations
-You can also easily use machine learning algorithms provided by [MLlib](mllib-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the [MLlib](mllib-guide.html) guide for more details.
+You can also easily use machine learning algorithms provided by [MLlib](ml-guide.html). First of all, there are streaming machine learning algorithms (e.g. [Streaming Linear Regression](mllib-linear-methods.html#streaming-linear-regression), [Streaming KMeans](mllib-clustering.html#streaming-k-means), etc.) which can simultaneously learn from the streaming data as well as apply the model on the streaming data. Beyond these, for a much larger class of machine learning algorithms, you can learn a learning model offline (i.e. using historical data) and then apply the model online on streaming data. See the [MLlib](ml-guide.html) guide for more details.
 
 ***

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/ml/__init__.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/__init__.py b/python/pyspark/ml/__init__.py
index 05f3be5..1d42d49 100644
--- a/python/pyspark/ml/__init__.py
+++ b/python/pyspark/ml/__init__.py
@@ -16,8 +16,8 @@
 #
 
 """
-Spark ML is a component that adds a new set of machine learning APIs to let users quickly
-assemble and configure practical machine learning pipelines.
+DataFrame-based machine learning APIs to let users quickly assemble and configure practical
+machine learning pipelines.
 """
 from pyspark.ml.base import Estimator, Model, Transformer
 from pyspark.ml.pipeline import Pipeline, PipelineModel

http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/ml/tests.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py
index 24efce8..4bcb2c4 100755
--- a/python/pyspark/ml/tests.py
+++ b/python/pyspark/ml/tests.py
@@ -16,7 +16,7 @@
 #
 
 """
-Unit tests for Spark ML Python APIs.
+Unit tests for MLlib Python DataFrame-based APIs.
""" import sys if sys.version > '3': http://git-wip-us.apache.org/repos/asf/spark/blob/5ffd5d38/python/pyspark/mllib/__init__.py ---------------------------------------------------------------------- diff --git a/python/pyspark/mllib/__init__.py b/python/pyspark/mllib/__init__.py index acba3a7..ae26521 100644 --- a/python/pyspark/mllib/__init__.py +++ b/python/pyspark/mllib/__init__.py @@ -16,7 +16,10 @@ # """ -Python bindings for MLlib. +RDD-based machine learning APIs for Python (in maintenance mode). + +The `pyspark.mllib` package is in maintenance mode as of the Spark 2.0.0 release to encourage +migration to the DataFrame-based APIs under the `pyspark.ml` package. """ from __future__ import absolute_import --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org