spark git commit: [SPARK-20506][DOCS] 2.2 migration guide

mlnick Fri, 19 May 2017 11:50:25 -0700

Repository: spark
Updated Branches:
  refs/heads/master 7f203a248 -> b5d8d9ba1



[SPARK-20506][DOCS] 2.2 migration guide

Update ML guide for migration `2.1` -> `2.2` and the previous version migration 
guide section.

## How was this patch tested?

Build doc locally.

Author: Nick Pentreath <ni...@za.ibm.com>

Closes #17996 from MLnick/SPARK-20506-2.2-migration-guide.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b5d8d9ba
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b5d8d9ba
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b5d8d9ba

Branch: refs/heads/master
Commit: b5d8d9ba17d62167cfbacd5f6188a8b4a5b8a2be
Parents: 7f203a2
Author: Nick Pentreath <ni...@za.ibm.com>
Authored: Fri May 19 20:51:56 2017 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Fri May 19 20:51:56 2017 +0200

----------------------------------------------------------------------
 docs/ml-guide.md            | 56 ++++++++++++++++++++++++++--------------
 docs/ml-migration-guides.md | 29 +++++++++++++++++++++
 2 files changed, 66 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/b5d8d9ba/docs/ml-guide.md
----------------------------------------------------------------------
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 9717619..362e883 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -26,7 +26,7 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
 * MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
 * MLlib will not add new features to the RDD-based API.
 * In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
-* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
+* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
 * The RDD-based API is expected to be removed in Spark 3.0.
 
 *Why is MLlib switching to the DataFrame-based API?*
@@ -66,41 +66,59 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
     watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
 
+# Highlights in 2.2
+
+The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
+release of Spark:
+
+* `ALS` methods for _top-k_ recommendations for all users or items, matching the functionality
+ in `mllib` ([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)). Performance
+ was also improved for both `ml` and `mllib`
+ ([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
+ [SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
+* `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
+ ([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
+ [SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
+* `FPGrowth` algorithm for frequent pattern mining
+ ([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
+* `GLM` now supports the full `Tweedie` family
+ ([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
+* `Imputer` feature transformer to impute missing values in a dataset
+ ([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
+* `LinearSVC` for linear Support Vector Machine classification
+ ([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
+* Logistic regression now supports constraints on the coefficients during training
+ ([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
+
 # Migration guide
 
 MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.
 
-## From 2.0 to 2.1
+## From 2.1 to 2.2
 
 ### Breaking changes
- 
-**Deprecated methods removed**
 
-* `setLabelCol` in `feature.ChiSqSelectorModel`
-* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
-* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
-* `model` in `regression.LinearRegressionSummary`
-* `validateParams` in `PipelineStage`
-* `validateParams` in `Evaluator`
+There are no breaking changes.
 
 ### Deprecations and changes of behavior
 
 **Deprecations**
 
-* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
-  Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+There are no deprecations.
 
 **Changes of behavior**
 
-* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
- Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
-* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
- `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
-* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
- `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
-
+* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
+ Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
+ **Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
+* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
+ Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
+* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
+ `StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
+ would always be thrown regardless of the setting of the `handleInvalid` parameter.
+  
 ## Previous Spark versions
 
 Earlier migration guides are archived [on this page](ml-migration-guides.html).

http://git-wip-us.apache.org/repos/asf/spark/blob/b5d8d9ba/docs/ml-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/ml-migration-guides.md b/docs/ml-migration-guides.md
index 58c3747..687d7c8 100644
--- a/docs/ml-migration-guides.md
+++ b/docs/ml-migration-guides.md
@@ -7,6 +7,35 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).
 
+## From 2.0 to 2.1
+
+### Breaking changes
+ 
+**Deprecated methods removed**
+
+* `setLabelCol` in `feature.ChiSqSelectorModel`
+* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
+* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
+* `model` in `regression.LinearRegressionSummary`
+* `validateParams` in `PipelineStage`
+* `validateParams` in `Evaluator`
+
+### Deprecations and changes of behavior
+
+**Deprecations**
+
+* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
+  Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
+
+**Changes of behavior**
+
+* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
+ Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
+* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
+ `KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
+* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
+ `KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
+
 ## From 1.6 to 2.0
 
 ### Breaking changes


