spark git commit: [SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector

srowen Wed, 28 Sep 2016 03:13:41 -0700

Repository: spark
Updated Branches:
  refs/heads/master 4a8339568 -> b2a7eedcd



[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow-up to #14597 that updates the feature selection docs for ChiSqSelector.

## How was this patch tested?

Generated the HTML docs. They can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2...@gmail.com>

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2a7eedc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2a7eedc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2a7eedc

Branch: refs/heads/master
Commit: b2a7eedcddf0e682ff46afd1b764d0b81ccdf395
Parents: 4a83395
Author: Shuai Lin <linshuai2...@gmail.com>
Authored: Wed Sep 28 06:12:48 2016 -0400
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Sep 28 06:12:48 2016 -0400

----------------------------------------------------------------------
 docs/ml-features.md              | 14 ++++++++++----
 docs/mllib-feature-extraction.md | 14 ++++++++++----
 2 files changed, 20 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a39b31c..a7f710f 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 **Examples**
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 353d391..87e1e02 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
 both speed and statistical learning behavior.
 
 [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 The number of features to select can be tuned using a held-out validation set.
 

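For readers skimming this commit, the three selection methods described in the updated docs can be illustrated with a minimal pure-Python sketch. It is not Spark's implementation and does not use the Spark API. It assumes the chi-squared p-value of each feature has already been computed, that "top" features are those with the smallest p-values, and that `FPR` keeps features whose p-value is below a threshold `alpha`. The function name `select_features`, its parameters, and the default values for `percentile` and `alpha` are hypothetical; only the default of 50 top features comes from the docs above.

```python
def select_features(p_values, method="kbest", k=50, percentile=0.1, alpha=0.05):
    """Return the sorted indices of the selected features.

    p_values[i] is assumed to be the chi-squared test p-value of feature i;
    a smaller p-value means stronger evidence that the label depends on it.
    """
    # Rank feature indices from smallest to largest p-value (ties keep index order).
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])

    if method == "kbest":
        # Keep a fixed number of top-ranked features.
        chosen = ranked[:k]
    elif method == "percentile":
        # Keep a fraction of all features instead of a fixed number.
        chosen = ranked[:int(len(p_values) * percentile)]
    elif method == "fpr":
        # Keep every feature whose p-value is below the threshold.
        chosen = [i for i in ranked if p_values[i] < alpha]
    else:
        raise ValueError("unknown selection method: " + method)

    return sorted(chosen)


# Hypothetical p-values for five features.
p_values = [0.20, 0.01, 0.50, 0.03, 0.04]
print(select_features(p_values, method="kbest", k=2))                # [1, 3]
print(select_features(p_values, method="percentile", percentile=0.4))  # [1, 3]
print(select_features(p_values, method="fpr", alpha=0.05))           # [1, 3, 4]
```

With these made-up p-values, `kbest` and `percentile` return a fixed-size result regardless of significance, while `fpr` returns however many features pass the threshold, which is the practical difference between the methods.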
