Repository: spark
Updated Branches:
  refs/heads/master 4a8339568 -> b2a7eedcd

[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow up for #14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2...@gmail.com>

Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b2a7eedc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b2a7eedc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b2a7eedc

Branch: refs/heads/master
Commit: b2a7eedcddf0e682ff46afd1b764d0b81ccdf395
Parents: 4a83395
Author: Shuai Lin <linshuai2...@gmail.com>
Authored: Wed Sep 28 06:12:48 2016 -0400
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Sep 28 06:12:48 2016 -0400

----------------------------------------------------------------------
 docs/ml-features.md              | 14 ++++++++++----
 docs/mllib-feature-extraction.md | 14 ++++++++++----
 2 files changed, 20 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index a39b31c..a7f710f 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1331,10 +1331,16 @@ for more details on the API.
 ## ChiSqSelector
 
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with
-categorical features. ChiSqSelector orders features based on a
-[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test)
-from the class, and then filters (selects) the top features which the class label depends on the
-most. This is akin to yielding the features with the most predictive power.
+categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 **Examples**
 

http://git-wip-us.apache.org/repos/asf/spark/blob/b2a7eedc/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 353d391..87e1e02 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -225,10 +225,16 @@ features for use in model construction. It reduces the size of the feature space
 both speed and statistical learning behavior.
 
 [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
-Chi-Squared feature selection. It operates on labeled data with categorical features.
-`ChiSqSelector` orders features based on a Chi-Squared test of independence from the class,
-and then filters (selects) the top features which the class label depends on the most.
-This is akin to yielding the features with the most predictive power.
+Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
+[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
+features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`:
+
+* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
+* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number.
+* `FPR` chooses all features whose false positive rate meets some threshold.
+
+By default, the selection method is `KBest`, the default number of top features is 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods.
 
 The number of features to select can be tuned using a held-out validation set.
 
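
For readers of this commit, the three selection methods named in the updated docs (`KBest`, `Percentile`, `FPR`) can be illustrated with a small plain-Python sketch. This is not Spark's implementation and does not call the Spark API. It assumes each feature already has a p-value from a chi-squared test, that "top" features are those with the smallest p-values, and that "meets some threshold" means a p-value below `alpha`. The function names and the sample p-values are hypothetical.

```python
from typing import List, Sequence


def select_k_best(p_values: Sequence[float], k: int = 50) -> List[int]:
    """Keep the k features with the smallest p-values (assumed ranking rule)."""
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    # Return the chosen feature indices in their original order.
    return sorted(ranked[:k])


def select_percentile(p_values: Sequence[float], percentile: float) -> List[int]:
    """Keep a fraction of all features, ranked the same way as select_k_best."""
    k = int(len(p_values) * percentile)
    return select_k_best(p_values, k)


def select_fpr(p_values: Sequence[float], alpha: float) -> List[int]:
    """Keep every feature whose p-value is below alpha (assumed threshold rule)."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Made-up p-values for five features, purely for illustration.
p_values = [0.30, 0.01, 0.20, 0.04, 0.75]

k_best = select_k_best(p_values, k=2)
top_half = select_percentile(p_values, percentile=0.5)
below_alpha = select_fpr(p_values, alpha=0.25)
```

Note the difference in behavior: `select_k_best` and `select_percentile` always return a fixed count of features, while `select_fpr` may return any number of features, including none, depending on how many fall below the threshold. The default `k=50` mirrors the default number of top features stated in the docs.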