Repository: spark
Updated Branches:
  refs/heads/master 28ccf5ee7 -> 59536cc87


[SPARK-5912] [docs] [mllib] Small fixes to ChiSqSelector docs

Fixes:
* typo in Scala example
* Removed comment "usually applied on sparse data" since that is debatable
* small edits to text for clarity

CC: avulanov  I noticed a typo post-hoc and ended up making a few small edits.  
Do the changes look OK?

Author: Joseph K. Bradley <jos...@databricks.com>

Closes #4732 from jkbradley/chisqselector-docs and squashes the following 
commits:

9656a3b [Joseph K. Bradley] added Java example for ChiSqSelector to guide
3f3f9f4 [Joseph K. Bradley] small fixes to ChiSqSelector docs


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/59536cc8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/59536cc8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/59536cc8

Branch: refs/heads/master
Commit: 59536cc87e10e5011560556729dd901280958f43
Parents: 28ccf5e
Author: Joseph K. Bradley <jos...@databricks.com>
Authored: Mon Feb 23 16:15:57 2015 -0800
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Feb 23 16:15:57 2015 -0800

----------------------------------------------------------------------
 docs/mllib-feature-extraction.md | 72 +++++++++++++++++++++++++++++------
 1 file changed, 60 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/59536cc8/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index d588b9c..80842b2 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -377,27 +377,27 @@ data2 = labels.zip(normalizer2.transform(features))
 </div>
 
 ## Feature selection
-[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows 
selecting the most relevant features for use in model construction. The number 
of features to select can be determined using the validation set. Feature 
selection is usually applied on sparse data, for example in text 
classification. Feature selection reduces the size of the vector space and, in 
turn, the complexity of any subsequent operation with vectors. 
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows 
selecting the most relevant features for use in model construction. Feature 
selection reduces the size of the vector space and, in turn, the complexity of 
any subsequent operation with vectors. The number of features to select can be 
tuned using a held-out validation set.
 
 ### ChiSqSelector
-ChiSqSelector stands for Chi-Squared feature selection. It operates on the 
labeled data. ChiSqSelector orders categorical features based on their values 
of Chi-Squared test on independence from class and filters (selects) top given 
features.  
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 stands for Chi-Squared feature selection. It operates on labeled data with 
categorical features. `ChiSqSelector` orders features based on a Chi-Squared 
test of independence from the class, and then filters (selects) the top 
features which are most closely related to the label.
 
 #### Model Fitting
 
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 has the
 following parameters in the constructor:
 
-* `numTopFeatures` number of top features that selector will select (filter).
+* `numTopFeatures` number of top features that the selector will select 
(filter).
 
 We provide a 
[`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) 
method in
 `ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with 
categorical features, learn the summary statistics, and then
-return a model which can transform the input dataset into the reduced feature 
space.
+return a `ChiSqSelectorModel` which can transform an input dataset into the 
reduced feature space.
 
 This model implements 
[`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
 which can apply the Chi-Squared feature selection on a `Vector` to produce a 
reduced `Vector` or on
 an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
 
-Note that the model that performs actual feature filtering can be instantiated 
independently with array of feature indices that has to be sorted ascending.
+Note that the user can also construct a `ChiSqSelectorModel` by hand by 
providing an array of selected feature indices (which must be sorted in 
ascending order).
 
 #### Example
 
@@ -411,21 +411,69 @@ import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.util.MLUtils
 
-// load some data in libsvm format, each point is in the range 0..255
+// Load some data in libsvm format
 val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
-// discretize data in 16 equal bins
+// Discretize data in 16 equal bins since ChiSqSelector requires categorical 
features
 val discretizedData = data.map { lp =>
   LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } 
) )
 }
-// create ChiSqSelector that will select 50 features
+// Create ChiSqSelector that will select 50 features
 val selector = new ChiSqSelector(50)
-// create ChiSqSelector model
-val transformer = selector.fit(disctetizedData)
-// filter top 50 features from each feature vector
-val filteredData = disctetizedData.map { lp => 
+// Create ChiSqSelector model (selecting features)
+val transformer = selector.fit(discretizedData)
+// Filter the top 50 features from each feature vector
+val filteredData = discretizedData.map { lp => 
   LabeledPoint(lp.label, transformer.transform(lp.features)) 
 }
 {% endhighlight %}
 </div>
+
+<div data-lang="java">
+{% highlight java %}
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.mllib.feature.ChiSqSelector;
+import org.apache.spark.mllib.feature.ChiSqSelectorModel;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import org.apache.spark.mllib.util.MLUtils;
+
+SparkConf sparkConf = new SparkConf().setAppName("JavaChiSqSelector");
+JavaSparkContext sc = new JavaSparkContext(sparkConf);
+JavaRDD<LabeledPoint> points = MLUtils.loadLibSVMFile(sc.sc(),
+    "data/mllib/sample_libsvm_data.txt").toJavaRDD().cache();
+
+// Discretize data in 16 equal bins since ChiSqSelector requires categorical 
features
+JavaRDD<LabeledPoint> discretizedData = points.map(
+    new Function<LabeledPoint, LabeledPoint>() {
+      @Override
+      public LabeledPoint call(LabeledPoint lp) {
+        final double[] discretizedFeatures = new double[lp.features().size()];
+        for (int i = 0; i < lp.features().size(); ++i) {
+          discretizedFeatures[i] = lp.features().apply(i) / 16;
+        }
+        return new LabeledPoint(lp.label(), 
Vectors.dense(discretizedFeatures));
+      }
+    });
+
+// Create ChiSqSelector that will select 50 features
+ChiSqSelector selector = new ChiSqSelector(50);
+// Create ChiSqSelector model (selecting features)
+final ChiSqSelectorModel transformer = selector.fit(discretizedData.rdd());
+// Filter the top 50 features from each feature vector
+JavaRDD<LabeledPoint> filteredData = discretizedData.map(
+    new Function<LabeledPoint, LabeledPoint>() {
+      @Override
+      public LabeledPoint call(LabeledPoint lp) {
+        return new LabeledPoint(lp.label(), 
transformer.transform(lp.features()));
+      }
+    }
+);
+
+sc.stop();
+{% endhighlight %}
+</div>
 </div>
 


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to