This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 194ac3b  [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
194ac3b is described below

commit 194ac3be8bd8ca1b5e463074ed61420f185e8caf
Author: Huaxin Gao <huax...@us.ibm.com>
AuthorDate: Fri May 15 09:59:14 2020 -0500

[SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector

### What changes were proposed in this pull request?
Add docs and examples for ANOVASelector and FValueSelector

### Why are the changes needed?
Complete the implementation of ANOVASelector and FValueSelector

### Does this PR introduce _any_ user-facing change?
Yes

<img width="850" alt="Screen Shot 2020-05-13 at 5 17 44 PM" src="https://user-images.githubusercontent.com/13592258/81878703-b4f94480-953d-11ea-9166-da3c64852b90.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 05 15 PM" src="https://user-images.githubusercontent.com/13592258/81878600-6055c980-953d-11ea-8b24-09c31647139b.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 06 06 PM" src="https://user-images.githubusercontent.com/13592258/81878603-621f8d00-953d-11ea-9447-39913ccc067d.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 06 21 PM" src="https://user-images.githubusercontent.com/13592258/81878606-65b31400-953d-11ea-9d76-51859266d1a8.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 10 PM" src="https://user-images.githubusercontent.com/13592258/81878611-69df3180-953d-11ea-8618-23a2a6cfd730.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 33 PM" src="https://user-images.githubusercontent.com/13592258/81878620-6cda2200-953d-11ea-9c46-da763328364e.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 47 PM" src="https://user-images.githubusercontent.com/13592258/81878625-6f3c7c00-953d-11ea-9d11-2281b33a0bd8.png">
<img width="851" alt="Screen Shot 2020-05-13 at 5 19 35 PM" src="https://user-images.githubusercontent.com/13592258/81878882-13bebe00-953e-11ea-9776-288bac97d93f.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 08 42 PM" src="https://user-images.githubusercontent.com/13592258/81878637-76638a00-953d-11ea-94b0-dc9bc85ae2b7.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 09 01 PM" src="https://user-images.githubusercontent.com/13592258/81878640-79f71100-953d-11ea-9a66-b27f9482fbd3.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 09 50 PM" src="https://user-images.githubusercontent.com/13592258/81878644-7cf20180-953d-11ea-9142-9658c8e90986.png">
<img width="851" alt="Screen Shot 2020-05-13 at 5 10 06 PM" src="https://user-images.githubusercontent.com/13592258/81878653-81b6b580-953d-11ea-9dc2-8015095cf569.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 10 59 PM" src="https://user-images.githubusercontent.com/13592258/81878658-854a3c80-953d-11ea-8dc9-217aa749fd00.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 11 27 PM" src="https://user-images.githubusercontent.com/13592258/81878659-87ac9680-953d-11ea-8c6b-74ab76748e4a.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 14 54 PM" src="https://user-images.githubusercontent.com/13592258/81878664-8b401d80-953d-11ea-9ee1-05f6677e263c.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 15 17 PM" src="https://user-images.githubusercontent.com/13592258/81878669-8da27780-953d-11ea-8216-77eb8bb7e091.png">

### How was this patch tested?
Manually build and check

Closes #28524 from huaxingao/examples.
Authored-by: Huaxin Gao <huax...@us.ibm.com> Signed-off-by: Sean Owen <sro...@gmail.com> --- docs/ml-features.md | 140 +++++++++++++++++++++ docs/ml-statistics.md | 56 ++++++++- ...tExample.java => JavaANOVASelectorExample.java} | 48 +++---- .../spark/examples/ml/JavaANOVATestExample.java | 2 +- ...Example.java => JavaFValueSelectorExample.java} | 48 +++---- .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...a_test_example.py => anova_selector_example.py} | 35 +++--- examples/src/main/python/ml/anova_test_example.py | 2 +- ..._test_example.py => fvalue_selector_example.py} | 35 +++--- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...estExample.scala => ANOVASelectorExample.scala} | 42 ++++--- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...stExample.scala => FValueSelectorExample.scala} | 42 ++++--- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 340 insertions(+), 116 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 65b60be..660c272 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1793,6 +1793,146 @@ for more details on the API. </div> </div> +## ANOVASelector + +`ANOVASelector` operates on categorical labels with continuous features. It uses the +[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which +features to choose. +It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to ANOVA F-test. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. 
+* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. +The user can choose a selection method using `setSelectorType`. + +**Examples** + +Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as +our target to be predicted: + +~~~ +id | features | label +---|--------------------------------|--------- + 1 | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0 + 2 | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0 + 3 | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0 + 4 | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0 + 5 | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0 + 6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0 +~~~ + +If we use `ANOVASelector` with `numTopFeatures = 1`, the +last column in our `features` is chosen as the most useful feature: + +~~~ +id | features | label | selectedFeatures +---|--------------------------------|---------|------------------ + 1 | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0 | [2.3] + 2 | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0 | [4.1] + 3 | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0 | [2.5] + 4 | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0 | [3.8] + 5 | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0 | [3.0] + 6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0 | [2.1] +~~~ + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +Refer to the [ANOVASelector Scala docs](api/scala/org/apache/spark/ml/feature/ANOVASelector.html) +for more details on the API. 
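As a sanity check on the example above, the ranking behind `numTopFeatures` can be reproduced outside Spark. The sketch below (plain Python, not Spark's implementation) computes the one-way ANOVA F-statistic for each of the six features in the sample data; the last feature gets the largest F-value, which is why it is the one selected.

```python
from collections import defaultdict

# Sample data from the ANOVASelector example: (label, feature vector)
rows = [
    (3.0, [1.7, 4.4, 7.6, 5.8, 9.6, 2.3]),
    (2.0, [8.8, 7.3, 5.7, 7.3, 2.2, 4.1]),
    (3.0, [1.2, 9.5, 2.5, 3.1, 8.7, 2.5]),
    (2.0, [3.7, 9.2, 6.1, 4.1, 7.5, 3.8]),
    (4.0, [8.9, 5.2, 7.8, 8.3, 5.2, 3.0]),
    (4.0, [7.9, 8.5, 9.2, 4.0, 9.4, 2.1]),
]

def anova_f(values, labels):
    """One-way ANOVA F-statistic: between-group vs. within-group variance."""
    groups = defaultdict(list)
    for v, l in zip(values, labels):
        groups[l].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-group sum of squares around each group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [r[0] for r in rows]
f_values = [anova_f([r[1][j] for r in rows], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # -> 5, i.e. the last column, as selected above
```

With only two observations per label group, the between-group spread of the last feature dominates its within-group spread (F is roughly 9.33), ahead of every other feature.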
+ +{% include_example scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala %} +</div> + +<div data-lang="java" markdown="1"> + +Refer to the [ANOVASelector Java docs](api/java/org/apache/spark/ml/feature/ANOVASelector.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +Refer to the [ANOVASelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ANOVASelector) +for more details on the API. + +{% include_example python/ml/anova_selector_example.py %} +</div> +</div> + +## FValueSelector + +`FValueSelector` operates on continuous labels with continuous features. It uses the +[F-test for regression](https://en.wikipedia.org/wiki/F-test#Regression_problems) to decide which +features to choose. +It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to an F-test for regression. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. +The user can choose a selection method using `setSelectorType`.
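It may help to see where the per-feature F-values come from. A minimal plain-Python sketch (not Spark's implementation) derives each feature's regression F-value from its squared Pearson correlation r^2 with the label, using F = r^2 (n - 2) / (1 - r^2); on the sample DataFrame used in the example below, the 3rd feature scores highest.

```python
# Sample data from the FValueSelector example: (feature vector, label)
rows = [
    ([6.0, 7.0, 0.0, 7.0, 6.0, 0.0], 4.6),
    ([0.0, 9.0, 6.0, 0.0, 5.0, 9.0], 6.6),
    ([0.0, 9.0, 3.0, 0.0, 5.0, 5.0], 5.1),
    ([0.0, 9.0, 8.0, 5.0, 6.0, 4.0], 7.6),
    ([8.0, 9.0, 6.0, 5.0, 4.0, 4.0], 9.0),
    ([8.0, 9.0, 6.0, 4.0, 0.0, 0.0], 9.0),
]

def f_value(xs, ys):
    """Regression F-value of one feature from its correlation with the label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r2 = sxy * sxy / (sxx * syy)  # squared Pearson correlation
    return r2 * (n - 2) / (1 - r2)

ys = [r[1] for r in rows]
f_values = [f_value([r[0][j] for r in rows], ys) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # -> 2, i.e. the 3rd column, matching the example
```

The 3rd feature's F-value (roughly 6.37) exceeds all others, which matches the `selectedFeatures` column shown in the example table below.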
+ +**Examples** + +Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as +our target to be predicted: + +~~~ +id | features | label +---|--------------------------------|--------- + 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6 + 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6 + 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1 + 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6 + 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0 + 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0 +~~~ + +If we use `FValueSelector` with `numTopFeatures = 1`, the +3rd column in our `features` is chosen as the most useful feature: + +~~~ +id | features | label | selectedFeatures +---|--------------------------------|---------|------------------ + 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6 | [0.0] + 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6 | [6.0] + 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1 | [3.0] + 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6 | [8.0] + 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0 | [6.0] + 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0 | [6.0] +~~~ + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +Refer to the [FValueSelector Scala docs](api/scala/org/apache/spark/ml/feature/FValueSelector.html) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/FValueSelectorExample.scala %} +</div> + +<div data-lang="java" markdown="1"> + +Refer to the [FValueSelector Java docs](api/java/org/apache/spark/ml/feature/FValueSelector.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +Refer to the [FValueSelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FValueSelector) +for more details on the API. + +{% include_example python/ml/fvalue_selector_example.py %} +</div> +</div> + ## VarianceThresholdSelector + `VarianceThresholdSelector` is a selector that removes low-variance features.
Features with a diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md index a3d57ff..637cdd6 100644 --- a/docs/ml-statistics.md +++ b/docs/ml-statistics.md @@ -79,7 +79,35 @@ The output will be a DataFrame that contains the correlation matrix of the colum Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's -Chi-squared ( $\chi^2$) tests for independence. +Chi-squared ( $\chi^2$) tests for independence, as well as ANOVA test for classification tasks and +F-value test for regression tasks. + +### ANOVATest + +`ANOVATest` computes ANOVA F-values between labels and features for classification tasks. The labels should be categorical +and features should be continuous. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +Refer to the [`ANOVATest` Scala docs](api/scala/org/apache/spark/ml/stat/ANOVATest$.html) for details on the API. + +{% include_example scala/org/apache/spark/examples/ml/ANOVATestExample.scala %} +</div> + +<div data-lang="java" markdown="1"> +Refer to the [`ANOVATest` Java docs](api/java/org/apache/spark/ml/stat/ANOVATest.html) for details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaANOVATestExample.java %} +</div> + +<div data-lang="python" markdown="1"> +Refer to the [`ANOVATest` Python docs](api/python/index.html#pyspark.ml.stat.ANOVATest$) for details on the API. + +{% include_example python/ml/anova_test_example.py %} +</div> +</div> + +### ChiSquareTest `ChiSquareTest` conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which @@ -106,6 +134,32 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat </div> +### FValueTest + +`FValueTest` computes F-values between labels and features for regression tasks. 
Both the labels + and features should be continuous. + + <div class="codetabs"> + <div data-lang="scala" markdown="1"> + Refer to the [`FValueTest` Scala docs](api/scala/org/apache/spark/ml/stat/FValueTest$.html) for details on the API. + + {% include_example scala/org/apache/spark/examples/ml/FValueTestExample.scala %} + </div> + + <div data-lang="java" markdown="1"> + Refer to the [`FValueTest` Java docs](api/java/org/apache/spark/ml/stat/FValueTest.html) for details on the API. + + {% include_example java/org/apache/spark/examples/ml/JavaFValueTestExample.java %} + </div> + + <div data-lang="python" markdown="1"> + Refer to the [`FValueTest` Python docs](api/python/index.html#pyspark.ml.stat.FValueTest$) for details on the API. + + {% include_example python/ml/fvalue_test_example.py %} + </div> + + </div> + ## Summarizer We provide vector column summary statistics for `Dataframe` through `Summarizer`. diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java similarity index 60% copy from examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java copy to examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java index 3b2de1f..6f24b45 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java @@ -17,59 +17,65 @@ package org.apache.spark.examples.ml; +import org.apache.spark.sql.Dataset; import org.apache.spark.sql.SparkSession; // $example on$ import java.util.Arrays; import java.util.List; -import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.feature.ANOVASelector; import org.apache.spark.ml.linalg.VectorUDT; -import org.apache.spark.ml.stat.ANOVATest; -import org.apache.spark.sql.Dataset; +import org.apache.spark.ml.linalg.Vectors; import org.apache.spark.sql.Row; 
import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.*; // $example off$ /** - * An example for ANOVA testing. + * An example for ANOVASelector. * Run with * <pre> - * bin/run-example ml.JavaANOVATestExample + * bin/run-example ml.JavaANOVASelectorExample * </pre> */ -public class JavaANOVATestExample { - +public class JavaANOVASelectorExample { public static void main(String[] args) { SparkSession spark = SparkSession .builder() - .appName("JavaANOVATestExample") + .appName("JavaANOVASelectorExample") .getOrCreate(); // $example on$ List<Row> data = Arrays.asList( - RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + RowFactory.create(1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0), + RowFactory.create(2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0), + RowFactory.create(3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0), + RowFactory.create(4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0), + RowFactory.create(5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0), + RowFactory.create(6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0) ); - StructType schema = new StructType(new StructField[]{ - new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()), + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()) }); Dataset<Row> df = spark.createDataFrame(data, schema); - Row r = ANOVATest.test(df, "features", "label").head(); - System.out.println("pValues: " + r.get(0).toString()); - 
System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fValues: " + r.get(2).toString()); - // $example off$ + ANOVASelector selector = new ANOVASelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures"); + + Dataset<Row> result = selector.fit(df).transform(df); + System.out.println("ANOVASelector output with top " + selector.getNumTopFeatures() + + " features selected"); + result.show(); + + // $example off$ spark.stop(); } } diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java index 3b2de1f..4785dbd 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java @@ -51,7 +51,7 @@ public class JavaANOVATestExample { List<Row> data = Arrays.asList( RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), + RowFactory.create(3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java similarity index 59% copy from examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java copy to examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java index 3b2de1f..e8253ff 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java @@ -17,59 +17,65 @@ package org.apache.spark.examples.ml; +import org.apache.spark.sql.Dataset; import org.apache.spark.sql.SparkSession; // $example on$ import java.util.Arrays; import java.util.List; -import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.feature.FValueSelector; import org.apache.spark.ml.linalg.VectorUDT; -import org.apache.spark.ml.stat.ANOVATest; -import org.apache.spark.sql.Dataset; +import org.apache.spark.ml.linalg.Vectors; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.*; // $example off$ /** - * An example for ANOVA testing. + * An example demonstrating FValueSelector. * Run with * <pre> - * bin/run-example ml.JavaANOVATestExample + * bin/run-example ml.JavaFValueSelectorExample * </pre> */ -public class JavaANOVATestExample { - +public class JavaFValueSelectorExample { public static void main(String[] args) { SparkSession spark = SparkSession .builder() - .appName("JavaANOVATestExample") + .appName("JavaFValueSelectorExample") .getOrCreate(); // $example on$ List<Row> data = Arrays.asList( - RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + RowFactory.create(1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6), + RowFactory.create(2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6), + RowFactory.create(3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1), + RowFactory.create(4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6), + RowFactory.create(5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0), + 
RowFactory.create(6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0) ); - StructType schema = new StructType(new StructField[]{ - new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()), + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()) }); Dataset<Row> df = spark.createDataFrame(data, schema); - Row r = ANOVATest.test(df, "features", "label").head(); - System.out.println("pValues: " + r.get(0).toString()); - System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fValues: " + r.get(2).toString()); - // $example off$ + FValueSelector selector = new FValueSelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures"); + + Dataset<Row> result = selector.fit(df).transform(df); + System.out.println("FValueSelector output with top " + selector.getNumTopFeatures() + + " features selected"); + result.show(); + + // $example off$ spark.stop(); } } diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java index 11861ac..cda28db 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java @@ -66,7 +66,7 @@ public class JavaFValueTestExample { Row r = FValueTest.test(df, "features", "label").head(); System.out.println("pValues: " + r.get(0).toString()); System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fvalue: " + r.get(2).toString()); + System.out.println("fvalues: " + r.get(2).toString()); // $example off$ diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_selector_example.py 
similarity index 54% copy from examples/src/main/python/ml/anova_test_example.py copy to examples/src/main/python/ml/anova_selector_example.py index 3fffdbd..f8458f5 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/anova_selector_example.py @@ -16,37 +16,40 @@ # """ -An example for ANOVA testing. +An example for ANOVASelector. Run with: - bin/spark-submit examples/src/main/python/ml/anova_test_example.py + bin/spark-submit examples/src/main/python/ml/anova_selector_example.py """ from __future__ import print_function from pyspark.sql import SparkSession # $example on$ +from pyspark.ml.feature import ANOVASelector from pyspark.ml.linalg import Vectors -from pyspark.ml.stat import ANOVATest # $example off$ if __name__ == "__main__": spark = SparkSession\ .builder\ - .appName("ANOVATestExample")\ + .appName("ANOVASelectorExample")\ .getOrCreate() # $example on$ - data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), - (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), - (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), - (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), - (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] - df = spark.createDataFrame(data, ["label", "features"]) - - r = ANOVATest.test(df, "features", "label").head() - print("pValues: " + str(r.pValues)) - print("degreesOfFreedom: " + str(r.degreesOfFreedom)) - print("fValues: " + str(r.fValues)) + df = spark.createDataFrame([ + (1, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0,), + (2, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0,), + (3, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 3.0,), + (4, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0,), + (5, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0,), + (6, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0,)], ["id", "features", "label"]) + + selector = ANOVASelector(numTopFeatures=1, featuresCol="features", 
+ outputCol="selectedFeatures", labelCol="label") + + result = selector.fit(df).transform(df) + + print("ANOVASelector output with top %d features selected" % selector.getNumTopFeatures()) + result.show() # $example off$ spark.stop() diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_test_example.py index 3fffdbd..4119441 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/anova_test_example.py @@ -37,7 +37,7 @@ if __name__ == "__main__": # $example on$ data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), + (3.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/fvalue_selector_example.py similarity index 53% copy from examples/src/main/python/ml/anova_test_example.py copy to examples/src/main/python/ml/fvalue_selector_example.py index 3fffdbd..3158953a 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/fvalue_selector_example.py @@ -16,37 +16,40 @@ # """ -An example for ANOVA testing. +An example for FValueSelector. 
Run with: - bin/spark-submit examples/src/main/python/ml/anova_test_example.py + bin/spark-submit examples/src/main/python/ml/fvalue_selector_example.py """ from __future__ import print_function from pyspark.sql import SparkSession # $example on$ +from pyspark.ml.feature import FValueSelector from pyspark.ml.linalg import Vectors -from pyspark.ml.stat import ANOVATest # $example off$ if __name__ == "__main__": spark = SparkSession\ .builder\ - .appName("ANOVATestExample")\ + .appName("FValueSelectorExample")\ .getOrCreate() # $example on$ - data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), - (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), - (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), - (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), - (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] - df = spark.createDataFrame(data, ["label", "features"]) - - r = ANOVATest.test(df, "features", "label").head() - print("pValues: " + str(r.pValues)) - print("degreesOfFreedom: " + str(r.degreesOfFreedom)) - print("fValues: " + str(r.fValues)) + df = spark.createDataFrame([ + (1, Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]), 4.6,), + (2, Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]), 6.6,), + (3, Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]), 5.1,), + (4, Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]), 7.6,), + (5, Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]), 9.0,), + (6, Vectors.dense([8.0, 9.0, 6.0, 4.0, 0.0, 0.0]), 9.0,)], ["id", "features", "label"]) + + selector = FValueSelector(numTopFeatures=1, featuresCol="features", + outputCol="selectedFeatures", labelCol="label") + + result = selector.fit(df).transform(df) + + print("FValueSelector output with top %d features selected" % selector.getNumTopFeatures()) + result.show() # $example off$ spark.stop() diff --git a/examples/src/main/python/ml/fvalue_test_example.py b/examples/src/main/python/ml/fvalue_test_example.py index 
4a97bcd..410b39e 100644 --- a/examples/src/main/python/ml/fvalue_test_example.py +++ b/examples/src/main/python/ml/fvalue_test_example.py @@ -46,7 +46,7 @@ if __name__ == "__main__": ftest = FValueTest.test(df, "features", "label").head() print("pValues: " + str(ftest.pValues)) print("degreesOfFreedom: " + str(ftest.degreesOfFreedom)) - print("fvalue: " + str(ftest.fValues)) + print("fvalues: " + str(ftest.fValues)) # $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala similarity index 55% copy from examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala copy to examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala index 0cd793f..46803cc 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala @@ -19,42 +19,48 @@ package org.apache.spark.examples.ml // $example on$ -import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.stat.ANOVATest +import org.apache.spark.ml.feature.ANOVASelector +import org.apache.spark.ml.linalg.Vectors // $example off$ import org.apache.spark.sql.SparkSession /** - * An example for ANOVA testing. + * An example for ANOVASelector. 
* Run with * {{{ - * bin/run-example ml.ANOVATestExample + * bin/run-example ml.ANOVASelectorExample * }}} */ -object ANOVATestExample { - +object ANOVASelectorExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder - .appName("ANOVATestExample") + .appName("ANOVASelectorExample") .getOrCreate() import spark.implicits._ // $example on$ val data = Seq( - (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + (1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0), + (2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0), + (3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0), + (4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0), + (5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0), + (6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0) ) - val df = data.toDF("label", "features") - val anova = ANOVATest.test(df, "features", "label").head - println(s"pValues = ${anova.getAs[Vector](0)}") - println(s"degreesOfFreedom ${anova.getSeq[Int](1).mkString("[", ",", "]")}") - println(s"fValues ${anova.getAs[Vector](2)}") + val df = spark.createDataset(data).toDF("id", "features", "label") + + val selector = new ANOVASelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures") + + val result = selector.fit(df).transform(df) + + println(s"ANOVASelector output with top ${selector.getNumTopFeatures} features selected") + result.show() // $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala index 0cd793f..f0b9f23 100644 --- 
a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala @@ -44,7 +44,7 @@ object ANOVATestExample { val data = Seq( (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), + (3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala similarity index 54% copy from examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala copy to examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala index 0cd793f..914d81b 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala @@ -19,42 +19,48 @@ package org.apache.spark.examples.ml // $example on$ -import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.stat.ANOVATest +import org.apache.spark.ml.feature.FValueSelector +import org.apache.spark.ml.linalg.Vectors // $example off$ import org.apache.spark.sql.SparkSession /** - * An example for ANOVA testing. + * An example for FValueSelector. 
* Run with * {{{ - * bin/run-example ml.ANOVATestExample + * bin/run-example ml.FValueSelectorExample * }}} */ -object ANOVATestExample { - +object FValueSelectorExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder - .appName("ANOVATestExample") + .appName("FValueSelectorExample") .getOrCreate() import spark.implicits._ // $example on$ val data = Seq( - (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6), + (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6), + (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1), + (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6), + (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0), + (6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0) ) - val df = data.toDF("label", "features") - val anova = ANOVATest.test(df, "features", "label").head - println(s"pValues = ${anova.getAs[Vector](0)}") - println(s"degreesOfFreedom ${anova.getSeq[Int](1).mkString("[", ",", "]")}") - println(s"fValues ${anova.getAs[Vector](2)}") + val df = spark.createDataset(data).toDF("id", "features", "label") + + val selector = new FValueSelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures") + + val result = selector.fit(df).transform(df) + + println(s"FValueSelector output with top ${selector.getNumTopFeatures} features selected") + result.show() // $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FVlaueTestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala similarity index 100% rename from 
examples/src/main/scala/org/apache/spark/examples/ml/FVlaueTestExample.scala rename to examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala