This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 194ac3b  [SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector
194ac3b is described below

commit 194ac3be8bd8ca1b5e463074ed61420f185e8caf
Author: Huaxin Gao <huax...@us.ibm.com>
AuthorDate: Fri May 15 09:59:14 2020 -0500

[SPARK-31708][ML][DOCS] Add docs and examples for ANOVASelector and FValueSelector

### What changes were proposed in this pull request?
Add docs and examples for ANOVASelector and FValueSelector

### Why are the changes needed?
Complete the implementation of ANOVASelector and FValueSelector

### Does this PR introduce _any_ user-facing change?
Yes

<img width="850" alt="Screen Shot 2020-05-13 at 5 17 44 PM" src="https://user-images.githubusercontent.com/13592258/81878703-b4f94480-953d-11ea-9166-da3c64852b90.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 05 15 PM" src="https://user-images.githubusercontent.com/13592258/81878600-6055c980-953d-11ea-8b24-09c31647139b.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 06 06 PM" src="https://user-images.githubusercontent.com/13592258/81878603-621f8d00-953d-11ea-9447-39913ccc067d.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 06 21 PM" src="https://user-images.githubusercontent.com/13592258/81878606-65b31400-953d-11ea-9d76-51859266d1a8.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 10 PM" src="https://user-images.githubusercontent.com/13592258/81878611-69df3180-953d-11ea-8618-23a2a6cfd730.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 33 PM" src="https://user-images.githubusercontent.com/13592258/81878620-6cda2200-953d-11ea-9c46-da763328364e.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 07 47 PM" src="https://user-images.githubusercontent.com/13592258/81878625-6f3c7c00-953d-11ea-9d11-2281b33a0bd8.png">
<img width="851" alt="Screen Shot 2020-05-13 at 5 19 35 PM" src="https://user-images.githubusercontent.com/13592258/81878882-13bebe00-953e-11ea-9776-288bac97d93f.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 08 42 PM" src="https://user-images.githubusercontent.com/13592258/81878637-76638a00-953d-11ea-94b0-dc9bc85ae2b7.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 09 01 PM" src="https://user-images.githubusercontent.com/13592258/81878640-79f71100-953d-11ea-9a66-b27f9482fbd3.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 09 50 PM" src="https://user-images.githubusercontent.com/13592258/81878644-7cf20180-953d-11ea-9142-9658c8e90986.png">
<img width="851" alt="Screen Shot 2020-05-13 at 5 10 06 PM" src="https://user-images.githubusercontent.com/13592258/81878653-81b6b580-953d-11ea-9dc2-8015095cf569.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 10 59 PM" src="https://user-images.githubusercontent.com/13592258/81878658-854a3c80-953d-11ea-8dc9-217aa749fd00.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 11 27 PM" src="https://user-images.githubusercontent.com/13592258/81878659-87ac9680-953d-11ea-8c6b-74ab76748e4a.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 14 54 PM" src="https://user-images.githubusercontent.com/13592258/81878664-8b401d80-953d-11ea-9ee1-05f6677e263c.png">
<img width="850" alt="Screen Shot 2020-05-13 at 5 15 17 PM" src="https://user-images.githubusercontent.com/13592258/81878669-8da27780-953d-11ea-8216-77eb8bb7e091.png">

### How was this patch tested?
Manually build and check

Closes #28524 from huaxingao/examples.
Authored-by: Huaxin Gao <huax...@us.ibm.com> Signed-off-by: Sean Owen <sro...@gmail.com> --- docs/ml-features.md | 140 +++++++++++++++++++++ docs/ml-statistics.md | 56 ++++++++- ...tExample.java => JavaANOVASelectorExample.java} | 48 +++---- .../spark/examples/ml/JavaANOVATestExample.java | 2 +- ...Example.java => JavaFValueSelectorExample.java} | 48 +++---- .../spark/examples/ml/JavaFValueTestExample.java | 2 +- ...a_test_example.py => anova_selector_example.py} | 35 +++--- examples/src/main/python/ml/anova_test_example.py | 2 +- ..._test_example.py => fvalue_selector_example.py} | 35 +++--- examples/src/main/python/ml/fvalue_test_example.py | 2 +- ...estExample.scala => ANOVASelectorExample.scala} | 42 ++++--- .../spark/examples/ml/ANOVATestExample.scala | 2 +- ...stExample.scala => FValueSelectorExample.scala} | 42 ++++--- ...ueTestExample.scala => FValueTestExample.scala} | 0 14 files changed, 340 insertions(+), 116 deletions(-) diff --git a/docs/ml-features.md b/docs/ml-features.md index 65b60be..660c272 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -1793,6 +1793,146 @@ for more details on the API. </div> </div> +## ANOVASelector + +`ANOVASelector` operates on categorical labels with continuous features. It uses the +[one-way ANOVA F-test](https://en.wikipedia.org/wiki/F-test#Multiple-comparison_ANOVA_problems) to decide which +features to choose. +It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to ANOVA F-test. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. 
+* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. +The user can choose a selection method using `setSelectorType`. + +**Examples** + +Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as +our target to be predicted: + +~~~ +id | features | label +---|--------------------------------|--------- + 1 | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0 + 2 | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0 + 3 | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0 + 4 | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0 + 5 | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0 + 6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0 +~~~ + +If we use `ANOVASelector` with `numTopFeatures = 1`, the +last column in our `features` is chosen as the most useful feature: + +~~~ +id | features | label | selectedFeatures +---|--------------------------------|---------|------------------ + 1 | [1.7, 4.4, 7.6, 5.8, 9.6, 2.3] | 3.0 | [2.3] + 2 | [8.8, 7.3, 5.7, 7.3, 2.2, 4.1] | 2.0 | [4.1] + 3 | [1.2, 9.5, 2.5, 3.1, 8.7, 2.5] | 3.0 | [2.5] + 4 | [3.7, 9.2, 6.1, 4.1, 7.5, 3.8] | 2.0 | [3.8] + 5 | [8.9, 5.2, 7.8, 8.3, 5.2, 3.0] | 4.0 | [3.0] + 6 | [7.9, 8.5, 9.2, 4.0, 9.4, 2.1] | 4.0 | [2.1] +~~~ + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +Refer to the [ANOVASelector Scala docs](api/scala/org/apache/spark/ml/feature/ANOVASelector.html) +for more details on the API. 
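As a sanity check on the example above, the ranking behind `numTopFeatures` can be reproduced outside Spark. The sketch below (plain Python, not Spark's implementation) computes the one-way ANOVA F-statistic for each of the six features in the sample data; the last feature gets the largest F-value, which is why it is the one selected.

```python
from collections import defaultdict

# Sample data from the ANOVASelector example: (label, feature vector)
rows = [
    (3.0, [1.7, 4.4, 7.6, 5.8, 9.6, 2.3]),
    (2.0, [8.8, 7.3, 5.7, 7.3, 2.2, 4.1]),
    (3.0, [1.2, 9.5, 2.5, 3.1, 8.7, 2.5]),
    (2.0, [3.7, 9.2, 6.1, 4.1, 7.5, 3.8]),
    (4.0, [8.9, 5.2, 7.8, 8.3, 5.2, 3.0]),
    (4.0, [7.9, 8.5, 9.2, 4.0, 9.4, 2.1]),
]

def anova_f(values, labels):
    """One-way ANOVA F-statistic: between-group vs. within-group variance."""
    groups = defaultdict(list)
    for v, l in zip(values, labels):
        groups[l].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-group sum of squares around each group mean
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [r[0] for r in rows]
f_values = [anova_f([r[1][j] for r in rows], labels) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # -> 5, i.e. the last column, as selected above
```

With only two observations per label group, the between-group spread of the last feature dominates its within-group spread (F is roughly 9.33), ahead of every other feature.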
+ +{% include_example scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala %} +</div> + +<div data-lang="java" markdown="1"> + +Refer to the [ANOVASelector Java docs](api/java/org/apache/spark/ml/feature/ANOVASelector.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +Refer to the [ANOVASelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ANOVASelector) +for more details on the API. + +{% include_example python/ml/anova_selector_example.py %} +</div> +</div> + +## FValueSelector + +`FValueSelector` operates on continuous labels with continuous features. It uses the +[F-test for regression](https://en.wikipedia.org/wiki/F-test#Regression_problems) to decide which +features to choose. +It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: +* `numTopFeatures` chooses a fixed number of top features according to an F-test for regression. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. +* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold. +* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. +The user can choose a selection method using `setSelectorType`.
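It may help to see where the per-feature F-values come from. A minimal plain-Python sketch (not Spark's implementation) derives each feature's regression F-value from its squared Pearson correlation r^2 with the label, using F = r^2 (n - 2) / (1 - r^2); on the sample DataFrame used in the example below, the 3rd feature scores highest.

```python
# Sample data from the FValueSelector example: (feature vector, label)
rows = [
    ([6.0, 7.0, 0.0, 7.0, 6.0, 0.0], 4.6),
    ([0.0, 9.0, 6.0, 0.0, 5.0, 9.0], 6.6),
    ([0.0, 9.0, 3.0, 0.0, 5.0, 5.0], 5.1),
    ([0.0, 9.0, 8.0, 5.0, 6.0, 4.0], 7.6),
    ([8.0, 9.0, 6.0, 5.0, 4.0, 4.0], 9.0),
    ([8.0, 9.0, 6.0, 4.0, 0.0, 0.0], 9.0),
]

def f_value(xs, ys):
    """Regression F-value of one feature from its correlation with the label."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r2 = sxy * sxy / (sxx * syy)  # squared Pearson correlation
    return r2 * (n - 2) / (1 - r2)

ys = [r[1] for r in rows]
f_values = [f_value([r[0][j] for r in rows], ys) for j in range(6)]
best = max(range(6), key=lambda j: f_values[j])
print(best)  # -> 2, i.e. the 3rd column, matching the example
```

The 3rd feature's F-value (roughly 6.37) exceeds all others, which matches the `selectedFeatures` column shown in the example table below.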
+ +**Examples** + +Assume that we have a DataFrame with the columns `id`, `features`, and `label`, which is used as +our target to be predicted: + +~~~ +id | features | label +---|--------------------------------|--------- + 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6 + 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6 + 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1 + 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6 + 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0 + 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0 +~~~ + +If we use `FValueSelector` with `numTopFeatures = 1`, the +3rd column in our `features` is chosen as the most useful feature: + +~~~ +id | features | label | selectedFeatures +---|--------------------------------|---------|------------------ + 1 | [6.0, 7.0, 0.0, 7.0, 6.0, 0.0] | 4.6 | [0.0] + 2 | [0.0, 9.0, 6.0, 0.0, 5.0, 9.0] | 6.6 | [6.0] + 3 | [0.0, 9.0, 3.0, 0.0, 5.0, 5.0] | 5.1 | [3.0] + 4 | [0.0, 9.0, 8.0, 5.0, 6.0, 4.0] | 7.6 | [8.0] + 5 | [8.0, 9.0, 6.0, 5.0, 4.0, 4.0] | 9.0 | [6.0] + 6 | [8.0, 9.0, 6.0, 4.0, 0.0, 0.0] | 9.0 | [6.0] +~~~ + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +Refer to the [FValueSelector Scala docs](api/scala/org/apache/spark/ml/feature/FValueSelector.html) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/FValueSelectorExample.scala %} +</div> + +<div data-lang="java" markdown="1"> + +Refer to the [FValueSelector Java docs](api/java/org/apache/spark/ml/feature/FValueSelector.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +Refer to the [FValueSelector Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FValueSelector) +for more details on the API. + +{% include_example python/ml/fvalue_selector_example.py %} +</div> +</div> + ## VarianceThresholdSelector + `VarianceThresholdSelector` is a selector that removes low-variance features.
Features with a diff --git a/docs/ml-statistics.md b/docs/ml-statistics.md index a3d57ff..637cdd6 100644 --- a/docs/ml-statistics.md +++ b/docs/ml-statistics.md @@ -79,7 +79,35 @@ The output will be a DataFrame that contains the correlation matrix of the colum Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant, whether this result occurred by chance or not. `spark.ml` currently supports Pearson's -Chi-squared ( $\chi^2$) tests for independence. +Chi-squared ( $\chi^2$) tests for independence, as well as ANOVA test for classification tasks and +F-value test for regression tasks. + +### ANOVATest + +`ANOVATest` computes ANOVA F-values between labels and features for classification tasks. The labels should be categorical +and features should be continuous. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> +Refer to the [`ANOVATest` Scala docs](api/scala/org/apache/spark/ml/stat/ANOVATest$.html) for details on the API. + +{% include_example scala/org/apache/spark/examples/ml/ANOVATestExample.scala %} +</div> + +<div data-lang="java" markdown="1"> +Refer to the [`ANOVATest` Java docs](api/java/org/apache/spark/ml/stat/ANOVATest.html) for details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaANOVATestExample.java %} +</div> + +<div data-lang="python" markdown="1"> +Refer to the [`ANOVATest` Python docs](api/python/index.html#pyspark.ml.stat.ANOVATest$) for details on the API. + +{% include_example python/ml/anova_test_example.py %} +</div> +</div> + +### ChiSquareTest `ChiSquareTest` conducts Pearson's independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which @@ -106,6 +134,32 @@ Refer to the [`ChiSquareTest` Python docs](api/python/index.html#pyspark.ml.stat </div> +### FValueTest + +`FValueTest` computes F-values between labels and features for regression tasks. 
Both the labels + and features should be continuous. + + <div class="codetabs"> + <div data-lang="scala" markdown="1"> + Refer to the [`FValueTest` Scala docs](api/scala/org/apache/spark/ml/stat/FValueTest$.html) for details on the API. + + {% include_example scala/org/apache/spark/examples/ml/FValueTestExample.scala %} + </div> + + <div data-lang="java" markdown="1"> + Refer to the [`FValueTest` Java docs](api/java/org/apache/spark/ml/stat/FValueTest.html) for details on the API. + + {% include_example java/org/apache/spark/examples/ml/JavaFValueTestExample.java %} + </div> + + <div data-lang="python" markdown="1"> + Refer to the [`FValueTest` Python docs](api/python/index.html#pyspark.ml.stat.FValueTest$) for details on the API. + + {% include_example python/ml/fvalue_test_example.py %} + </div> + + </div> + ## Summarizer We provide vector column summary statistics for `Dataframe` through `Summarizer`. diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java similarity index 60% copy from examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java copy to examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java index 3b2de1f..6f24b45 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVASelectorExample.java @@ -17,59 +17,65 @@ package org.apache.spark.examples.ml; +import org.apache.spark.sql.Dataset; import org.apache.spark.sql.SparkSession; // $example on$ import java.util.Arrays; import java.util.List; -import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.feature.ANOVASelector; import org.apache.spark.ml.linalg.VectorUDT; -import org.apache.spark.ml.stat.ANOVATest; -import org.apache.spark.sql.Dataset; +import org.apache.spark.ml.linalg.Vectors; import org.apache.spark.sql.Row; 
import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.*; // $example off$ /** - * An example for ANOVA testing. + * An example for ANOVASelector. * Run with * <pre> - * bin/run-example ml.JavaANOVATestExample + * bin/run-example ml.JavaANOVASelectorExample * </pre> */ -public class JavaANOVATestExample { - +public class JavaANOVASelectorExample { public static void main(String[] args) { SparkSession spark = SparkSession .builder() - .appName("JavaANOVATestExample") + .appName("JavaANOVASelectorExample") .getOrCreate(); // $example on$ List<Row> data = Arrays.asList( - RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + RowFactory.create(1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0), + RowFactory.create(2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0), + RowFactory.create(3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0), + RowFactory.create(4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0), + RowFactory.create(5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0), + RowFactory.create(6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0) ); - StructType schema = new StructType(new StructField[]{ - new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()), + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()) }); Dataset<Row> df = spark.createDataFrame(data, schema); - Row r = ANOVATest.test(df, "features", "label").head(); - System.out.println("pValues: " + r.get(0).toString()); - 
System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fValues: " + r.get(2).toString()); - // $example off$ + ANOVASelector selector = new ANOVASelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures"); + + Dataset<Row> result = selector.fit(df).transform(df); + System.out.println("ANOVASelector output with top " + selector.getNumTopFeatures() + + " features selected"); + result.show(); + + // $example off$ spark.stop(); } } diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java index 3b2de1f..4785dbd 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java @@ -51,7 +51,7 @@ public class JavaANOVATestExample { List<Row> data = Arrays.asList( RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), + RowFactory.create(3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java similarity index 59% copy from examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java copy to examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java index 3b2de1f..e8253ff 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaANOVATestExample.java +++ 
b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueSelectorExample.java @@ -17,59 +17,65 @@ package org.apache.spark.examples.ml; +import org.apache.spark.sql.Dataset; import org.apache.spark.sql.SparkSession; // $example on$ import java.util.Arrays; import java.util.List; -import org.apache.spark.ml.linalg.Vectors; +import org.apache.spark.ml.feature.FValueSelector; import org.apache.spark.ml.linalg.VectorUDT; -import org.apache.spark.ml.stat.ANOVATest; -import org.apache.spark.sql.Dataset; +import org.apache.spark.ml.linalg.Vectors; import org.apache.spark.sql.Row; import org.apache.spark.sql.RowFactory; import org.apache.spark.sql.types.*; // $example off$ /** - * An example for ANOVA testing. + * An example demonstrating FValueSelector. * Run with * <pre> - * bin/run-example ml.JavaANOVATestExample + * bin/run-example ml.JavaFValueSelectorExample * </pre> */ -public class JavaANOVATestExample { - +public class JavaFValueSelectorExample { public static void main(String[] args) { SparkSession spark = SparkSession .builder() - .appName("JavaANOVATestExample") + .appName("JavaFValueSelectorExample") .getOrCreate(); // $example on$ List<Row> data = Arrays.asList( - RowFactory.create(3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - RowFactory.create(2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - RowFactory.create(1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - RowFactory.create(2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - RowFactory.create(4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - RowFactory.create(4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + RowFactory.create(1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6), + RowFactory.create(2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6), + RowFactory.create(3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1), + RowFactory.create(4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6), + RowFactory.create(5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0), + 
RowFactory.create(6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0) ); - StructType schema = new StructType(new StructField[]{ - new StructField("label", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("id", DataTypes.IntegerType, false, Metadata.empty()), new StructField("features", new VectorUDT(), false, Metadata.empty()), + new StructField("label", DataTypes.DoubleType, false, Metadata.empty()) }); Dataset<Row> df = spark.createDataFrame(data, schema); - Row r = ANOVATest.test(df, "features", "label").head(); - System.out.println("pValues: " + r.get(0).toString()); - System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fValues: " + r.get(2).toString()); - // $example off$ + FValueSelector selector = new FValueSelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures"); + + Dataset<Row> result = selector.fit(df).transform(df); + System.out.println("FValueSelector output with top " + selector.getNumTopFeatures() + + " features selected"); + result.show(); + + // $example off$ spark.stop(); } } diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java index 11861ac..cda28db 100644 --- a/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFValueTestExample.java @@ -66,7 +66,7 @@ public class JavaFValueTestExample { Row r = FValueTest.test(df, "features", "label").head(); System.out.println("pValues: " + r.get(0).toString()); System.out.println("degreesOfFreedom: " + r.getList(1).toString()); - System.out.println("fvalue: " + r.get(2).toString()); + System.out.println("fvalues: " + r.get(2).toString()); // $example off$ diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_selector_example.py 
similarity index 54% copy from examples/src/main/python/ml/anova_test_example.py copy to examples/src/main/python/ml/anova_selector_example.py index 3fffdbd..f8458f5 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/anova_selector_example.py @@ -16,37 +16,40 @@ # """ -An example for ANOVA testing. +An example for ANOVASelector. Run with: - bin/spark-submit examples/src/main/python/ml/anova_test_example.py + bin/spark-submit examples/src/main/python/ml/anova_selector_example.py """ from __future__ import print_function from pyspark.sql import SparkSession # $example on$ +from pyspark.ml.feature import ANOVASelector from pyspark.ml.linalg import Vectors -from pyspark.ml.stat import ANOVATest # $example off$ if __name__ == "__main__": spark = SparkSession\ .builder\ - .appName("ANOVATestExample")\ + .appName("ANOVASelectorExample")\ .getOrCreate() # $example on$ - data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), - (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), - (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), - (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), - (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] - df = spark.createDataFrame(data, ["label", "features"]) - - r = ANOVATest.test(df, "features", "label").head() - print("pValues: " + str(r.pValues)) - print("degreesOfFreedom: " + str(r.degreesOfFreedom)) - print("fValues: " + str(r.fValues)) + df = spark.createDataFrame([ + (1, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3]), 3.0,), + (2, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1]), 2.0,), + (3, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5]), 3.0,), + (4, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8]), 2.0,), + (5, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0]), 4.0,), + (6, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]), 4.0,)], ["id", "features", "label"]) + + selector = ANOVASelector(numTopFeatures=1, featuresCol="features", 
+ outputCol="selectedFeatures", labelCol="label") + + result = selector.fit(df).transform(df) + + print("ANOVASelector output with top %d features selected" % selector.getNumTopFeatures()) + result.show() # $example off$ spark.stop() diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/anova_test_example.py index 3fffdbd..4119441 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/anova_test_example.py @@ -37,7 +37,7 @@ if __name__ == "__main__": # $example on$ data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), + (3.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] diff --git a/examples/src/main/python/ml/anova_test_example.py b/examples/src/main/python/ml/fvalue_selector_example.py similarity index 53% copy from examples/src/main/python/ml/anova_test_example.py copy to examples/src/main/python/ml/fvalue_selector_example.py index 3fffdbd..3158953a 100644 --- a/examples/src/main/python/ml/anova_test_example.py +++ b/examples/src/main/python/ml/fvalue_selector_example.py @@ -16,37 +16,40 @@ # """ -An example for ANOVA testing. +An example for FValueSelector. 
Run with: - bin/spark-submit examples/src/main/python/ml/anova_test_example.py + bin/spark-submit examples/src/main/python/ml/fvalue_selector_example.py """ from __future__ import print_function from pyspark.sql import SparkSession # $example on$ +from pyspark.ml.feature import FValueSelector from pyspark.ml.linalg import Vectors -from pyspark.ml.stat import ANOVATest # $example off$ if __name__ == "__main__": spark = SparkSession\ .builder\ - .appName("ANOVATestExample")\ + .appName("FValueSelectorExample")\ .getOrCreate() # $example on$ - data = [(3.0, Vectors.dense([1.7, 4.4, 7.6, 5.8, 9.6, 2.3])), - (2.0, Vectors.dense([8.8, 7.3, 5.7, 7.3, 2.2, 4.1])), - (1.0, Vectors.dense([1.2, 9.5, 2.5, 3.1, 8.7, 2.5])), - (2.0, Vectors.dense([3.7, 9.2, 6.1, 4.1, 7.5, 3.8])), - (4.0, Vectors.dense([8.9, 5.2, 7.8, 8.3, 5.2, 3.0])), - (4.0, Vectors.dense([7.9, 8.5, 9.2, 4.0, 9.4, 2.1]))] - df = spark.createDataFrame(data, ["label", "features"]) - - r = ANOVATest.test(df, "features", "label").head() - print("pValues: " + str(r.pValues)) - print("degreesOfFreedom: " + str(r.degreesOfFreedom)) - print("fValues: " + str(r.fValues)) + df = spark.createDataFrame([ + (1, Vectors.dense([6.0, 7.0, 0.0, 7.0, 6.0, 0.0]), 4.6,), + (2, Vectors.dense([0.0, 9.0, 6.0, 0.0, 5.0, 9.0]), 6.6,), + (3, Vectors.dense([0.0, 9.0, 3.0, 0.0, 5.0, 5.0]), 5.1,), + (4, Vectors.dense([0.0, 9.0, 8.0, 5.0, 6.0, 4.0]), 7.6,), + (5, Vectors.dense([8.0, 9.0, 6.0, 5.0, 4.0, 4.0]), 9.0,), + (6, Vectors.dense([8.0, 9.0, 6.0, 4.0, 0.0, 0.0]), 9.0,)], ["id", "features", "label"]) + + selector = FValueSelector(numTopFeatures=1, featuresCol="features", + outputCol="selectedFeatures", labelCol="label") + + result = selector.fit(df).transform(df) + + print("FValueSelector output with top %d features selected" % selector.getNumTopFeatures()) + result.show() # $example off$ spark.stop() diff --git a/examples/src/main/python/ml/fvalue_test_example.py b/examples/src/main/python/ml/fvalue_test_example.py index 
4a97bcd..410b39e 100644 --- a/examples/src/main/python/ml/fvalue_test_example.py +++ b/examples/src/main/python/ml/fvalue_test_example.py @@ -46,7 +46,7 @@ if __name__ == "__main__": ftest = FValueTest.test(df, "features", "label").head() print("pValues: " + str(ftest.pValues)) print("degreesOfFreedom: " + str(ftest.degreesOfFreedom)) - print("fvalue: " + str(ftest.fValues)) + print("fvalues: " + str(ftest.fValues)) # $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala similarity index 55% copy from examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala copy to examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala index 0cd793f..46803cc 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVASelectorExample.scala @@ -19,42 +19,48 @@ package org.apache.spark.examples.ml // $example on$ -import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.stat.ANOVATest +import org.apache.spark.ml.feature.ANOVASelector +import org.apache.spark.ml.linalg.Vectors // $example off$ import org.apache.spark.sql.SparkSession /** - * An example for ANOVA testing. + * An example for ANOVASelector. 
* Run with * {{{ - * bin/run-example ml.ANOVATestExample + * bin/run-example ml.ANOVASelectorExample * }}} */ -object ANOVATestExample { - +object ANOVASelectorExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder - .appName("ANOVATestExample") + .appName("ANOVASelectorExample") .getOrCreate() import spark.implicits._ // $example on$ val data = Seq( - (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + (1, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3), 3.0), + (2, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1), 2.0), + (3, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5), 3.0), + (4, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8), 2.0), + (5, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0), 4.0), + (6, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1), 4.0) ) - val df = data.toDF("label", "features") - val anova = ANOVATest.test(df, "features", "label").head - println(s"pValues = ${anova.getAs[Vector](0)}") - println(s"degreesOfFreedom ${anova.getSeq[Int](1).mkString("[", ",", "]")}") - println(s"fValues ${anova.getAs[Vector](2)}") + val df = spark.createDataset(data).toDF("id", "features", "label") + + val selector = new ANOVASelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures") + + val result = selector.fit(df).transform(df) + + println(s"ANOVASelector output with top ${selector.getNumTopFeatures} features selected") + result.show() // $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala index 0cd793f..f0b9f23 100644 --- 
a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala @@ -44,7 +44,7 @@ object ANOVATestExample { val data = Seq( (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), + (3.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala similarity index 54% copy from examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala copy to examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala index 0cd793f..914d81b 100644 --- a/examples/src/main/scala/org/apache/spark/examples/ml/ANOVATestExample.scala +++ b/examples/src/main/scala/org/apache/spark/examples/ml/FValueSelectorExample.scala @@ -19,42 +19,48 @@ package org.apache.spark.examples.ml // $example on$ -import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.stat.ANOVATest +import org.apache.spark.ml.feature.FValueSelector +import org.apache.spark.ml.linalg.Vectors // $example off$ import org.apache.spark.sql.SparkSession /** - * An example for ANOVA testing. + * An example for FValueSelector. 
* Run with * {{{ - * bin/run-example ml.ANOVATestExample + * bin/run-example ml.FValueSelectorExample * }}} */ -object ANOVATestExample { - +object FValueSelectorExample { def main(args: Array[String]): Unit = { val spark = SparkSession .builder - .appName("ANOVATestExample") + .appName("FValueSelectorExample") .getOrCreate() import spark.implicits._ // $example on$ val data = Seq( - (3.0, Vectors.dense(1.7, 4.4, 7.6, 5.8, 9.6, 2.3)), - (2.0, Vectors.dense(8.8, 7.3, 5.7, 7.3, 2.2, 4.1)), - (1.0, Vectors.dense(1.2, 9.5, 2.5, 3.1, 8.7, 2.5)), - (2.0, Vectors.dense(3.7, 9.2, 6.1, 4.1, 7.5, 3.8)), - (4.0, Vectors.dense(8.9, 5.2, 7.8, 8.3, 5.2, 3.0)), - (4.0, Vectors.dense(7.9, 8.5, 9.2, 4.0, 9.4, 2.1)) + (1, Vectors.dense(6.0, 7.0, 0.0, 7.0, 6.0, 0.0), 4.6), + (2, Vectors.dense(0.0, 9.0, 6.0, 0.0, 5.0, 9.0), 6.6), + (3, Vectors.dense(0.0, 9.0, 3.0, 0.0, 5.0, 5.0), 5.1), + (4, Vectors.dense(0.0, 9.0, 8.0, 5.0, 6.0, 4.0), 7.6), + (5, Vectors.dense(8.0, 9.0, 6.0, 5.0, 4.0, 4.0), 9.0), + (6, Vectors.dense(8.0, 9.0, 6.0, 4.0, 0.0, 0.0), 9.0) ) - val df = data.toDF("label", "features") - val anova = ANOVATest.test(df, "features", "label").head - println(s"pValues = ${anova.getAs[Vector](0)}") - println(s"degreesOfFreedom ${anova.getSeq[Int](1).mkString("[", ",", "]")}") - println(s"fValues ${anova.getAs[Vector](2)}") + val df = spark.createDataset(data).toDF("id", "features", "label") + + val selector = new FValueSelector() + .setNumTopFeatures(1) + .setFeaturesCol("features") + .setLabelCol("label") + .setOutputCol("selectedFeatures") + + val result = selector.fit(df).transform(df) + + println(s"FValueSelector output with top ${selector.getNumTopFeatures} features selected") + result.show() // $example off$ spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FVlaueTestExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala similarity index 100% rename from 
examples/src/main/scala/org/apache/spark/examples/ml/FVlaueTestExample.scala rename to examples/src/main/scala/org/apache/spark/examples/ml/FValueTestExample.scala