Repository: spark
Updated Branches:
  refs/heads/master 2af8b5cff -> 79ff85363


[SPARK-17645][MLLIB][ML] Add feature selector methods based on False Discovery 
Rate (FDR) and Family-Wise Error rate (FWE)

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on 
univariate statistical tests. FDR and FWE are popular criteria for selecting 
features based on such tests.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 
25 most-cited statistical papers. In this PR, the FDR method uses the 
Benjamini-Hochberg procedure: 
https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, the FWE is the probability of making one or more false 
discoveries (type I errors) among all the hypotheses when performing multiple 
hypothesis tests:
https://en.wikipedia.org/wiki/Family-wise_error_rate

This PR adds FDR and FWE selection methods to ChiSqSelector, following their 
implementation in scikit-learn:
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

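
For intuition, here is a minimal pure-Python sketch of the two procedures. This is not Spark's API; the function names and the p-values are illustrative only:

```python
def fwe_select(p_values, fwe=0.05):
    """FWE control (Bonferroni-style): compare each p-value to fwe / n."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < fwe / n]

def fdr_select(p_values, fdr=0.05):
    """FDR control via Benjamini-Hochberg: sort p-values ascending and keep
    all features up to the largest rank k (1-based) with p_(k) <= fdr * k / n."""
    n = len(p_values)
    ranked = sorted(enumerate(p_values), key=lambda kv: kv[1])
    keep = max((k for k, (_, p) in enumerate(ranked, 1) if p <= fdr * k / n),
               default=0)
    return sorted(i for i, _ in ranked[:keep])

p = [0.001, 0.012, 0.021, 0.04, 0.3]  # illustrative p-values for 5 features
print(fdr_select(p, fdr=0.05))  # [0, 1, 2, 3]
print(fwe_select(p, fwe=0.05))  # [0]
```

On the same p-values, FDR control keeps four features while FWE control keeps only one, which illustrates why FWE is the more conservative criterion.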
## How was this patch tested?

Unit tests for the new `fdr` and `fwe` selector types are added to the ML and 
MLlib `ChiSqSelectorSuite`s.


Author: Peng <peng.m...@intel.com>
Author: Peng, Meng <peng.m...@intel.com>

Closes #15212 from mpjlu/fdr_fwe.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/79ff8536
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/79ff8536
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/79ff8536

Branch: refs/heads/master
Commit: 79ff8536315aef97ee940c52d71cd8de777c7ce6
Parents: 2af8b5c
Author: Peng <peng.m...@intel.com>
Authored: Wed Dec 28 00:49:36 2016 -0800
Committer: Yanbo Liang <yblia...@gmail.com>
Committed: Wed Dec 28 00:49:36 2016 -0800

----------------------------------------------------------------------
 docs/ml-features.md                             |   6 +-
 docs/mllib-feature-extraction.md                |   4 +-
 .../apache/spark/ml/feature/ChiSqSelector.scala |  48 +++++-
 .../spark/mllib/api/python/PythonMLLibAPI.scala |   4 +
 .../spark/mllib/feature/ChiSqSelector.scala     |  62 ++++++--
 .../spark/ml/feature/ChiSqSelectorSuite.scala   |   6 +
 .../mllib/feature/ChiSqSelectorSuite.scala      | 147 +++++++++++++++----
 python/pyspark/ml/feature.py                    |  74 +++++++++-
 python/pyspark/mllib/feature.py                 |  50 ++++++-
 9 files changed, 337 insertions(+), 64 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index ca1ccc4..1d34497 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1423,12 +1423,12 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
-
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all 
features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
-
+* `fdr` uses the [Benjamini-Hochberg 
procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
 to choose all features whose false discovery rate is below a threshold.
+* `fwe` chooses all features whose p-value is below a threshold, thus 
controlling the family-wise error rate of selection.
 By default, the selection method is `numTopFeatures`, with the default number 
of top features set to 50.
 The user can choose a selection method using `setSelectorType`.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 42568c3..acd2894 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -227,11 +227,13 @@ both speed and statistical learning behavior.
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
 Chi-Squared feature selection. It operates on labeled data with categorical 
features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
+features to choose. It supports five selection methods: `numTopFeatures`, 
`percentile`, `fpr`, `fdr`, `fwe`:
 
 * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test. This is akin to yielding the features with the most 
predictive power.
 * `percentile` is similar to `numTopFeatures` but chooses a fraction of all 
features instead of a fixed number.
 * `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
+* `fdr` uses the [Benjamini-Hochberg 
procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
 to choose all features whose false discovery rate is below a threshold.
+* `fwe` chooses all features whose p-value is below a threshold, thus 
controlling the family-wise error rate of selection.
 
 By default, the selection method is `numTopFeatures`, with the default number 
of top features set to 50.
 The user can choose a selection method using `setSelectorType`.

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 8699929..353bd18 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
   def getFpr: Double = $(fpr)
 
   /**
+   * The upper bound of the expected false discovery rate.
+   * Only applicable when selectorType = "fdr".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val fdr = new DoubleParam(this, "fdr",
+    "The upper bound of the expected false discovery rate.", 
ParamValidators.inRange(0, 1))
+  setDefault(fdr -> 0.05)
+
+  /** @group getParam */
+  def getFdr: Double = $(fdr)
+
+  /**
+   * The upper bound of the expected family-wise error rate.
+   * Only applicable when selectorType = "fwe".
+   * Default value is 0.05.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val fwe = new DoubleParam(this, "fwe",
+    "The upper bound of the expected family-wise error rate.", 
ParamValidators.inRange(0, 1))
+  setDefault(fwe -> 0.05)
+
+  /** @group getParam */
+  def getFwe: Double = $(fwe)
+
+  /**
    * The selector type of the ChisqSelector.
-   * Supported options: "numTopFeatures" (default), "percentile", "fpr".
+   * Supported options: "numTopFeatures" (default), "percentile", "fpr", 
"fdr", "fwe".
    * @group param
    */
   @Since("2.1.0")
@@ -111,11 +139,17 @@ private[feature] trait ChiSqSelectorParams extends Params
 /**
  * Chi-Squared feature selection, which selects categorical features to use 
for predicting a
  * categorical label.
- * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+ * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features instead 
of a fixed number.
  *  - `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false
  *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-value is below a threshold,
+ *    thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default 
number of top features
  * set to 50.
  */
@@ -139,6 +173,14 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") 
override val uid: Str
   def setFpr(value: Double): this.type = set(fpr, value)
 
   /** @group setParam */
+  @Since("2.2.0")
+  def setFdr(value: Double): this.type = set(fdr, value)
+
+  /** @group setParam */
+  @Since("2.2.0")
+  def setFwe(value: Double): this.type = set(fwe, value)
+
+  /** @group setParam */
   @Since("2.1.0")
   def setSelectorType(value: String): this.type = set(selectorType, value)
 
@@ -167,6 +209,8 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") 
override val uid: Str
       .setNumTopFeatures($(numTopFeatures))
       .setPercentile($(percentile))
       .setFpr($(fpr))
+      .setFdr($(fdr))
+      .setFwe($(fwe))
     val model = selector.fit(input)
     copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 034e362..b32d3f2 100644
--- 
a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ 
b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -639,12 +639,16 @@ private[python] class PythonMLLibAPI extends Serializable 
{
       numTopFeatures: Int,
       percentile: Double,
       fpr: Double,
+      fdr: Double,
+      fwe: Double,
       data: JavaRDD[LabeledPoint]): ChiSqSelectorModel = {
     new ChiSqSelector()
       .setSelectorType(selectorType)
       .setNumTopFeatures(numTopFeatures)
       .setPercentile(percentile)
       .setFpr(fpr)
+      .setFdr(fdr)
+      .setFwe(fwe)
       .fit(data.rdd)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala 
b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
index 7ef2a95..9dea3c3 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
@@ -171,11 +171,17 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+ * `fdr`, `fwe`.
  *  - `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test.
  *  - `percentile` is similar but chooses a fraction of all features instead 
of a fixed number.
  *  - `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false
  *    positive rate of selection.
+ *  - `fdr` uses the [Benjamini-Hochberg procedure]
+ *    
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ *    to choose all features whose false discovery rate is below a threshold.
+ *  - `fwe` chooses all features whose p-value is below a threshold,
+ *    thus controlling the family-wise error rate of selection.
  * By default, the selection method is `numTopFeatures`, with the default 
number of top features
  * set to 50.
  */
@@ -184,6 +190,8 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable 
{
   var numTopFeatures: Int = 50
   var percentile: Double = 0.1
   var fpr: Double = 0.05
+  var fdr: Double = 0.05
+  var fwe: Double = 0.05
   var selectorType = ChiSqSelector.NumTopFeatures
 
   /**
@@ -215,6 +223,20 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
     this
   }
 
+  @Since("2.2.0")
+  def setFdr(value: Double): this.type = {
+    require(0.0 <= value && value <= 1.0, "FDR must be in [0,1]")
+    fdr = value
+    this
+  }
+
+  @Since("2.2.0")
+  def setFwe(value: Double): this.type = {
+    require(0.0 <= value && value <= 1.0, "FWE must be in [0,1]")
+    fwe = value
+    this
+  }
+
   @Since("2.1.0")
   def setSelectorType(value: String): this.type = {
     require(ChiSqSelector.supportedSelectorTypes.contains(value),
@@ -245,6 +267,21 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
       case ChiSqSelector.FPR =>
         chiSqTestResult
           .filter { case (res, _) => res.pValue < fpr }
+      case ChiSqSelector.FDR =>
+        // This uses the Benjamini-Hochberg procedure.
+        // 
https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
+        val tempRes = chiSqTestResult
+          .sortBy { case (res, _) => res.pValue }
+        val maxIndex = tempRes
+          .zipWithIndex
+          .filter { case ((res, _), index) =>
+            res.pValue <= fdr * (index + 1) / chiSqTestResult.length }
+          .map { case (_, index) => index }
+          .max
+        tempRes.take(maxIndex + 1)
+      case ChiSqSelector.FWE =>
+        chiSqTestResult
+          .filter { case (res, _) => res.pValue < fwe / chiSqTestResult.length 
}
       case errorType =>
         throw new IllegalStateException(s"Unknown ChiSqSelector Type: 
$errorType")
     }
@@ -255,19 +292,22 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
 
 private[spark] object ChiSqSelector {
 
-  /**
-   * String name for `numTopFeatures` selector type.
-   */
-  val NumTopFeatures: String = "numTopFeatures"
+  /** String name for `numTopFeatures` selector type. */
+  private[spark] val NumTopFeatures: String = "numTopFeatures"
 
-  /**
-   * String name for `percentile` selector type.
-   */
-  val Percentile: String = "percentile"
+  /** String name for `percentile` selector type. */
+  private[spark] val Percentile: String = "percentile"
 
   /** String name for `fpr` selector type. */
-  val FPR: String = "fpr"
+  private[spark] val FPR: String = "fpr"
+
+  /** String name for `fdr` selector type. */
+  private[spark] val FDR: String = "fdr"
+
+  /** String name for `fwe` selector type. */
+  private[spark] val FWE: String = "fwe"
+
 
   /** Set of selector types that ChiSqSelector supports. */
-  val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, 
Percentile, FPR)
+  val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, 
Percentile, FPR, FDR, FWE)
 }
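
The three threshold-based branches of `selectFeatures` above can be sketched in plain Python. This is a rough transliteration, not Spark's API (`select_features` and `alpha` are my names); note that the Scala `fdr` case takes `.max` of the passing indices, which assumes at least one feature passes, whereas this sketch returns an empty list in that case:

```python
def select_features(p_values, method, alpha=0.05):
    """Sketch of the fpr/fdr/fwe branches of ChiSqSelector.selectFeatures.

    p_values: chi-squared test p-values, one per feature, in feature order.
    Returns the indices of the selected features.
    """
    n = len(p_values)
    indexed = list(enumerate(p_values))
    if method == "fpr":
        # keep features whose raw p-value is below the threshold
        return [i for i, p in indexed if p < alpha]
    if method == "fwe":
        # Bonferroni-style correction: divide the threshold by the number of tests
        return [i for i, p in indexed if p < alpha / n]
    if method == "fdr":
        # Benjamini-Hochberg: sort p-values ascending and keep everything up
        # to the largest rank k (1-based) with p_(k) <= alpha * k / n
        ranked = sorted(indexed, key=lambda kv: kv[1])
        passing = [k for k, (_, p) in enumerate(ranked, 1) if p <= alpha * k / n]
        keep = max(passing, default=0)
        return sorted(i for i, _ in ranked[:keep])
    raise ValueError(f"Unknown ChiSqSelector type: {method}")

# p-values for the six features in the ChiSqSelectorSuite test data in this patch
pv = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]
print(select_features(pv, "fpr", 0.15))  # [0, 1, 3]
print(select_features(pv, "fdr", 0.15))  # [0, 1]
print(select_features(pv, "fwe", 0.3))   # [0, 1]
```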

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala 
b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
index 80970fd..f6c68b9 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
@@ -79,6 +79,12 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext
     ChiSqSelectorSuite.testSelector(selector, dataset)
   }
 
+  test("Test Chi-Square selector: fwe") {
+    val selector = new ChiSqSelector()
+      .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
+    ChiSqSelectorSuite.testSelector(selector, dataset)
+  }
+
   test("read/write") {
     def checkModelData(model: ChiSqSelectorModel, model2: ChiSqSelectorModel): 
Unit = {
       assert(model.selectedFeatures === model2.selectedFeatures)

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
----------------------------------------------------------------------
diff --git 
a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala 
b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
index 77219e5..305cb4c 100644
--- 
a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
+++ 
b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
@@ -27,60 +27,143 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 
   /*
    *  Contingency tables
-   *  feature0 = {8.0, 0.0}
+   *  feature0 = {6.0, 0.0, 8.0}
    *  class  0 1 2
-   *    8.0||1|0|1|
-   *    0.0||0|2|0|
+   *    6.0||1|0|0|
+   *    0.0||0|3|0|
+   *    8.0||0|0|2|
+   *  degree of freedom = 4, statistic = 12, pValue = 0.017
    *
    *  feature1 = {7.0, 9.0}
    *  class  0 1 2
    *    7.0||1|0|0|
-   *    9.0||0|2|1|
+   *    9.0||0|3|2|
+   *  degree of freedom = 2, statistic = 6, pValue = 0.049
    *
-   *  feature2 = {0.0, 6.0, 8.0, 5.0}
+   *  feature2 = {0.0, 6.0, 3.0, 8.0}
    *  class  0 1 2
    *    0.0||1|0|0|
-   *    6.0||0|1|0|
+   *    6.0||0|1|2|
+   *    3.0||0|1|0|
    *    8.0||0|1|0|
-   *    5.0||0|0|1|
+   *  degree of freedom = 6, statistic = 8.66, pValue = 0.193
+   *
+   *  feature3 = {7.0, 0.0, 5.0, 4.0}
+   *  class  0 1 2
+   *    7.0||1|0|0|
+   *    0.0||0|2|0|
+   *    5.0||0|1|1|
+   *    4.0||0|0|1|
+   *  degree of freedom = 6, statistic = 9.5, pValue = 0.147
+   *
+   *  feature4 = {6.0, 5.0, 4.0, 0.0}
+   *  class  0 1 2
+   *    6.0||1|1|0|
+   *    5.0||0|2|0|
+   *    4.0||0|0|1|
+   *    0.0||0|0|1|
+   *  degree of freedom = 6, statistic = 8.0, pValue = 0.238
+   *
+   *  feature5 = {0.0, 9.0, 5.0, 4.0}
+   *  class  0 1 2
+   *    0.0||1|0|1|
+   *    9.0||0|1|0|
+   *    5.0||0|1|0|
+   *    4.0||0|1|1|
+   *  degree of freedom = 6, statistic = 5, pValue = 0.54
    *
    *  Use chi-squared calculator from Internet
    */
 
-  test("ChiSqSelector transform test (sparse & dense vector)") {
-    val labeledDiscreteData = sc.parallelize(
-      Seq(LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
-        LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))),
-        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
-        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))), 2)
+  lazy val labeledDiscreteData = sc.parallelize(
+    Seq(LabeledPoint(0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 
7.0), (4, 6.0)))),
+      LabeledPoint(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), 
(5, 9.0)))),
+      LabeledPoint(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), 
(5, 5.0)))),
+      LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0))),
+      LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
+      LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 
2)
+
+  test("ChiSqSelector transform by numTopFeatures test (sparse & dense 
vector)") {
     val preFilteredData =
-      Seq(LabeledPoint(0.0, Vectors.dense(Array(8.0))),
-        LabeledPoint(1.0, Vectors.dense(Array(0.0))),
-        LabeledPoint(1.0, Vectors.dense(Array(0.0))),
-        LabeledPoint(2.0, Vectors.dense(Array(8.0))))
-    val model = new ChiSqSelector(1).fit(labeledDiscreteData)
+      Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+    val model = new ChiSqSelector(3).fit(labeledDiscreteData)
     val filteredData = labeledDiscreteData.map { lp =>
       LabeledPoint(lp.label, model.transform(lp.features))
-    }.collect().toSeq
+    }.collect().toSet
     assert(filteredData === preFilteredData)
   }
 
-  test("ChiSqSelector by fpr transform test (sparse & dense vector)") {
-    val labeledDiscreteData = sc.parallelize(
-      Seq(LabeledPoint(0.0, Vectors.sparse(4, Array((0, 8.0), (1, 7.0)))),
-        LabeledPoint(1.0, Vectors.sparse(4, Array((1, 9.0), (2, 6.0), (3, 
4.0)))),
-        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 4.0))),
-        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0, 9.0)))), 2)
+  test("ChiSqSelector transform by Percentile test (sparse & dense vector)") {
     val preFilteredData =
-      Seq(LabeledPoint(0.0, Vectors.dense(Array(0.0))),
-        LabeledPoint(1.0, Vectors.dense(Array(4.0))),
-        LabeledPoint(1.0, Vectors.dense(Array(4.0))),
-        LabeledPoint(2.0, Vectors.dense(Array(9.0))))
-    val model: ChiSqSelectorModel = new ChiSqSelector().setSelectorType("fpr")
-      .setFpr(0.1).fit(labeledDiscreteData)
+      Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+    val model = new 
ChiSqSelector().setSelectorType("percentile").setPercentile(0.5)
+      .fit(labeledDiscreteData)
+    val filteredData = labeledDiscreteData.map { lp =>
+      LabeledPoint(lp.label, model.transform(lp.features))
+    }.collect().toSet
+    assert(filteredData === preFilteredData)
+  }
+
+  test("ChiSqSelector transform by FPR test (sparse & dense vector)") {
+    val preFilteredData =
+      Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+    val model = new ChiSqSelector().setSelectorType("fpr").setFpr(0.15)
+      .fit(labeledDiscreteData)
+    val filteredData = labeledDiscreteData.map { lp =>
+      LabeledPoint(lp.label, model.transform(lp.features))
+    }.collect().toSet
+    assert(filteredData === preFilteredData)
+  }
+
+  test("ChiSqSelector transform by FDR test (sparse & dense vector)") {
+    val preFilteredData =
+      Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))))
+
+    val model = new ChiSqSelector().setSelectorType("fdr").setFdr(0.15)
+      .fit(labeledDiscreteData)
+    val filteredData = labeledDiscreteData.map { lp =>
+      LabeledPoint(lp.label, model.transform(lp.features))
+    }.collect().toSet
+    assert(filteredData === preFilteredData)
+  }
+
+  test("ChiSqSelector transform by FWE test (sparse & dense vector)") {
+    val preFilteredData =
+      Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))),
+        LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))))
+
+    val model = new ChiSqSelector().setSelectorType("fwe").setFwe(0.3)
+      .fit(labeledDiscreteData)
     val filteredData = labeledDiscreteData.map { lp =>
       LabeledPoint(lp.label, model.transform(lp.features))
-    }.collect().toSeq
+    }.collect().toSet
     assert(filteredData === preFilteredData)
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/python/pyspark/ml/feature.py
----------------------------------------------------------------------
diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 62c3143..dbd17e0 100755
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -2629,8 +2629,28 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, 
HasOutputCol, HasLabelCol, Ja
     """
     .. note:: Experimental
 
-    Chi-Squared feature selection, which selects categorical features to use 
for predicting a
-    categorical label.
+    Creates a ChiSquared feature selector.
+    The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+    `fdr`, `fwe`.
+
+     * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test.
+
+     * `percentile` is similar but chooses a fraction of all features
+       instead of a fixed number.
+
+     * `fpr` chooses all features whose p-value is below a threshold,
+       thus controlling the false positive rate of selection.
+
+     * `fdr` uses the `Benjamini-Hochberg procedure 
<https://en.wikipedia.org/wiki/
+       False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
+       to choose all features whose false discovery rate is below a threshold.
+
+     * `fwe` chooses all features whose p-value is below a threshold,
+       thus controlling the family-wise error rate of selection.
+
+    By default, the selection method is `numTopFeatures`, with the default 
number of top features
+    set to 50.
+
 
     >>> from pyspark.ml.linalg import Vectors
     >>> df = spark.createDataFrame(
@@ -2676,27 +2696,37 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, 
HasOutputCol, HasLabelCol, Ja
     fpr = Param(Params._dummy(), "fpr", "The highest p-value for features to 
be kept.",
                 typeConverter=TypeConverters.toFloat)
 
+    fdr = Param(Params._dummy(), "fdr", "The upper bound of the expected false 
discovery rate.",
+                typeConverter=TypeConverters.toFloat)
+
+    fwe = Param(Params._dummy(), "fwe", "The upper bound of the expected 
family-wise error rate.",
+                typeConverter=TypeConverters.toFloat)
+
     @keyword_only
     def __init__(self, numTopFeatures=50, featuresCol="features", 
outputCol=None,
-                 labelCol="label", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05):
+                 labelCol="label", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05,
+                 fdr=0.05, fwe=0.05):
         """
         __init__(self, numTopFeatures=50, featuresCol="features", 
outputCol=None, \
-                 labelCol="label", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05)
+                 labelCol="label", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05, \
+                 fdr=0.05, fwe=0.05)
         """
         super(ChiSqSelector, self).__init__()
         self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.ChiSqSelector", self.uid)
         self._setDefault(numTopFeatures=50, selectorType="numTopFeatures", 
percentile=0.1,
-                         fpr=0.05)
+                         fpr=0.05, fdr=0.05, fwe=0.05)
         kwargs = self.__init__._input_kwargs
         self.setParams(**kwargs)
 
     @keyword_only
     @since("2.0.0")
     def setParams(self, numTopFeatures=50, featuresCol="features", 
outputCol=None,
-                  labelCol="labels", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05):
+                  labelCol="labels", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05,
+                  fdr=0.05, fwe=0.05):
         """
         setParams(self, numTopFeatures=50, featuresCol="features", 
outputCol=None, \
-                  labelCol="labels", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05)
+                  labelCol="labels", selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05, \
+                  fdr=0.05, fwe=0.05)
         Sets params for this ChiSqSelector.
         """
         kwargs = self.setParams._input_kwargs
@@ -2761,6 +2791,36 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, 
HasOutputCol, HasLabelCol, Ja
         """
         return self.getOrDefault(self.fpr)
 
+    @since("2.2.0")
+    def setFdr(self, value):
+        """
+        Sets the value of :py:attr:`fdr`.
+        Only applicable when selectorType = "fdr".
+        """
+        return self._set(fdr=value)
+
+    @since("2.2.0")
+    def getFdr(self):
+        """
+        Gets the value of fdr or its default value.
+        """
+        return self.getOrDefault(self.fdr)
+
+    @since("2.2.0")
+    def setFwe(self, value):
+        """
+        Sets the value of :py:attr:`fwe`.
+        Only applicable when selectorType = "fwe".
+        """
+        return self._set(fwe=value)
+
+    @since("2.2.0")
+    def getFwe(self):
+        """
+        Gets the value of fwe or its default value.
+        """
+        return self.getOrDefault(self.fwe)
+
     def _create_model(self, java_model):
         return ChiSqSelectorModel(java_model)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/79ff8536/python/pyspark/mllib/feature.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index bde0f67..61f2bc7 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -274,11 +274,24 @@ class ChiSqSelectorModel(JavaVectorTransformer):
 class ChiSqSelector(object):
     """
     Creates a ChiSquared feature selector.
-    The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
-    `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test.
-    `percentile` is similar but chooses a fraction of all features instead of 
a fixed number.
-    `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false
-    positive rate of selection.
+    The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`,
+    `fdr`, `fwe`.
+
+     * `numTopFeatures` chooses a fixed number of top features according to a 
chi-squared test.
+
+     * `percentile` is similar but chooses a fraction of all features
+       instead of a fixed number.
+
+     * `fpr` chooses all features whose p-value is below a threshold,
+       thus controlling the false positive rate of selection.
+
+     * `fdr` uses the `Benjamini-Hochberg procedure 
<https://en.wikipedia.org/wiki/
+       False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
+       to choose all features whose false discovery rate is below a threshold.
+
+     * `fwe` chooses all features whose p-value is below a threshold,
+       thus controlling the family-wise error rate of selection.
+
     By default, the selection method is `numTopFeatures`, with the default 
number of top features
     set to 50.
 
@@ -305,11 +318,14 @@ class ChiSqSelector(object):
 
     .. versionadded:: 1.4.0
     """
-    def __init__(self, numTopFeatures=50, selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05):
+    def __init__(self, numTopFeatures=50, selectorType="numTopFeatures", 
percentile=0.1, fpr=0.05,
+                 fdr=0.05, fwe=0.05):
         self.numTopFeatures = numTopFeatures
         self.selectorType = selectorType
         self.percentile = percentile
         self.fpr = fpr
+        self.fdr = fdr
+        self.fwe = fwe
 
     @since('2.1.0')
     def setNumTopFeatures(self, numTopFeatures):
@@ -338,11 +354,29 @@ class ChiSqSelector(object):
         self.fpr = float(fpr)
         return self
 
+    @since('2.2.0')
+    def setFdr(self, fdr):
+        """
+        set FDR [0.0, 1.0] for feature selection by FDR.
+        Only applicable when selectorType = "fdr".
+        """
+        self.fdr = float(fdr)
+        return self
+
+    @since('2.2.0')
+    def setFwe(self, fwe):
+        """
+        set FWE [0.0, 1.0] for feature selection by FWE.
+        Only applicable when selectorType = "fwe".
+        """
+        self.fwe = float(fwe)
+        return self
+
     @since('2.1.0')
     def setSelectorType(self, selectorType):
         """
         set the selector type of the ChisqSelector.
-        Supported options: "numTopFeatures" (default), "percentile", "fpr".
+        Supported options: "numTopFeatures" (default), "percentile", "fpr", 
"fdr", "fwe".
         """
         self.selectorType = str(selectorType)
         return self
@@ -358,7 +392,7 @@ class ChiSqSelector(object):
                      Apply feature discretizer before using this function.
         """
         jmodel = callMLlibFunc("fitChiSqSelector", self.selectorType, 
self.numTopFeatures,
-                               self.percentile, self.fpr, data)
+                               self.percentile, self.fpr, self.fdr, self.fwe, 
data)
         return ChiSqSelectorModel(jmodel)
 
 

