[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-20 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84232802
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
+val tempRDD = chiSqTestResult
+  .sortBy { case (res, _) => res.pValue }
+val maxIndex = tempRDD
+  .zipWithIndex
--- End diff --

I have added a large size data sample in the test Suite, and updated the 
Contingency tables in the test Suite comments. The value of degree of freedom, 
statistic and pValue for each feature is also added. so it is easy to validate 
the result.

SelectFDR in sklearn is not exact. My PR to fix the bug is merged today. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84049805
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
+   * alpha means the highest uncorrected p-value for features to be kept 
when select type
--- End diff --

updated, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84049606
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
--- End diff --

updated, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84043945
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -127,7 +134,9 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 
   /** @group setParam */
   @Since("2.1.0")
-  def setAlpha(value: Double): this.type = set(alpha, value)
+  def setAlpha(value: Double): this.type = {
--- End diff --

Why this change?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84044151
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
--- End diff --

It's definitely used -- you have to keep the original index in order to 
pass them to the model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84044410
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -263,9 +278,15 @@ object ChiSqSelector {
   /** String name for `fpr` selector type. */
   private[spark] val FPR: String = "fpr"
 
+  /** String name for `fdr` selector type. */
+  private[spark] val FDR: String = "fdr"
--- End diff --

I know this applies to the existing line above too, but, this comment isn't 
descriptive. You can spell out what all of these mean if there's javadoc at all 
here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84043926
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -104,6 +108,9 @@ private[feature] trait ChiSqSelectorParams extends 
Params
  * `kbest` chooses the `k` top features according to a chi-squared test.
  * `percentile` is similar but chooses a fraction of all features instead 
of a fixed number.
  * `fpr` chooses all features whose false positive rate meets some 
threshold.
+ * `fpr` select features based on a false positive rate test.
--- End diff --

This has two lines for fpr. The existing text is more descriptive. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84044249
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
+val tempRDD = chiSqTestResult
+  .sortBy { case (res, _) => res.pValue }
+val maxIndex = tempRDD
+  .zipWithIndex
--- End diff --

This zipWithIndex is however not correct it seems. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84042201
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
+   * alpha means the highest uncorrected p-value for features to be kept 
when select type
--- End diff --

```, or the highest ...```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84040939
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
--- End diff --

Irrelevent to this PR, we can eliminate ```zipWithIndex``` at L235, since 
no one use it. Would you mind to do this clean up in your PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84042900
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala ---
@@ -82,6 +82,18 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext
 case Row(vec1: Vector, vec2: Vector) =>
   assert(vec1 ~== vec2 absTol 1e-1)
   }
+
+selector.setSelectorType("fwe").setAlpha(0.5).fit(df2).transform(df2)
+  .select("filtered", "preFilteredData").collect().foreach {
+  case Row(vec1: Vector, vec2: Vector) =>
+assert(vec1 ~== vec2 absTol 1e-1)
+}
+
+selector.setSelectorType("fdr").setAlpha(0.21).fit(df2).transform(df2)
+  .select("filtered", "preFilteredData").collect().foreach {
+  case Row(vec1: Vector, vec2: Vector) =>
+assert(vec1 ~== vec2 absTol 1e-1)
+}
--- End diff --

Enforce this test case to large dimension data, and output different 
selected features according to selector type as far as possible.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84041266
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
+val tempRDD = chiSqTestResult
+  .sortBy { case (res, _) => res.pValue }
+val maxIndex = tempRDD
+  .zipWithIndex
+  .filter { case ((res, _), index) =>
+res.pValue <= alpha * (index + 1) / chiSqTestResult.length }
+  .map { case (_, index) => index}
--- End diff --

```index}``` to ```index }```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84042933
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2624,7 +2624,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, 
HasOutputCol, HasLabelCol, Ja
"will select, ordered by statistics value 
descending.",
typeConverter=TypeConverters.toFloat)
 
-alpha = Param(Params._dummy(), "alpha", "The highest p-value for 
features to be kept.",
+alpha = Param(Params._dummy(), "alpha", "alpha means the highest 
p-value for features " +
+  "to be kept when select type is fpr, alpha means the 
highest uncorrected " +
--- End diff --

Ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84041851
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
--- End diff --

We should keep ```Only applicable when selectorType = "fpr", "fdr" or 
"fwe".```, since it's not applicable to other selector types such as "kbest" 
and "percentile".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84041519
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
+val tempRDD = chiSqTestResult
--- End diff --

Use another name, since it's not an RDD.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84042944
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2700,7 +2702,6 @@ def getPercentile(self):
 def setAlpha(self, value):
 """
 Sets the value of :py:attr:`alpha`.
-Only applicable when selectorType = "fpr".
--- End diff --

Ditto.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84042250
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
+   * alpha means the highest uncorrected p-value for features to be kept 
when select type
+   * is "fdr" and "fwe".
* Default value is 0.05.
*/
-  final val alpha = new DoubleParam(this, "alpha", "The highest p-value 
for features to be kept.",
+  final val alpha = new DoubleParam(this, "alpha",
+"alpha means the highest p-value for features to be kept when select 
type is fpr, " +
+  "alpha means the highest uncorrected p-value for features to be kept 
when select type " +
--- End diff --

```, or the highest ...```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84038840
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends 
Serializable {
   case ChiSqSelector.FPR =>
 chiSqTestResult
   .filter { case (res, _) => res.pValue < alpha }
+  case ChiSqSelector.FDR =>
--- End diff --

Add docs to clarify ```This uses the Benjamini-Hochberg procedure```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org