[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84232802 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => +val tempRDD = chiSqTestResult + .sortBy { case (res, _) => res.pValue } +val maxIndex = tempRDD + .zipWithIndex --- End diff -- I have added a large size data sample in the test Suite, and updated the Contingency tables in the test Suite comments. The value of degree of freedom, statistic and pValue for each feature is also added. so it is easy to validate the result. SelectFDR in sklearn is not exact. My PR to fix the bug is merged today. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84049805 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". + * alpha means the highest uncorrected p-value for features to be kept when select type --- End diff -- updated, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84049606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". --- End diff -- updated, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84043945 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -127,7 +134,9 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str /** @group setParam */ @Since("2.1.0") - def setAlpha(value: Double): this.type = set(alpha, value) + def setAlpha(value: Double): this.type = { --- End diff -- Why this change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84044151 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => --- End diff -- It's definitely used -- you have to keep the original index in order to pass them to the model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84044410 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -263,9 +278,15 @@ object ChiSqSelector { /** String name for `fpr` selector type. */ private[spark] val FPR: String = "fpr" + /** String name for `fdr` selector type. */ + private[spark] val FDR: String = "fdr" --- End diff -- I know this applies to the existing line above too, but, this comment isn't descriptive. You can spell out what all of these mean if there's javadoc at all here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84043926 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -104,6 +108,9 @@ private[feature] trait ChiSqSelectorParams extends Params * `kbest` chooses the `k` top features according to a chi-squared test. * `percentile` is similar but chooses a fraction of all features instead of a fixed number. * `fpr` chooses all features whose false positive rate meets some threshold. + * `fpr` select features based on a false positive rate test. --- End diff -- This has two lines for fpr. The existing text is more descriptive. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84044249 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => +val tempRDD = chiSqTestResult + .sortBy { case (res, _) => res.pValue } +val maxIndex = tempRDD + .zipWithIndex --- End diff -- This zipWithIndex is however not correct it seems. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84042201 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". + * alpha means the highest uncorrected p-value for features to be kept when select type --- End diff -- ```, or the highest ...``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84040939 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => --- End diff -- Irrelevent to this PR, we can eliminate ```zipWithIndex``` at L235, since no one use it. Would you mind to do this clean up in your PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84042900 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala --- @@ -82,6 +82,18 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext case Row(vec1: Vector, vec2: Vector) => assert(vec1 ~== vec2 absTol 1e-1) } + +selector.setSelectorType("fwe").setAlpha(0.5).fit(df2).transform(df2) + .select("filtered", "preFilteredData").collect().foreach { + case Row(vec1: Vector, vec2: Vector) => +assert(vec1 ~== vec2 absTol 1e-1) +} + +selector.setSelectorType("fdr").setAlpha(0.21).fit(df2).transform(df2) + .select("filtered", "preFilteredData").collect().foreach { + case Row(vec1: Vector, vec2: Vector) => +assert(vec1 ~== vec2 absTol 1e-1) +} --- End diff -- Enforce this test case to large dimension data, and output different selected features according to selector type as far as possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84041266 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => +val tempRDD = chiSqTestResult + .sortBy { case (res, _) => res.pValue } +val maxIndex = tempRDD + .zipWithIndex + .filter { case ((res, _), index) => +res.pValue <= alpha * (index + 1) / chiSqTestResult.length } + .map { case (_, index) => index} --- End diff -- ```index}``` to ```index }``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84042933 --- Diff: python/pyspark/ml/feature.py --- @@ -2624,7 +2624,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja "will select, ordered by statistics value descending.", typeConverter=TypeConverters.toFloat) -alpha = Param(Params._dummy(), "alpha", "The highest p-value for features to be kept.", +alpha = Param(Params._dummy(), "alpha", "alpha means the highest p-value for features " + + "to be kept when select type is fpr, alpha means the highest uncorrected " + --- End diff -- Ditto. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84041851 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". --- End diff -- We should keep ```Only applicable when selectorType = "fpr", "fdr" or "fwe".```, since it's not applicable to other selector types such as "kbest" and "percentile". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84041519 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => +val tempRDD = chiSqTestResult --- End diff -- Use another name, since it's not an RDD. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84042944 --- Diff: python/pyspark/ml/feature.py --- @@ -2700,7 +2702,6 @@ def getPercentile(self): def setAlpha(self, value): """ Sets the value of :py:attr:`alpha`. -Only applicable when selectorType = "fpr". --- End diff -- Ditto. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84042250 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". + * alpha means the highest uncorrected p-value for features to be kept when select type + * is "fdr" and "fwe". * Default value is 0.05. */ - final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.", + final val alpha = new DoubleParam(this, "alpha", +"alpha means the highest p-value for features to be kept when select type is fpr, " + + "alpha means the highest uncorrected p-value for features to be kept when select type " + --- End diff -- ```, or the highest ...``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84038840 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { case ChiSqSelector.FPR => chiSqTestResult .filter { case (res, _) => res.pValue < alpha } + case ChiSqSelector.FDR => --- End diff -- Add docs to clarify ```This uses the Benjamini-Hochberg procedure```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org