[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/12467 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218967787 Merged to master and branch-2.0. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218854678 pinging @MLnick --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218829934 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218829938 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58503/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218829747 **[Test build #58503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58503/consoleFull)** for PR 12467 at commit [`3786ef9`](https://github.com/apache/spark/commit/3786ef9b382c8f1942bb1d81812586b7c06cfb5c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218819415 **[Test build #58503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58503/consoleFull)** for PR 12467 at commit [`3786ef9`](https://github.com/apache/spark/commit/3786ef9b382c8f1942bb1d81812586b7c06cfb5c). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218817929 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58502/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218817889 **[Test build #58502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58502/consoleFull)** for PR 12467 at commit [`23f80ee`](https://github.com/apache/spark/commit/23f80eeb1d64e8701daf30c499aae0fbc2739d40). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218817926 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218815288 **[Test build #58502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58502/consoleFull)** for PR 12467 at commit [`23f80ee`](https://github.com/apache/spark/commit/23f80eeb1d64e8701daf30c499aae0fbc2739d40). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218744614 @yanboliang thanks a lot! Will rebase soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218743714 Sorry for late response. This LGTM after the conflicts be resolved. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r63011688 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +290,23 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { +val formula = new RFormula().setFormula("label ~ features") + .setLabelCol("x") + .setFeaturesCol("y") +val dfs = MLTestingUtils.genRegressionDFWithNumericLabelCol(sqlContext) +val expected = formula.fit(dfs(DoubleType)) +val actuals = dfs.keys.filter(_ != DoubleType).map(t => formula.fit(dfs(t))) +actuals.foreach { actual => + assert(expected.pipelineModel.stages.length === actual.pipelineModel.stages.length) + expected.pipelineModel.stages.zip(actual.pipelineModel.stages).foreach { +case (exTransformer, acTransformer) => + assert(exTransformer.params === acTransformer.params) + } + assert(expected.resolvedFormula.label === actual.resolvedFormula.label) + assert(expected.resolvedFormula.terms === actual.resolvedFormula.terms) + assert(expected.resolvedFormula.hasIntercept === actual.resolvedFormula.hasIntercept) +} --- End diff -- Yes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-218730580 Ping @yanboliang --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r61649874 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +290,23 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { +val formula = new RFormula().setFormula("label ~ features") + .setLabelCol("x") + .setFeaturesCol("y") +val dfs = MLTestingUtils.genRegressionDFWithNumericLabelCol(sqlContext) +val expected = formula.fit(dfs(DoubleType)) +val actuals = dfs.keys.filter(_ != DoubleType).map(t => formula.fit(dfs(t))) +actuals.foreach { actual => + assert(expected.pipelineModel.stages.length === actual.pipelineModel.stages.length) + expected.pipelineModel.stages.zip(actual.pipelineModel.stages).foreach { +case (exTransformer, acTransformer) => + assert(exTransformer.params === acTransformer.params) + } + assert(expected.resolvedFormula.label === actual.resolvedFormula.label) + assert(expected.resolvedFormula.terms === actual.resolvedFormula.terms) + assert(expected.resolvedFormula.hasIntercept === actual.resolvedFormula.hasIntercept) +} --- End diff -- @yanboliang is this what you had in mind? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215886842 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57359/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215886840 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215886734 **[Test build #57359 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57359/consoleFull)** for PR 12467 at commit [`79b0f9d`](https://github.com/apache/spark/commit/79b0f9d201f560a203b69630399dd348f62aaae6). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215878828 **[Test build #57359 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57359/consoleFull)** for PR 12467 at commit [`79b0f9d`](https://github.com/apache/spark/commit/79b0f9d201f560a203b69630399dd348f62aaae6). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r61384769 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- Sure, thanks for your input. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215308986 Look good overall, I have my last inline comment. After that, it should be ready to go. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r61372578 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- @BenFradet Sorry for late response. FYI: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/r/RWrappers.scala#L44 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-215190221 This LGTM. @yanboliang anything further? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60735190 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- @MLnick The code in master is right. For the first example: ``` scala> val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y") original: org.apache.spark.sql.DataFrame = [x: int, y: int] scala> formula.fit(original).transform(original).show +---+---++-+ | x| y|features|label| +---+---++-+ | 0| 1| [0.0]| 1.0| | 2| 2| [2.0]| 2.0| +---+---++-+ ``` It confirms with the design of ```RFormula``` which transform dataset with column names ```a, b``` to two columns: ```label``` and ```features```. For the second one: It make sense that L244 catch this exception. In this example, we can not transform ```y``` whose type is ```Seq``` to ```label```, so throw exception inside of function ```transformLabel``` is reasonable. In ```checkCanTransform``` we only have two kinds of guarantee: #1 Label column not exists. #2 If label column exists, it must be type of ```Double/Numeric```. If we are in this situation, we will fall into branch ```if (hasLabelCol(dataset.schema))``` of ```transformLabel```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60730417 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- ah I see now, never mind. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60729785 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- e.g. before this PR this works (and I don't believe it's supposed to?). ``` scala> val original = sqlContext.createDataFrame(Seq((0, 1), (2, 2))).toDF("x", "y") original: org.apache.spark.sql.DataFrame = [x: int, y: int] scala> formula.fit(original).transform(original).show +---+---++-+ | x| y|features|label| +---+---++-+ | 0| 1| [0.0]| 1.0| | 2| 2| [2.0]| 2.0| +---+---++-+ ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60729720 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- +1 @BenFradet It should be ```||``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60729110 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- I don't think so no. What do you think @yanboliang ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60727908 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala --- @@ -254,8 +254,8 @@ class RFormulaModel private[feature]( val columnNames = schema.map(_.name) require(!columnNames.contains($(featuresCol)), "Features column already exists.") require( - !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType == DoubleType, - "Label column already exists and is not of type DoubleType.") + !columnNames.contains($(labelCol)) || schema($(labelCol)).dataType.isInstanceOf[NumericType], --- End diff -- Should the `||` not be `&&`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-212122934 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-212122941 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56258/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-212122615 **[Test build #56258 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56258/consoleFull)** for PR 12467 at commit [`4cb27cf`](https://github.com/apache/spark/commit/4cb27cfe331eb6cc985aaef5c8187d0419dd5474). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-212108645 **[Test build #56258 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56258/consoleFull)** for PR 12467 at commit [`4cb27cf`](https://github.com/apache/spark/commit/4cb27cfe331eb6cc985aaef5c8187d0419dd5474). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60300681 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- After reviewing the suite, I don't think the same tests apply since RFormula also accepts string labels. Consequently, I think it's best as is. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60255560 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- Or simply just: ```scala val schema = dataset.schema SchemaUtils.checkNumericType(schema, $(labelCol)) ``` at the beginning of RFormula's `fit` method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60250070 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- It'd work expect for the expected exception when dealing with a dataframe containing string labels because the label column gets indexed by RFormula's `fit`. Consequently, an exception is thrown by `StringIndexer`. What I could do is add a `validateSchema` to RFormula (called at the beginiing of the `fit` method) checking that the label column is of numeric type, then I could use `MLTestingUtils.checkNumericTypes`. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/12467#discussion_r60245818 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala --- @@ -290,4 +291,18 @@ class RFormulaSuite extends SparkFunSuite with MLlibTestSparkContext with Defaul val newModel = testDefaultReadWrite(model) checkModelData(model, newModel) } + + test("should support all NumericType labels") { --- End diff -- Can we use ```MLTestingUtils.checkNumericTypes``` to test this? It will eliminate some redundant code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211898854 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211898857 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56220/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211898483 **[Test build #56220 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56220/consoleFull)** for PR 12467 at commit [`c9edad0`](https://github.com/apache/spark/commit/c9edad04113f2c75980b2cfc2f74128f7f5259ad). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211876908 **[Test build #56220 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56220/consoleFull)** for PR 12467 at commit [`c9edad0`](https://github.com/apache/spark/commit/c9edad04113f2c75980b2cfc2f74128f7f5259ad). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211516223 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211516230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56068/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211516011 **[Test build #56068 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56068/consoleFull)** for PR 12467 at commit [`f05c217`](https://github.com/apache/spark/commit/f05c217e0e87a9f9758a1d380b80d1f39751fffe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211487432 **[Test build #56068 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56068/consoleFull)** for PR 12467 at commit [`f05c217`](https://github.com/apache/spark/commit/f05c217e0e87a9f9758a1d380b80d1f39751fffe). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211467763 @yanboliang yup, no problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211466549 @BenFradet I'm sorry that I did not notice the #10355 has been merged. Please ignore my last comments because it's not valid after #10355. Would you mind to add RFormula support back? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211457885 @yanboliang thanks for your input, I reverted the affected commits --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12467#issuecomment-211451163 @BenFradet Thanks for this PR. I think we can not make ```RFormula``` supporting other numeric types for label as your proposal. If the label column already exists, it must be type of DoubleType, otherwise it will cause the downstream model can not recognize the label column when ```validateAndTransformSchema```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13961][ML] spark.ml ChiSqSelector and R...
GitHub user BenFradet opened a pull request: https://github.com/apache/spark/pull/12467 [SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label ## What changes were proposed in this pull request? Made ChiSqSelector and RFormula accept all numeric types for label ## How was this patch tested? Unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/BenFradet/spark SPARK-13961 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12467.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12467 commit ac7db9ec38e72655b8312dfbd8e39b9c7cf84520 Author: BenFradet Date: 2016-04-06T16:53:08Z ChiSqSelector now accepts numeric labels commit cd62e84fb54471edbf88a369bde8e51398894450 Author: BenFradet Date: 2016-04-06T18:03:46Z spec for ChiSqSelector accepting numeric types commit 0bc5057c14552739dcad3be0c5bd1b58e4849804 Author: BenFradet Date: 2016-04-18T07:04:21Z RFormula now accepts all numeric type commit c9edad04113f2c75980b2cfc2f74128f7f5259ad Author: BenFradet Date: 2016-04-18T07:04:51Z spec for RFormula accepting all numeric types --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org