[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69770334 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- Ah, I see. Thanks for the note. I wasn't familiar with the Dataset API till now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69705146 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- @jaceklaskowski I think it's better to do such improvement in a separate PR which will involve lots of similar changes for all test suites? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/13981 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69703134 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- Before `collect` so you work with type-safe `Dataset` (not a `DataFrame` which is `Dataset[Row]`) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69660324 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- You mean after the collect? It fails with `is not a member of Seq[org.apache.spark.sql.Row]` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69631781 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +97,25 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val varianceData: RDD[LabeledPoint] = TreeTests.varianceData(sc) +val varianceDF = TreeTests.setMetadata(varianceData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val transformVarDF = dt.fit(varianceDF).transform(varianceDF) +val calculatedVariances = transformVarDF.select(dt.getVarianceCol).collect().map { --- End diff -- I wonder if `toDS` worked here and you'd have `variance` "simpler". --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69422844 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- You are right, please ignore my last comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69419165 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- If you would like to reduce the number of warnings, then this should be kept as is (unless I am misunderstanding something) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69410786 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -36,10 +37,21 @@ class DecisionTreeRegressorSuite private var categoricalDataPointsRDD: RDD[LabeledPoint] = _ + private var toyData: RDD[LabeledPoint] = _ + override def beforeAll() { super.beforeAll() + categoricalDataPointsRDD = sc.parallelize(OldDecisionTreeSuite.generateCategoricalDataPoints().map(_.asML)) +toyData = sc.parallelize(Seq( --- End diff -- Move ```toyData``` to ```TreeTests```. You can refer [Feature importance with toy data](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala#L108). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69410553 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- I'd like to remove the explicit setting since the default value(32) meets your needs. We want to make Jenkins logging clean and reduce the number of warnings if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69363260 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- "Explicit is better than implicit" ;) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69350493 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,22 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val calculatedVariances = dt.fit(toyDF).transform(toyDF).select("variance").collect().map { --- End diff -- Can we use `.select(dt.getVarianceCol)` instead? Also, the closing brace should go on the next line, like: ```scala .map { case Row(...) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69348071 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- Ok, not sure it needs to be set since the test is fine without it. The warning is more just helpful info logging. I'm fine to leave it since it doesn't matter either way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69333966 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- I verified and my intuition was correct. I get this warning for the default setting: WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 6 (= number of training instances) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user MechCoder commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69332114 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- Because there are 6 datapoints, and I want each datapoint to be a split. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69298953 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val expectVariances = dt.fit(toyDF).transform(toyDF).select("variance").collect().map { + case Row(variance: Double) => variance } +val trueVariances = Array(0.667, 0.667, 0.667, 2.667, 2.667, 2.667) +trueVariances.zip(expectVariances).foreach(x => x._1 ~== x._2 absTol 1e-3) --- End diff -- Although this technically works, it is less confusing if use `assert` and unpack the tuple. Like ```scala ...foreach { case (actual, expected) => assert(actual ~== expected absTol 1e-3) } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69298648 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) + .setSeed(0) +val expectVariances = dt.fit(toyDF).transform(toyDF).select("variance").collect().map { --- End diff -- `expectedVariances` and `trueVariances` are mixed up here. Expected should be the theoretical value computed below. Also, it would be good to leave a comment explaining where those expected values came from. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/13981#discussion_r69298384 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/DecisionTreeRegressorSuite.scala --- @@ -96,6 +108,15 @@ class DecisionTreeRegressorSuite assert(variance === expectedVariance, s"Expected variance $expectedVariance but got $variance.") } + +val toyDF = TreeTests.setMetadata(toyData, Map.empty[Int, Int], 0) +dt.setMaxDepth(1) + .setMaxBins(6) --- End diff -- Not sure why we need to set maxBins here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13981: [SPARK-16307] [ML] Add test to verify the predict...
GitHub user MechCoder opened a pull request: https://github.com/apache/spark/pull/13981 [SPARK-16307] [ML] Add test to verify the predicted variances of a DT on toy data ## What changes were proposed in this pull request? The current tests assumes that `impurity.calculate()` returns the variance correctly. It should be better to make the tests independent of this assumption. In other words verify that the variance computed equals the variance computed manually on a small tree. ## How was this patch tested? The patch is a test You can merge this pull request into a Git repository by running: $ git pull https://github.com/MechCoder/spark dt_variance Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13981.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13981 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org