[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21097

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186865863

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -367,11 +367,31 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
   test("model evaluateEachIteration") {
     val gbt = new GBTClassifier()
+      .setSeed(1L)
       .setMaxDepth(2)
-      .setMaxIter(2)
-    val model = gbt.fit(trainData.toDF)
-    val eval = model.evaluateEachIteration(validationData.toDF)
-    assert(Vectors.dense(eval) ~== Vectors.dense(1.7641, 1.8209) relTol 1E-3)
+      .setMaxIter(3)
+      .setLossType("logistic")
+    val model3 = gbt.fit(trainData.toDF)
+    val model1 = new GBTClassificationModel("gbt-cls-model-test1",
+      model3.trees.take(1), model3.treeWeights.take(1), model3.numFeatures, model3.numClasses)
+    val model2 = new GBTClassificationModel("gbt-cls-model-test2",
+      model3.trees.take(2), model3.treeWeights.take(2), model3.numFeatures, model3.numClasses)
+
+    for (evalLossType <- GBTClassifier.supportedLossTypes) {
--- End diff --

evalLossType is not used, so I'd remove this loop.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186565756

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,20 @@ class GBTRegressionModel private[ml](
     new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }

+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
--- End diff --

Add doc for the "loss" arg, including what the options are.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r186037589

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
+      val gbt = new GBTClassifier()
+        .setMaxDepth(2)
+        .setMaxIter(2)
+        .setLossType(lossType)
+      val model = gbt.fit(trainData.toDF)
+      val eval1 = model.evaluateEachIteration(validationData.toDF)
+      val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

I searched the scikit-learn docs, and there appears to be no method like `evaluateEachIteration`; we could only use `staged_predict` in `sklearn.ensemble.GradientBoostingRegressor` and then reimplement almost the whole logic ourselves. I could not find such a method in any R package either. I have now updated the unit test to compare against hardcoded results.
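[Editor's note] A minimal sketch (not part of the PR) of the workaround described above: scikit-learn exposes no `evaluateEachIteration`, but `staged_predict` yields the ensemble's predictions after each boosting iteration, so the per-iteration validation loss can be recomputed by hand. The dataset and hyperparameters here are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, split into train and validation sets.
X, y = make_regression(n_samples=200, n_features=5, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

gbr = GradientBoostingRegressor(max_depth=2, n_estimators=3, random_state=1)
gbr.fit(X_train, y_train)

# staged_predict yields predictions after iterations 1, 2, ..., n_estimators,
# so the squared-error loss per boosting iteration is:
errors = [np.mean((y_val - y_pred) ** 2) for y_pred in gbr.staged_predict(X_val)]
assert len(errors) == gbr.n_estimators
```

This is essentially the "reimplement almost the whole logic" step the comment refers to: the loss computation lives outside the library and must match whatever loss the MLlib test hardcodes.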
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r184760717

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
+      val gbt = new GBTClassifier()
+        .setMaxDepth(2)
+        .setMaxIter(2)
+        .setLossType(lossType)
+      val model = gbt.fit(trainData.toDF)
+      val eval1 = model.evaluateEachIteration(validationData.toDF)
+      val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

This is testing the spark.ml implementation against itself. I was about to recommend using the old spark.mllib implementation as a reference. However, the old implementation is not tested at all. Would you be able to test against a standard implementation in R or scikit-learn (following the patterns used elsewhere in MLlib)?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r184763088

--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,21 @@ class GBTRegressionModel private[ml](
     new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }

+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
--- End diff --

Do we want to support evaluation on other losses, as in the old API? It would be nice to be able to do so without having to modify the Model's loss Param value.
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r182829257

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

OK, that makes sense.
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r182668410

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

Yes, but I think it leaves room for the future, in case we add more loss types for the GBT classifier.
Github user wangmiao1981 commented on a diff in the pull request: https://github.com/apache/spark/pull/21097#discussion_r182603253

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

There is only one lossType, so the `for` loop is not necessary.
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/21097

[SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for spark.ml GBTs

## What changes were proposed in this pull request?

Provide evaluateEachIteration method or equivalent for spark.ml GBTs.

## How was this patch tested?

UT.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark GBTeval

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21097.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21097

commit 836d760ba89713223d719adcef4c0d47dd306f41
Author: WeichenXu
Date: 2018-04-18T11:46:17Z

    init pr