[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-05-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21097


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-05-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r186865863
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -367,11 +367,31 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
 
   test("model evaluateEachIteration") {
     val gbt = new GBTClassifier()
+      .setSeed(1L)
       .setMaxDepth(2)
-      .setMaxIter(2)
-    val model = gbt.fit(trainData.toDF)
-    val eval = model.evaluateEachIteration(validationData.toDF)
-    assert(Vectors.dense(eval) ~== Vectors.dense(1.7641, 1.8209) relTol 1E-3)
+      .setMaxIter(3)
+      .setLossType("logistic")
+    val model3 = gbt.fit(trainData.toDF)
+    val model1 = new GBTClassificationModel("gbt-cls-model-test1",
+      model3.trees.take(1), model3.treeWeights.take(1), model3.numFeatures, model3.numClasses)
+    val model2 = new GBTClassificationModel("gbt-cls-model-test2",
+      model3.trees.take(2), model3.treeWeights.take(2), model3.numFeatures, model3.numClasses)
+
+    for (evalLossType <- GBTClassifier.supportedLossTypes) {
--- End diff --

evalLossType is not used, so I'd remove this loop.


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r186565756
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,20 @@ class GBTRegressionModel private[ml](
     new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }
 
+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
--- End diff --

Add doc for "loss" arg, including what the options are


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-05-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r186037589
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
+      val gbt = new GBTClassifier()
+        .setMaxDepth(2)
+        .setMaxIter(2)
+        .setLossType(lossType)
+      val model = gbt.fit(trainData.toDF)
+      val eval1 = model.evaluateEachIteration(validationData.toDF)
+      val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

I searched the scikit-learn docs, and there seems to be no method similar to 
`evaluateEachIteration`; we could only use `staged_predict` in 
`sklearn.ensemble.GradientBoostingRegressor` and then implement almost the 
whole logic again. I could not find such a method in the R packages either.
I have now updated the unit test to compare against hardcoded results.
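For reference, a per-iteration evaluation equivalent to `evaluateEachIteration` can be sketched with scikit-learn's `staged_predict`. This is a hypothetical illustration on synthetic data, not the test used in this PR; the estimator settings are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data; sizes here are illustrative only.
X, y = make_regression(n_samples=200, n_features=4, random_state=1)
model = GradientBoostingRegressor(n_estimators=3, max_depth=2, random_state=1)
model.fit(X, y)

# staged_predict yields predictions after each boosting iteration,
# so mapping a loss over it reproduces the per-iteration loss curve.
losses = [np.mean((y - pred) ** 2) for pred in model.staged_predict(X)]
print(len(losses))  # one loss value per iteration: 3
```

This is essentially the "reimplement the whole logic" step the comment above refers to: the loss computation lives outside the library.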



---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r184760717
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
+      val gbt = new GBTClassifier()
+        .setMaxDepth(2)
+        .setMaxIter(2)
+        .setLossType(lossType)
+      val model = gbt.fit(trainData.toDF)
+      val eval1 = model.evaluateEachIteration(validationData.toDF)
+      val eval2 = GradientBoostedTrees.evaluateEachIteration(validationData,
--- End diff --

This is testing the spark.ml implementation against itself.  I was about to 
recommend using the old spark.mllib implementation as a reference.   However, 
the old implementation is not tested at all.  Would you be able to test against 
a standard implementation in R or scikit-learn (following the patterns used 
elsewhere in MLlib)?


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r184763088
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GBTRegressor.scala ---
@@ -269,6 +269,21 @@ class GBTRegressionModel private[ml](
     new OldGBTModel(OldAlgo.Regression, _trees.map(_.toOld), _treeWeights)
   }
 
+  /**
+   * Method to compute error or loss for every iteration of gradient boosting.
+   *
+   * @param dataset Dataset for validation.
+   */
+  @Since("2.4.0")
+  def evaluateEachIteration(dataset: Dataset[_]): Array[Double] = {
--- End diff --

Do we want to support evaluation on other losses, as in the old API?  It 
might be nice to be able to do so without having to modify the model's loss 
Param value.
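The idea of decoupling the evaluation loss from the training loss can be sketched as follows. This is a hypothetical Python illustration using scikit-learn's `staged_predict`; the helper `evaluate_each_iteration` and the loss callbacks are assumptions for illustration, not part of this PR:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

def evaluate_each_iteration(model, X, y, loss_fn):
    # Score an arbitrary loss after each boosting iteration,
    # independent of the loss the model was trained with.
    return [loss_fn(y, pred) for pred in model.staged_predict(X)]

X, y = make_regression(n_samples=200, n_features=4, random_state=1)
model = GradientBoostingRegressor(n_estimators=3, random_state=1).fit(X, y)

# The same fitted model can be scored under different losses,
# without touching any parameter on the model itself.
mse = evaluate_each_iteration(model, X, y, lambda t, p: np.mean((t - p) ** 2))
mae = evaluate_each_iteration(model, X, y, lambda t, p: np.mean(np.abs(t - p)))
```

Passing the loss as an argument, rather than reading it from the model's Param, is what makes this kind of evaluation possible without mutating the model.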


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-19 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r182829257
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

OK. It makes sense.


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-19 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r182668410
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

Yes, but I think it is ready for the future, in case we add more loss types 
for the GBT classifier.


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-18 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21097#discussion_r182603253
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +365,20 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("model evaluateEachIteration") {
+    for (lossType <- Seq("logistic")) {
--- End diff --

There is only one lossType, so the `for` loop is not necessary.


---




[GitHub] spark pull request #21097: [SPARK-14682][ML] Provide evaluateEachIteration m...

2018-04-18 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/21097

[SPARK-14682][ML] Provide evaluateEachIteration method or equivalent for 
spark.ml GBTs

## What changes were proposed in this pull request?

Provide evaluateEachIteration method or equivalent for spark.ml GBTs.

## How was this patch tested?

Unit tests.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark GBTeval

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21097.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21097


commit 836d760ba89713223d719adcef4c0d47dd306f41
Author: WeichenXu 
Date:   2018-04-18T11:46:17Z

init pr




---
