Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4906#discussion_r26056473
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
    @@ -69,6 +74,42 @@ class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
           case _ =>
            throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
         }
    +    baseLearners = fitGradientBoostingModel.trees
    +    baseLearnerWeights = fitGradientBoostingModel.treeWeights
    +    fitGradientBoostingModel
    +  }
    +
    +  /**
    +   * Method to compute error or loss for every iteration of gradient boosting.
    +   * @param data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]]
    +   * @param loss: evaluation metric that defaults to boostingStrategy.loss
    +   * @return an array with index i having the losses or errors for the ensemble
    +   *         containing trees 1 to i + 1
    +   */
    +  def evaluateEachIteration(
    +      data: RDD[LabeledPoint],
    +      loss: Loss = boostingStrategy.loss) : Array[Double] = {
    +
    +    val algo = boostingStrategy.treeStrategy.algo
    +    val remappedData = algo match {
    +      case Classification => data.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
    +      case _ => data
    +    }
    +    val initialTree = baseLearners(0)
    +    val evaluationArray = Array.fill(numIterations)(0.0)
    +
    +    // Initial weight is 1.0
    +    var predictionRDD = remappedData.map(i => initialTree.predict(i.features))
    +    evaluationArray(0) = loss.computeError(remappedData, predictionRDD)
    +
    +    (1 until numIterations).map {nTree =>
    --- End diff --
    
    This does numIterations maps, broadcasting the model numIterations times.  I'd recommend using a broadcast variable for the model to make sure it's only sent once.
    
    You could keep the current approach pretty much as-is, but it does numIterations actions, so it's a bit inefficient.  You could optimize it by using only 1 map, but that would require modifying the computeError method as follows:
    * computeError could be overloaded to take (prediction: Double, datum: LabeledPoint).  This could replace the computeError method you implemented.
    * Here, in evaluateEachIteration, you could call predictionRDD.map and, within the map, for each data point, evaluate each tree on that point, compute the prediction from each iteration via a cumulative sum, and then call computeError on each prediction.
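    The single-map idea can be sketched with plain Scala collections standing in for the RDD (so it runs without a SparkContext) and simple `Double => Double` functions standing in for the fitted trees.  The object and parameter names here are hypothetical, and `loss` is a per-point `(prediction, label) => Double` function rather than Spark's `Loss` trait:

```scala
object EvaluateEachIterationSketch {
  // Hypothetical stand-ins: each "tree" is a Double => Double predictor
  // (in Spark this would be DecisionTreeModel.predict on a feature vector),
  // and data is (label, feature) pairs in place of an RDD[LabeledPoint].
  def evaluateEachIteration(
      data: Seq[(Double, Double)],
      trees: Array[Double => Double],
      treeWeights: Array[Double],
      loss: (Double, Double) => Double): Array[Double] = {
    val n = data.length.toDouble
    // One pass over the data: for each point, walk the trees once,
    // accumulating the ensemble prediction via a cumulative sum and
    // recording the per-point loss after each iteration.
    val perPointLosses: Seq[Array[Double]] = data.map { case (label, feature) =>
      var pred = 0.0
      trees.indices.toArray.map { i =>
        pred += treeWeights(i) * trees(i)(feature)
        loss(pred, label)
      }
    }
    // Average each iteration's loss across all points.
    perPointLosses.transpose.map(_.sum / n).toArray
  }
}
```

    With squared-error loss, two identical trees weighted 0.5 each, and a point the full ensemble predicts exactly, the per-iteration losses shrink from 0.25 after the first tree to 0.0 after the second.  In the real method the outer map would run on the RDD, with the trees and weights sent via a broadcast variable as suggested above.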

