[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326590#comment-14326590 ]

Apache Spark commented on SPARK-5436:

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/4677

Validate GradientBoostedTrees during training

                 Key: SPARK-5436
                 URL: https://issues.apache.org/jira/browse/SPARK-5436
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.3.0
            Reporter: Joseph K. Bradley

For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325829#comment-14325829 ]

Manoj Kumar commented on SPARK-5436:

The idea sounds great. I shall come up with a Pull Request in a day or two.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323254#comment-14323254 ]

Chris T commented on SPARK-5436:

That sounds like a good idea to me, with the caveat that if the convergenceTolerance was set to 0, then the algorithm runs until the full number of boosting iterations has been reached. This way users could iterate until convergence, or just build a model with N trees. Both seem like reasonable use-cases.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323248#comment-14323248 ]

Joseph K. Bradley commented on SPARK-5436:

I think it would be nice to have a stopping criterion, but it should be more like a convergence tolerance than a target error rate (since that can't be known a priori, as [~ChrisT] said). The test error of each iteration's model should be compared with the error from the previous iteration. If it ever decreases by less than convergenceTol, then we stop. I'd vote for 0 or something small like 1e-5 for a default value. How does that sound?
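A minimal sketch of the stopping rule described above, with the per-iteration validation errors precomputed as a plain Seq[Double]. The function name and the stand-alone form are illustrative only; in MLlib this check would run inside the boosting loop itself.

```scala
// Hypothetical helper: given the validation error after each boosting
// iteration, return how many iterations to keep under the proposed rule.
def boostingIterations(validationErrors: Seq[Double],
                       convergenceTol: Double): Int = {
  var i = 1
  // Continue while each new model improves on the previous one by at
  // least convergenceTol; stop the first time the improvement is smaller.
  while (i < validationErrors.length &&
         validationErrors(i - 1) - validationErrors(i) >= convergenceTol) {
    i += 1
  }
  i
}
```

With convergenceTol = 0, training continues as long as the validation error is non-increasing, so the rule degrades gracefully into "stop once the error starts going up."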
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323335#comment-14323335 ]

Chris T commented on SPARK-5436:

Aha, that's a neat solution. I like it!
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323269#comment-14323269 ]

Chris T commented on SPARK-5436:

I think we need to allow the use-case where the user specifies the number of iterations to run, and doesn't care about whether the model is overfitting. How would this be implemented?
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323282#comment-14323282 ]

Joseph K. Bradley commented on SPARK-5436:

If they call train/fit with only a training RDD, then it will not check for overfitting. We could provide a helper function for computing the error rate on a new dataset at each iteration in GradientBoostedTreesModel.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323317#comment-14323317 ]

Chris T commented on SPARK-5436:

There is already a predict method in the model object, so in principle this can already be achieved. Currently we are iteratively extracting sub-models (with one additional tree in the model per iteration), calling predict() on the sub-model, and calculating the error (in our case MSE for a regression model). I think the helper function you're proposing does just this, right?

It seemed to me that, since the error is calculated internally while the model is being built, it is essentially free to just store this number as the model builds. But fair enough if you don't want to add complexity to the API, or confusion on differing use cases. I don't have a good sense of how small the cost is to do the error calculation after the fact, but for large datasets it may be non-trivial. In any case, I think some of this discussion is fairly academic. :)
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323293#comment-14323293 ]

Joseph K. Bradley commented on SPARK-5436:

The cost of computing the error after training, rather than caching it during training, seems negligible (since tree training takes much longer than prediction). I'd vote for keeping the API simple, rather than adding options which could be handled using the existing API. If users find that prediction takes as long as training, then we should investigate.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323289#comment-14323289 ]

Chris T commented on SPARK-5436:

I thought about this too, but I think there are cases where a user might wish to build a model with N trees, and examine the error rate after the fact. We might, for example, be worried about finding global vs. local minima, want to assess the rate at which a model starts to overfit, or want to do some other kind of testing. There are valid reasons why we might want both a specified number of trees and the model scoring independently against a testData RDD during the build phase. It seems both of these cases could easily be supported concurrently.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323323#comment-14323323 ]

Joseph K. Bradley commented on SPARK-5436:

Yep, that sounds like what I had in mind:
{code}
def evaluateEachIteration(data: RDD[LabeledPoint], evaluator or maybe use training metric): Array[Double]
{code}
where it essentially calls predict() once but keeps the intermediate results after each boosting stage, so that it runs in the same big-O time as predict().
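A plain-Scala sketch of that one-pass idea: keep a running prediction per point and update it with each stage's tree, so all N per-stage errors cost about as much as a single predict() over the full model. Here trees are stood in for by Double => Double functions and the metric is MSE; both are assumptions for illustration, not the MLlib types.

```scala
// Hypothetical one-pass evaluation over a toy ensemble: the running
// prediction for each point is the weighted sum of tree outputs so far.
def evaluateEachIteration(data: Seq[(Double, Double)],      // (feature, label)
                          trees: Seq[Double => Double],
                          treeWeights: Seq[Double]): Array[Double] = {
  val running = Array.fill(data.length)(0.0)  // prediction so far, per point
  trees.indices.map { t =>
    var sumSq = 0.0
    for (i <- data.indices) {
      // Fold stage t into the cached prediction instead of re-predicting
      // the whole prefix from scratch.
      running(i) += treeWeights(t) * trees(t)(data(i)._1)
      val d = running(i) - data(i)._2
      sumSq += d * d
    }
    sumSq / data.length                        // MSE after stage t + 1
  }.toArray
}
```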
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318915#comment-14318915 ]

Chris T commented on SPARK-5436:

I haven't been able to make headway on this. [~MechCoder], I suggest you take this on.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021 ]

Manoj Kumar commented on SPARK-5436:

Hi, I would like to give this a go. [~ChrisT] are you still working on this? Otherwise I would love to carry this forward.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294403#comment-14294403 ]

Chris T commented on SPARK-5436:

I think, then, the only addition needed is to retain the mean loss on every iteration. This is computed and emitted to the log on each build iteration:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L179

The question then becomes where to store this error value. Is it a property of the tree or the model? For a DecisionTree, I can see how the concept of the error applies. For a random forest, since each tree is independent of the others, that also makes sense. But for a GBT model, the model with N trees is dependent on the model with N-1 trees, so if I extract the Nth tree and request the error value, I have to be aware that this is not the error for this tree alone. I suspect this is fine...anyone building a GBT model would likely understand this. It's just a little weird to store a property of an object that is dependent on other objects in the ensemble.
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294523#comment-14294523 ]

Joseph K. Bradley commented on SPARK-5436:

That sounds good. I think the main challenge in this JIRA is specifying the API for passing 2 datasets to the algorithm instead of 1. Basically, it will be good to make sure that other algorithms can follow a similar API. Some possibilities are:
* Pass in a pair of RDDs, one for training and one for validation.
* Pass in 1 RDD, plus parameters for how to select a random subsample for validation.

I vote for the first option since it is more flexible than the second. Another question is whether to pass in a separate validation metric. I vote for not allowing this since the API could always be extended later on. So...it sounds like a simple API but may get some discussion from other reviewers. Would you be interested in working on this?
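One reason the pair-of-datasets option is the more flexible of the two: the single-RDD-plus-fraction option reduces to it by splitting once up front. A sketch of that reduction, with Seq standing in for RDD (in Spark itself, RDD.randomSplit would play this role); the function name and parameters here are illustrative, not an existing API.

```scala
import scala.util.Random

// Hypothetical helper: split one dataset into (training, validation)
// before handing both halves to a two-dataset training entry point.
def validationSplit[T](data: Seq[T],
                       validationFraction: Double,
                       seed: Long): (Seq[T], Seq[T]) = {
  val rng = new Random(seed)
  // Assign each point independently to the validation set with the given
  // probability; everything else stays in the training set.
  val (validation, training) =
    data.partition(_ => rng.nextDouble() < validationFraction)
  (training, validation)
}
```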
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294360#comment-14294360 ]

Joseph K. Bradley commented on SPARK-5436:

Yes, it would be reasonable to take the same Loss (metric) which GBT tries to minimize on the training set and re-use that Loss for validation. (Eventually, we could let the user specify a different metric, but I vote for keeping it simple for now.)
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294335#comment-14294335 ]

Chris T commented on SPARK-5436:

The usual way that GBT models are evaluated is by calculating an error metric on a hold-out test/validation data set. The error metric is often something simple, like Mean Squared Error. When a plot of MSE vs. the number of trees in the model is made, we typically see something like this:
https://dynamicecology.files.wordpress.com/2013/08/calibvalid2.jpg

In the early stages, the model predictions improve. After the model passes the optimal number of trees, the predictions degrade, due to the model overfitting.

At the moment, one solution to obtain this information has been to extract the trees from the model (GradientBoostedTreesModel.trees returns an Array of DecisionTreeModel), and iteratively recreate a sub-model, scoring the test data against each sub-model. This is fairly expensive. Is there a model error metric that is calculated internally (e.g. by the gradient descent algorithm)? If this was retained, I think there would be a lot of value. Ideally, it would retain the model error for each tree during the build phase. It would then be fairly trivial to create a sub-model that yields optimal predictions.
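A sketch of the workaround described above: score the hold-out set against each k-tree prefix of the ensemble. Trees are modeled as plain Double => Double functions and the metric is MSE, both assumptions for illustration. Note the nested loop re-predicts every prefix from scratch, which is what makes this approach expensive compared to caching the error during the build.

```scala
// Hypothetical prefix evaluation: for each k, rebuild the prediction of
// the first k trees and compute MSE over the hold-out set.
def mseByNumTrees(trees: Seq[Double => Double],
                  treeWeights: Seq[Double],
                  holdout: Seq[(Double, Double)]): Seq[Double] =  // (feature, label)
  (1 to trees.length).map { k =>
    val squaredErrors = holdout.map { case (x, y) =>
      // O(k) work per point, repeated for every k: ~O(N^2) tree
      // evaluations overall for an N-tree ensemble.
      val prediction = (0 until k).map(t => treeWeights(t) * trees(t)(x)).sum
      val d = prediction - y
      d * d
    }
    squaredErrors.sum / squaredErrors.length
  }
```

The index of the minimum of the returned sequence then gives the sub-model size with the best hold-out error, which is exactly the plot described in the comment.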