[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326590#comment-14326590
 ] 

Apache Spark commented on SPARK-5436:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/4677

 Validate GradientBoostedTrees during training
 -

 Key: SPARK-5436
 URL: https://issues.apache.org/jira/browse/SPARK-5436
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 For Gradient Boosting, it would be valuable to compute test error on a 
 separate validation set during training.  That way, training could stop early 
 based on the test error (or some other metric specified by the user).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-18 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14325829#comment-14325829
 ] 

Manoj Kumar commented on SPARK-5436:


The idea sounds great. I shall come up with a Pull Request in a day or two.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323254#comment-14323254
 ] 

Chris T commented on SPARK-5436:


That sounds like a good idea to me, with the caveat that if the 
convergenceTolerance was set to 0, then the algorithm runs until the full 
number of boosting iterations has been reached. This way users could iterate 
until convergence, or just build a model with N trees. Both seem like 
reasonable use-cases.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323248#comment-14323248
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

I think it would be nice to have a stopping criterion, but it should be more 
like a convergence tolerance than a target error rate (since that can't be 
known a priori, as [~ChrisT] said).  The test error of each iteration's model 
should be compared with the error from the previous iteration.  If it ever 
decreases by less than convergenceTol, then we stop.  I'd vote for 0 or 
something small like 1e-5 for a default value.  How does that sound?
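
The stopping rule being proposed could be sketched as follows (plain Python standing in for the MLlib internals; `train_one_tree` and `validation_error` are hypothetical stand-ins for fitting the next tree and scoring the current ensemble):

```python
def boost_with_early_stopping(max_iter, tol, train_one_tree, validation_error):
    """Boost until the validation error improves by less than tol
    between consecutive iterations, or max_iter trees are built."""
    errors = []
    for i in range(max_iter):
        train_one_tree(i)
        errors.append(validation_error(i))
        # Stop once the improvement over the previous iteration drops below tol.
        if len(errors) >= 2 and errors[-2] - errors[-1] < tol:
            break
    return errors
```

Note that with `tol == 0` the loop stops early only if the validation error actually gets worse, so a model whose error keeps (even marginally) improving runs the full number of iterations, matching the caveat raised above.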





[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323335#comment-14323335
 ] 

Chris T commented on SPARK-5436:


Aha, that's a neat solution. I like it! 




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323269#comment-14323269
 ] 

Chris T commented on SPARK-5436:


I think we need to allow the use-case where the user specifies the number of 
iterations to run, and doesn't care about whether the model is overfitting. How 
would this be implemented?




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323282#comment-14323282
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

If they call train/fit with only a training RDD, then it will not check for 
overfitting.  We could provide a helper function for computing the error rate 
on a new dataset at each iteration in GradientBoostedTreesModel.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323317#comment-14323317
 ] 

Chris T commented on SPARK-5436:


There is already a predict method in the model object, so in principle this can 
already be achieved. Currently we are iteratively extracting sub-models (with 
one additional tree in the model per iteration), calling predict() on the 
sub-model, and calculating the error (in our case MSE for a regression model). 
I think the helper function you're proposing does just this, right?
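
The sub-model loop described here amounts to roughly the following (hypothetical names; `tree.predict` stands in for `DecisionTreeModel.predict`, with MSE as the metric):

```python
def mse_per_submodel(trees, weights, data):
    """Score every prefix ensemble from scratch: for each k, rebuild the
    k-tree sub-model and compute its MSE over the dataset. Simple, but
    the cost grows quadratically with the number of trees."""
    errors = []
    for k in range(1, len(trees) + 1):
        sq_sum = 0.0
        for features, label in data:
            # Ensemble prediction of the first k trees, weighted.
            pred = sum(w * t.predict(features)
                       for t, w in zip(trees[:k], weights[:k]))
            sq_sum += (pred - label) ** 2
        errors.append(sq_sum / len(data))
    return errors
```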

It seemed to me that, since the error is calculated internally while the model 
is being built, it is essentially free to just store this number as the model 
builds. But fair enough if you don't want to add complexity to the API, or 
confusion over differing use cases. I don't have a good sense of how small the 
cost is to do the error calculation after the fact, but for large datasets it 
may be non-trivial.

In any case, I think some of this discussion is fairly academic. :)




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323293#comment-14323293
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

The cost of computing the error after training, rather than caching it during 
training, seems negligible (since tree training takes much longer than 
prediction).  I'd vote for keeping the API simple, rather than adding options 
which could be handled using the existing API.  If users find that prediction 
takes as long as training, then we should investigate.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323289#comment-14323289
 ] 

Chris T commented on SPARK-5436:


I thought about this too, but I think there are cases where a user might wish 
to build a model with N trees and examine the error rate after the fact: for 
example, if we were worried about finding global vs. local minima, wanted to 
assess the rate at which a model starts to overfit, or wanted to do some kind 
of testing.

There are valid reasons to want both a specified number of trees and to have 
the model scored independently against a testData RDD during the build phase. 
It seems both of these cases could easily be supported concurrently.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323323#comment-14323323
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

Yep, that sounds like what I had in mind:
{code}
  // evaluator parameter undecided: maybe just reuse the training metric
  def evaluateEachIteration(data: RDD[LabeledPoint], evaluator: ???): Array[Double]
{code}
where it essentially calls predict() once but keeps the intermediate results 
after each boosting stage, so that it runs in the same big-O time as predict().
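
The trick being described, keeping a running per-example prediction and folding in one tree per stage, could be sketched like this (hypothetical names again, with a single-tree `predict` and MSE as the metric):

```python
def evaluate_each_iteration(trees, weights, data):
    """Score all prefix ensembles in one pass: maintain a running
    prediction for each example and update it with one tree per stage,
    so the total work matches a single full predict() over the data."""
    running = [0.0] * len(data)
    errors = []
    for tree, w in zip(trees, weights):
        sq_sum = 0.0
        for j, (features, label) in enumerate(data):
            # Fold this stage's tree into the running ensemble prediction.
            running[j] += w * tree.predict(features)
            sq_sum += (running[j] - label) ** 2
        errors.append(sq_sum / len(data))
    return errors
```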




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-12 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318915#comment-14318915
 ] 

Chris T commented on SPARK-5436:


I haven't been able to make headway on this. [~MechCoder], I suggest you take 
this on. 




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-02-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021
 ] 

Manoj Kumar commented on SPARK-5436:


Hi, I would like to give this a go. [~ChrisT] are you still working on this? 
Otherwise I would love to carry this forward.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-01-27 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294403#comment-14294403
 ] 

Chris T commented on SPARK-5436:


I think, then, the only addition needed is to retain the mean loss on every 
iteration. This is computed and emitted to the log on each build iteration:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala#L179

The question then becomes where to store this error value. Is it a property of 
the tree or the model? For a DecisionTree, I can see how the concept of the 
error applies. For a random forest, since each tree is independent of the 
others, that also makes sense. But for a GBT model, the model for N trees is 
dependent on the model with N-1 trees, so if I extract the Nth tree and request 
the error value, I have to be aware that this is not the error for this tree 
alone. I suspect this is fine; anyone building a GBT model would likely 
understand this. It's just a little weird to store, on one object, a property 
that depends on other objects in the ensemble.




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294523#comment-14294523
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

That sounds good.  I think the main challenge in this JIRA is specifying the 
API for passing 2 datasets to the algorithm instead of 1.  Basically, it will 
be good to make sure that other algorithms can follow a similar API.  Some 
possibilities are:
* Pass in a pair of RDDs, one for training and one for validation.
* Pass in 1 RDD, plus parameters for how to select a random subsample for 
validation.

I vote for the first option since it is more flexible than the second.
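
The flexibility argument can be made concrete: a two-dataset entry point subsumes the subsample option, since a single dataset can always be split before the call. A minimal sketch (plain Python lists standing in for RDDs; all names are hypothetical, not the actual MLlib API):

```python
import random

def run_with_validation(train_data, validation_data):
    """Hypothetical two-dataset entry point (option 1): the caller
    supplies the train/validation split explicitly."""
    return {"train": train_data, "validation": validation_data}

def run_with_random_split(data, validation_fraction, seed=0):
    """Option 2 expressed on top of option 1: carve out a random
    validation subsample, then delegate to the two-dataset API."""
    rng = random.Random(seed)
    train, validation = [], []
    for example in data:
        target = validation if rng.random() < validation_fraction else train
        target.append(example)
    return run_with_validation(train, validation)
```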

Another question is whether to pass in a separate validation metric.  I vote 
for not allowing this since the API could always be extended later on.

So...it sounds like a simple API but may get some discussion from other 
reviewers.

Would you be interested in working on this?




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-01-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294360#comment-14294360
 ] 

Joseph K. Bradley commented on SPARK-5436:
--

Yes, it would be reasonable to take the same Loss (metric) which GBT tries to 
minimize on the training set and re-use that Loss for validation.  (Eventually, 
we could let the user specify a different metric, but I vote for keeping it 
simple for now.)




[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training

2015-01-27 Thread Chris T (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14294335#comment-14294335
 ] 

Chris T commented on SPARK-5436:


The usual way that GBT models are evaluated is by calculating an error metric 
on a hold-out test/validation data set. The error metric is often something 
simple, like Mean Squared Error. When a plot of MSE vs Model NumTrees is made, 
we typically see something like this:
https://dynamicecology.files.wordpress.com/2013/08/calibvalid2.jpg

In the early stages, the model predictions improve. After the model passes the 
optimal number of trees, the predictions degrade, due to the model overfitting. 
At the moment, one way to obtain this information is to extract the trees 
from the model (GradientBoostedTreesModel.trees returns an Array of 
DecisionTreeModel) and iteratively recreate a sub-model, scoring the test data 
against each sub-model. This is fairly expensive. 

Is there a model error metric that is calculated internally (e.g. by the 
gradient descent algorithm)? If so, retaining it would add a lot of value. 
Ideally, the model error for each tree would be kept during the build phase; 
it would then be fairly trivial to create a sub-model that yields optimal 
predictions.
