Hi All,

I wonder if anyone has experience building Gradient Boosted Tree models in MLlib and can help me out. I'm trying to plot the test error rate of a Gradient Boosted Tree model as a function of the number of trees, in order to determine the optimal number of trees for the model.

Does Spark calculate (and store!) the error rate on each iteration of model building? Can I get at those values somehow?

Alternatively, having constructed a model, is it possible to score with only a fixed number of trees? For example, if I built a model with 1000 trees, can I score the data with only the first 100 trees? If I could do that in some way, I could calculate the needed quantities by hand.
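To make the "score with only the first N trees" idea concrete, here is a minimal sketch in plain Python rather than Spark. It assumes the ensemble's regression prediction is a weighted sum of the individual tree predictions. The `trees` here are toy stand-in functions and `predict_with_first_k` is a hypothetical helper, not an MLlib call. MLlib's `GradientBoostedTreesModel` appears to expose its `trees` and `treeWeights`, but treat that as an assumption to check against your Spark version.

```python
from typing import Callable, List, Sequence

# A "tree" is modelled as any function mapping a feature vector to a number.
Tree = Callable[[Sequence[float]], float]


def predict_with_first_k(
    trees: List[Tree],
    weights: List[float],
    features: Sequence[float],
    k: int,
) -> float:
    """Score one row using only the first k trees of a boosted ensemble.

    Assumes the ensemble prediction is sum(weight_t * tree_t(features)).
    """
    if not 0 <= k <= len(trees):
        raise ValueError("k must be between 0 and the number of trees")
    return sum(w * tree(features) for tree, w in zip(trees[:k], weights[:k]))


# Toy stand-ins for fitted regression trees (hypothetical, for illustration).
trees = [lambda x: 2.0, lambda x: x[0], lambda x: -1.0]
weights = [1.0, 0.5, 0.5]

row = [4.0]
full_score = predict_with_first_k(trees, weights, row, k=3)
truncated_score = predict_with_first_k(trees, weights, row, k=2)
```

For a classifier, the summed score would still need to be thresholded or otherwise mapped to a label, which this sketch leaves out.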
The optimal number of trees in a GBM is typically determined by calculating the mean squared error (MSE) on the test data at each iteration while building the model. The model is then considered "optimal" at the number of trees where the MSE is at its minimum. That is, in a plot of MSE vs. number of trees, the error will decrease (as the model improves), hit a minimum (the optimal point), and then increase (as the model starts to overfit the data).

Cheers,
Chris

Christopher Thom
Quantium
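The MSE-versus-trees curve described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Spark code: `tree_preds[t][i]` is assumed to hold the prediction of tree `t` on test row `i`, the ensemble prediction is assumed to be a weighted sum of tree predictions, and both function names are hypothetical.

```python
from typing import List


def mse_by_tree_count(
    tree_preds: List[List[float]],
    weights: List[float],
    labels: List[float],
) -> List[float]:
    """Return a list whose entry k-1 is the test MSE using the first k trees."""
    n = len(labels)
    running = [0.0] * n  # ensemble prediction per row, built up tree by tree
    curve = []
    for preds, w in zip(tree_preds, weights):
        running = [r + w * p for r, p in zip(running, preds)]
        curve.append(sum((r - y) ** 2 for r, y in zip(running, labels)) / n)
    return curve


def best_tree_count(curve: List[float]) -> int:
    """Number of trees (1-based) at which the error curve is lowest."""
    return min(range(len(curve)), key=curve.__getitem__) + 1


# Toy data: two test rows, three trees, all weights equal to 1.
labels = [1.0, 3.0]
tree_preds = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
weights = [1.0, 1.0, 1.0]

curve = mse_by_tree_count(tree_preds, weights, labels)
optimal = best_tree_count(curve)
```

On this toy data the curve falls and then rises again, so the minimum sits at an interior tree count rather than at the full ensemble. Keeping a running sum means each tree's predictions are only needed once, however many tree counts are evaluated.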