Hi All,

I wonder if anyone has experience building Gradient Boosted Tree models in
MLlib and can help me out. I'm trying to plot the test error of a Gradient
Boosted Tree model as a function of the number of trees, to determine the
optimal number of trees in the model. Does Spark calculate (and store!) the
error on each iteration of model building? Can I get at those values somehow?
Alternatively, having constructed a model, is it possible to score with only a
fixed number of trees? For example, if I built a model with 1000 trees, can I
score the data with only the first 100? I could calculate the needed
quantities by hand if I could do that in some way.
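
To make the "by hand" calculation concrete, here is a minimal sketch in plain Python (no Spark). It assumes a regression model whose prediction is the weighted sum of its trees' predictions, and that the fitted model exposes the individual trees and their weights (I believe MLlib's GradientBoostedTreesModel has `trees` and `treeWeights` for this, but please check against your Spark version). The lambdas, weights, and data below are toy stand-ins, not real fitted trees.

```python
from typing import Callable, List, Sequence


def staged_mse(
    trees: Sequence[Callable[[float], float]],
    weights: Sequence[float],
    features: Sequence[float],
    labels: Sequence[float],
) -> List[float]:
    """Return the test MSE after using the first 1, 2, ..., len(trees) trees.

    Assumes the ensemble prediction is sum(weight_k * tree_k(x)), so the
    prediction of the first n trees is just a running partial sum.
    """
    running = [0.0] * len(labels)  # partial-sum prediction for each test row
    curve = []
    for tree, weight in zip(trees, weights):
        # Add this tree's weighted contribution to every row's prediction.
        for i, x in enumerate(features):
            running[i] += weight * tree(x)
        # Mean squared error of the truncated ensemble on the test data.
        mse = sum((p - y) ** 2 for p, y in zip(running, labels)) / len(labels)
        curve.append(mse)
    return curve


# Toy stand-ins for fitted trees (hypothetical, for illustration only).
trees = [lambda x: 1.0, lambda x: x, lambda x: -x]
weights = [1.0, 0.5, 0.5]
features = [0.0, 2.0]
labels = [1.0, 2.0]

curve = staged_mse(trees, weights, features, labels)
print(curve)  # [0.5, 0.0, 0.5]
```

Each tree is evaluated only once per row, so the whole error curve costs about the same as scoring the full model once.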

The optimal number of trees in a GBM is typically determined by calculating the
mean squared error (MSE) on held-out data at each iteration when building the
model. The model is then considered "optimal" at the iteration where the MSE is
at its minimum, i.e. in a plot of MSE vs. number of trees, the error will
decrease (as the model improves), hit a minimum (the optimal point), and then
increase (as the model starts to overfit the data).
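
As a small illustration of picking that minimum, here is a sketch in plain Python. The function name `best_num_trees` is hypothetical and the MSE values are made up purely to show the decrease-then-increase shape; they are not results from any real model.

```python
from typing import Sequence, Tuple


def best_num_trees(mse_curve: Sequence[float]) -> Tuple[int, float]:
    """Return (number_of_trees, mse) at the minimum of the error curve.

    mse_curve[k] is the held-out MSE when using the first k + 1 trees.
    On ties, the smallest number of trees is returned.
    """
    if not mse_curve:
        raise ValueError("mse_curve must not be empty")
    best_index = min(range(len(mse_curve)), key=lambda k: mse_curve[k])
    return best_index + 1, mse_curve[best_index]


# Made-up curve: the error falls, bottoms out, then rises as the model overfits.
example_curve = [0.90, 0.55, 0.40, 0.38, 0.41, 0.47]
n_trees, min_mse = best_num_trees(example_curve)
print(n_trees, min_mse)  # 4 0.38
```

With a noisy curve, it may be worth smoothing it or requiring the error to rise for several consecutive iterations before declaring a minimum.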

cheers
chris
Christopher Thom
QUANTIUM