Two of the most important parameters of the gradient boosting classifier
are learn_rate and n_estimators. On how to set these, the
documentation states:
> [HTF2009] <http://scikit-learn.org/dev/modules/ensemble.html#htf2009>
> recommend to set the learning rate to a small constant (e.g.
> learn_rate <= 0.1) and choose n_estimators by early stopping
Although the documentation for this classifier is generally good
(thanks!), I didn't see how to perform this early stopping. The examples
do make it clear how I can first fit the classifier with a large
n_estimators value and then look back to see whether I would have
gotten better results with fewer trees. However, that's rather
inefficient: I'd like to stop fitting additional trees as soon as the
accuracy stops improving substantially.
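For reference, the look-back approach described above can be sketched roughly as follows. This is only a sketch under assumptions: it targets a recent scikit-learn release, where the parameter is spelled learning_rate rather than learn_rate, and the synthetic data from make_classification, the 200-tree budget, and the name best_n_trees are illustrative stand-ins rather than anything taken from the documentation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (assumption: the real X and y would go here).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Fit once with a deliberately generous number of trees.
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=200,
                                 subsample=0.5, random_state=0)
clf.fit(X_train, y_train)

# staged_predict yields the test predictions after 1, 2, ..., 200 trees,
# so the error curve can be inspected after the fact.
test_error = np.array([np.mean(y_pred != y_test)
                       for y_pred in clf.staged_predict(X_test)])

# Number of trees with the lowest test error (stage i uses i + 1 trees).
best_n_trees = int(np.argmin(test_error)) + 1
print(best_n_trees, test_error[best_n_trees - 1])
```

This shows where the error curve bottoms out, but all 200 trees have already been fitted by then, which is exactly the inefficiency in question.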
I suppose early stopping would look something like the following:
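
    from sklearn import ensemble

    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]
    n_additional_trees = [10, 90, 900, 9000, 90000]

    clf = ensemble.GradientBoostingClassifier(learn_rate=0.005,
        n_estimators=n_additional_trees.pop(0), subsample=0.5)
    clf.fit(X_train, y_train)

    # Assumption: loss_ is evaluated on the raw decision-function scores.
    y_pred = clf.decision_function(X_test)
    previous_error = 1.0
    current_error = clf.loss_(y_test, y_pred)
    # Keep adding trees while the loss still drops by more than 0.01
    # and there are breakpoints left to try.
    while n_additional_trees and (previous_error - current_error) > 0.01:
        previous_error = current_error
        for additional_tree in range(n_additional_trees.pop(0)):
            clf.fit_stage(UNDOCUMENTED_MYSTERY_PARAMS)
        y_pred = clf.decision_function(X_test)
        current_error = clf.loss_(y_test, y_pred)
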
What I want the above code to do is run the boosting classifier first
with 10 trees, then with 100, 1000, 10000, and 100000 trees in total,
stopping at any of these breakpoints if the test loss has improved by
less than 0.01 since the previous one. Is this code basically correct,
i.e., is this what is meant by "early stopping"?
One problem is that the parameters for the fit_stage method are not
documented (here I just used the placeholder
UNDOCUMENTED_MYSTERY_PARAMS). I'll have a closer look at the source
code to try to figure out what belongs here, but ideally this method
would have better documentation.
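
For comparison, the same stepwise scheme might be written without calling fit_stage at all, using the warm_start option that later scikit-learn releases provide on GradientBoostingClassifier. This is a hedged sketch, not the documented recipe: it assumes such a release (with learning_rate instead of learn_rate), uses sklearn.metrics.log_loss as a stand-in for clf.loss_, runs on synthetic data, and shortens the schedule to 10, 100, and 1000 trees. The names tree_counts, min_improvement, and history are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic stand-in data (assumption: the real X and y would go here).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

tree_counts = [10, 100, 1000]   # total number of trees at each breakpoint
min_improvement = 0.01          # required drop in test loss to continue

# warm_start=True makes each fit() call keep the trees already built
# and only add the ones needed to reach the new n_estimators.
clf = GradientBoostingClassifier(learning_rate=0.005, subsample=0.5,
                                 warm_start=True, random_state=0)

previous_loss = float("inf")
history = []
for n_trees in tree_counts:
    clf.set_params(n_estimators=n_trees)
    clf.fit(X_train, y_train)
    current_loss = log_loss(y_test, clf.predict_proba(X_test))
    history.append((n_trees, current_loss))
    # Stop at this breakpoint if the loss has not improved enough.
    if previous_loss - current_loss < min_improvement:
        break
    previous_loss = current_loss

print(history)
```

Because the first comparison is against an infinite loss, at least two breakpoints are always evaluated before the loop can stop.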