Re: [Scikit-learn-general] Unpredictability of GradientBoosting

Arnaud Joly Tue, 16 Sep 2014 05:39:11 -0700

While a decision tree is grown, the best split at each node is searched among
a random subset of max_features features drawn from all the features.
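
To make the sampling step concrete, here is a rough sketch of drawing such a
subset for a few successive splits. It is only an illustration written with
NumPy, not scikit-learn's actual splitter code. The helper name
sample_candidate_features is hypothetical, the values 30 and 6 come from this
thread, and the seed 0 is arbitrary.

```python
import numpy as np


def sample_candidate_features(n_features, max_features, rng):
    """Hypothetical helper: indices of the features examined at one split."""
    # Draw max_features distinct feature indices; sorting is for readability only.
    return np.sort(rng.choice(n_features, size=max_features, replace=False))


# Assumed values: 30 features and max_features=6 as in this thread, seed 0.
rng = np.random.RandomState(0)
subsets = [sample_candidate_features(30, 6, rng) for _ in range(3)]

for node, subset in enumerate(subsets):
    print("split", node, "examines features", subset)
```

Each split only sees 6 of the 30 features, so two runs that draw different
subsets can pick different splits and grow different trees.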


Setting random_state makes the same feature subsets be drawn on every run.
Note that if several candidate splits have the same score, ties are broken
randomly. Setting random_state also makes the tie-breaking deterministic.
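
A quick way to check this is to fit the same estimator twice with the same
seed and once with another seed. The snippet below is a minimal sketch on
synthetic data from make_classification (an assumption, standing in for the
Higgs data), with smaller n_estimators and max_depth than in the original
post so that it runs quickly. The helper fit_proba is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real data: 30 features, as in the thread.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           random_state=0)


def fit_proba(seed):
    """Hypothetical helper: fit with a given seed, return P(class 1) on X."""
    est = GradientBoostingClassifier(n_estimators=20, max_depth=3,
                                     max_features=6, random_state=seed)
    return est.fit(X, y).predict_proba(X)[:, 1]


# Same seed twice: the two models should give identical predictions.
same_seed_equal = np.array_equal(fit_proba(0), fit_proba(0))
# Different seeds: different feature subsets are drawn, so predictions can differ.
different_seed_gap = np.abs(fit_proba(0) - fit_proba(1)).max()

print("same seed, identical predictions:", same_seed_equal)
print("largest gap between seeds 0 and 1:", different_seed_gap)
```

With max_features=None every feature is examined at every split, so the only
randomness left in this setup should be the tie-breaking mentioned above.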

Best regards,
Arnaud


On 16 Sep 2014, at 13:03, Debanjan Bhattacharyya <[email protected]> wrote:

> Agreed, Gilles.
> 
> Which is why I later changed to max_features=None. But 6 seemed a good value:
> sqrt(36) ~= sqrt(30), and we had 30 features.
> Generally speaking, with 100 estimators and 30 features (this is from
> previous experience and also the auto setting on your GBC), 6 should be a
> good start.
> But evidently it was not.
> 
> I am interested to know how exactly max_features, or random_state, affects
> the way Gradient Boosting works.
> Could you please write a couple of lines? I would be grateful.
> 
> Regards
> Deb
> 
> On Tue, Sep 16, 2014 at 4:20 PM, Gilles Louppe <[email protected]> wrote:
> Hi Deb,
> 
> In your case, randomness comes from the max_features=6 setting, which
> makes the model not very stable from one execution to another, since
> the original dataset includes about 5x more input variables.
> 
> Gilles
> 
> On 16 September 2014 12:40, Debanjan Bhattacharyya <[email protected]> 
> wrote:
> > Thanks Arnaud
> >
> > random_state is not listed as a parameter on
> > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
> > page.
> > But it is listed as an argument in the constructor. It's probably my fault
> > that I did not notice it as a passable parameter. Maybe the documentation
> > can be changed.
> >
> > In hindsight, and as a general question: if I am training without
> > random_state, why and when would the boosted models vary so much? (I have
> > seen data sets where they don't.)
> > And what is the right approach to getting a stable CV estimate: not setting
> > random_state and averaging over several rounds of CV, or using different
> > random_states and averaging over several rounds of CV?
> >
> > What exactly goes on behind random_state in Gradient Boosting?
> >
> > Regards
> > Deb
> >
> > On Tue, Sep 16, 2014 at 3:52 PM, Arnaud Joly <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >>
> >> To get a reproducible model, you have to set random_state.
> >>
> >> Best regards,
> >> Arnaud
> >>
> >>
> >> On 16 Sep 2014, at 12:08, Debanjan Bhattacharyya <[email protected]>
> >> wrote:
> >>
> >> Hi, I recently participated in the Atlas (Higgs Boson Machine Learning
> >> Challenge).
> >>
> >> One of the models I tried was GradientBoostingClassifier. I found it
> >> extremely non-deterministic.
> >> If I use
> >>
> >> from sklearn.ensemble import GradientBoostingClassifier
> >> est = GradientBoostingClassifier(n_estimators=100, max_depth=10,
> >>     min_samples_leaf=20, max_features=6, verbose=1)
> >>
> >> and train several times on the same (full) training set, I end up with
> >> models that differ significantly in size (I mean the pickle output) and
> >> predict differently on the same instance. The difference is on the scale of
> >> 20 to 30% (I have seen values varying between 0.7x and 0.4x) on the same
> >> instance. Even the ordering of the top 20 features (out of 30) differs
> >> quite significantly from model to model.
> >>
> >> Can someone tell me a bit more about this uncertainty?
> >>
> >> The train data set can be downloaded from here
> >> https://www.kaggle.com/c/higgs-boson/data
> >>
> >>
> >> Thanks
> >>
> >> Regards
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> Want excitement?
> >> Manually upgrade your production database.
> >> When you want reliability, choose Perforce.
> >> Perforce version control. Predictably reliable.
> >>
> >> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >>
> >
> 

