Agreed, Gilles.

That is why I later changed to max_features=None. Still, 6 seemed like a
good value: sqrt(36) ~= sqrt(30), and we had 30 features.
Generally speaking, if I have 100 estimators (this is from previous
experience and is also the auto setting on your GBC) and 30 features, 6
should be a good start.
But evidently it was not.
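
As a quick check of the arithmetic behind that choice, here is a small sketch. The variable names are only for illustration; the numbers are the 30 features and max_features=6 mentioned above.

```python
import math

n_features = 30      # number of features we had
max_features = 6     # the value I passed to GradientBoostingClassifier

root = math.sqrt(n_features)            # square-root heuristic
fraction = max_features / n_features    # share of features examined per split

print(f"sqrt({n_features}) = {root:.2f}")
print(f"max_features={max_features} examines {fraction:.0%} of the features at each split")
```

So 6 is slightly above sqrt(30), and each split only considers one fifth of the features, which matches the "about 5x more input variables" you mention.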

I am interested to know how exactly max_features, or random_state,
affects the way Gradient Boosting is implemented.
Could you please write a couple of lines? I would be grateful.
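
To make the question concrete, here is a minimal sketch of the behaviour I am asking about. It is an illustration under assumptions, not my actual experiment: the data comes from make_classification rather than the Higgs set, the model is much smaller than the one I trained, and the helper name fit_and_predict and the seeds are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the real data (assumption: 30 features, as in this thread).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=10, random_state=0
)

def fit_and_predict(seed):
    """Fit a small model and return class-1 probabilities on the training set."""
    est = GradientBoostingClassifier(
        n_estimators=20, max_depth=3, max_features=6, random_state=seed
    )
    est.fit(X, y)
    return est.predict_proba(X)[:, 1]

same_a = fit_and_predict(seed=0)
same_b = fit_and_predict(seed=0)
other = fit_and_predict(seed=1)

# Same seed: the two runs should agree.
print("same seed, max abs difference:", np.abs(same_a - same_b).max())
# Different seed: the features considered at each split can differ,
# so the predictions are free to move.
print("different seed, max abs difference:", np.abs(same_a - other).max())
```

With random_state fixed the two fits agree, as Arnaud said. What I would like to understand is what happens inside the estimator when the seed changes.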

Regards
Deb

On Tue, Sep 16, 2014 at 4:20 PM, Gilles Louppe <[email protected]> wrote:

> Hi Deb,
>
> In your case, randomness comes from the max_features=6 setting, which
> makes the model not very stable from one execution to another, since
> the original dataset includes about 5x more input variables.
>
> Gilles
>
> On 16 September 2014 12:40, Debanjan Bhattacharyya <[email protected]>
> wrote:
> > Thanks Arnaud
> >
> > random_state is not listed as a parameter on the
> > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
> > page, but it is listed as an argument in the constructor. It is probably
> > my fault that I did not notice it as a passable parameter. Maybe the
> > documentation can be changed.
> >
> > In hindsight, and as a generic approach: if I am training without
> > random_state, why and when would the boosted models vary so much? (I have
> > seen data sets where they don't.)
> > And what is the right approach to getting a stable CV? Not setting
> > random_state, doing several rounds of CV and averaging them? Or using
> > different random_states, doing several rounds of CV and averaging them?
> >
> > What exactly goes on behind random_state in Gradient Boosting?
> >
> > Regards
> > Deb
> >
> > On Tue, Sep 16, 2014 at 3:52 PM, Arnaud Joly <[email protected]> wrote:
> >>
> >> Hi,
> >>
> >>
> >> To get a reproducible model, you have to set the random_state.
> >>
> >> Best regards,
> >> Arnaud
> >>
> >>
> >> On 16 Sep 2014, at 12:08, Debanjan Bhattacharyya <[email protected]>
> >> wrote:
> >>
> >> Hi, I recently participated in the ATLAS Higgs Boson Machine Learning
> >> Challenge.
> >>
> >> One of the models I tried was GradientBoostingClassifier. I found it
> >> extremely non-deterministic.
> >> So if I use
> >>
> >> est = GradientBoostingClassifier(n_estimators=100,
> >> max_depth=10,min_samples_leaf=20,max_features=6,verbose=1)
> >>
> >> and train several times on the same (full) training set, I end up with
> >> models (significantly different in size, I mean the pickle output) that
> >> predict differently on the same instance. The difference is on the
> >> scale of 20 to 30% (I have seen values varying between 0.7x and 0.4x) on
> >> the same instance. Even the ordering of the top 20 features (out of 30)
> >> differs quite significantly from model to model.
> >>
> >> Can someone tell me a bit more about this uncertainty?
> >>
> >> The train data set can be downloaded from here
> >> https://www.kaggle.com/c/higgs-boson/data
> >>
> >>
> >> Thanks
> >>
> >> Regards
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> Want excitement?
> >> Manually upgrade your production database.
> >> When you want reliability, choose Perforce.
> >> Perforce version control. Predictably reliable.
> >>
> >>
> >> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
> >> _______________________________________________
> >> Scikit-learn-general mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general