If you set the random_state and use the same parameters, you are expected to
get exactly the same model. To be concrete, if you do
est_1 = GradientBoostingClassifier(random_state=0)
est_1.fit(X, y)
est_2 = GradientBoostingClassifier(random_state=0)
est_2.fit(X, y)
est_3 = GradientBoostingClassifier(random_state=1)
est_3.fit(X, y)
then models est_1 and est_2 will be identical, while models est_1 and est_3
will differ (despite being learnt with the same parameters). You should still
expect similar performance, up to the bias and variance of the model induced by
the learning set and the algorithm randomisation. To learn more, you might want
to check this example:
http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#example-ensemble-plot-bias-variance-py
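The point above can be checked directly. A minimal runnable sketch (the
synthetic dataset from make_classification is an assumption here; substitute
your own X, y):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data stands in for your own X, y
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

est_1 = GradientBoostingClassifier(random_state=0).fit(X, y)
est_2 = GradientBoostingClassifier(random_state=0).fit(X, y)
est_3 = GradientBoostingClassifier(random_state=1).fit(X, y)

# Same seed, same parameters -> bit-for-bit identical predictions
assert np.array_equal(est_1.predict_proba(X), est_2.predict_proba(X))

# Different seed -> potentially a different model, but similar performance
print(est_1.score(X, y), est_3.score(X, y))
```

With the default max_features (all features), the three models will often
coincide anyway; the seed matters most when max_features subsamples the
features, as discussed below.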
Best regards,
Arnaud
On 16 Sep 2014, at 17:52, Debanjan Bhattacharyya <[email protected]> wrote:
> Thanks Arnaud
> Got it.
>
> Essentially what you are saying is:
> While training classifier A, imagine there was a tie at estimator 3 between
> two feature sets, e.g. S1 = [12,3,4,5,6] and S2 = [2,3,4,5,6,7], and S1 was
> chosen. While training classifier B, there was a tie again at estimator 3 on
> the same sets, and S2 was chosen.
> Now, while predicting on an instance with classifiers A and B, the feature
> values of S1 send the prediction path/value into a different split when
> compared to S2.
> If I understand that correctly, there should not be a problem if I train on
> some data and predict on the same data too. Right? In that case, if there is
> a tie while training, the same tie should be there while predicting. But I
> have seen that problem too. Or did I understand it incorrectly?
>
> Also, I guess the impact of this random_state will depend on whether I have a
> small max_features or a large one, right?
>
> Regards
> Deb
>
> On Tue, Sep 16, 2014 at 6:07 PM, Arnaud Joly <[email protected]> wrote:
> During the growth of the decision tree, the best split is searched for within
> a subset of max_features features sampled from among all features.
>
> Setting the random_state allows you to draw the same subsets of features each
> time.
> Note that if several candidate splits have the same score, ties are broken
> randomly. Setting the random_state makes the tie-breaking deterministic.
>
> Best regards,
> Arnaud
>
>
>
> On 16 Sep 2014, at 13:03, Debanjan Bhattacharyya <[email protected]> wrote:
>
>> Agreed, Gilles.
>>
>> Which is why I later changed to max_features=None, but 6 seemed a good
>> value: 6 = sqrt(36) ~= sqrt(30), and we had 30 features.
>> Generally speaking, if I have 100 estimators (this is from previous
>> experience and also the auto setting on your GBC) and 30 features, 6 should
>> be a good start.
>> But evidently it was not.
>>
>> I am interested to know exactly how max_features, or the random_state,
>> affects the way Gradient Boosting is implemented.
>> Could you please write a couple of lines? I would be grateful.
>>
>> Regards
>> Deb
>>
>> On Tue, Sep 16, 2014 at 4:20 PM, Gilles Louppe <[email protected]> wrote:
>> Hi Deb,
>>
>> In your case, randomness comes from the max_features=6 setting, which
>> makes the model not very stable from one execution to another, since
>> the original dataset includes about 5x more input variables.
>>
>> Gilles
>>
>> On 16 September 2014 12:40, Debanjan Bhattacharyya <[email protected]>
>> wrote:
>> > Thanks Arnaud
>> >
>> > random_state is not listed as a parameter on the
>> > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
>> > page, but it is listed as an argument in the constructor. It's probably my
>> > fault that I did not notice it as a passable parameter, but maybe the
>> > documentation could be changed.
>> >
>> > In hindsight, and as a generic approach, if I am training without
>> > random_state, why and when would the boosted models vary highly? (I have
>> > seen data sets where they don't.)
>> > And what would be the right approach to getting a stable CV? Not using
>> > random_state and doing several rounds of CV and averaging them, or using
>> > different random_states and doing several rounds of CV and averaging them?
>> >
>> > What exactly goes on behind random_state from a Gradient Boosting
>> > perspective?
>> >
>> > Regards
>> > Deb
>> >
>> > On Tue, Sep 16, 2014 at 3:52 PM, Arnaud Joly <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >>
>> >> To get a reproducible model, you have to set the random_state.
>> >>
>> >> Best regards,
>> >> Arnaud
>> >>
>> >>
>> >> On 16 Sep 2014, at 12:08, Debanjan Bhattacharyya <[email protected]>
>> >> wrote:
>> >>
>> >> Hi, I recently participated in the ATLAS Higgs Boson Machine Learning
>> >> Challenge.
>> >>
>> >> One of the models I tried was GradientBoostingClassifier. I found it
>> >> extremely non-deterministic.
>> >> So if I use
>> >>
>> >> est = GradientBoostingClassifier(n_estimators=100, max_depth=10,
>> >>                                  min_samples_leaf=20, max_features=6,
>> >>                                  verbose=1)
>> >>
>> >> and train several times on the same (full) training set, I end up with
>> >> models (significantly different in size, judging by the pickle output)
>> >> which predict differently on the same instance. The difference is on the
>> >> scale of 20 to 30% (so I have seen values varying between 0.7x and 0.4x)
>> >> on the same instance. Even the ordering of the top 20 features (out of
>> >> 30) differs quite significantly from model to model.
>> >>
>> >> Can someone tell me a bit more in detail about this uncertainty?
>> >>
>> >> The train data set can be downloaded from here
>> >> https://www.kaggle.com/c/higgs-boson/data
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Regards
>> >>
>> >>
>> >> ------------------------------------------------------------------------------
>> >> Want excitement?
>> >> Manually upgrade your production database.
>> >> When you want reliability, choose Perforce.
>> >> Perforce version control. Predictably reliable.
>> >> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>> >> _______________________________________________
>> >> Scikit-learn-general mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>
>
>