Thanks Arnaud
Got it.

Essentially, what you are saying is:

While training classifier A, imagine there was a tie at estimator 3 between
two feature sets, e.g. S1 = [1,2,3,4,5,6] and S2 = [2,3,4,5,6,7], and S1 was
chosen. While training classifier B, there was a tie again at estimator 3 on
the same sets, and S2 was chosen. Now, while predicting on an instance with
classifiers A and B, the feature values of S1 send the prediction path/value
into a different split than those of S2 do.

If I understand that correctly, there should not be a problem if I train on
some data and predict on that same data, right? In that case, if there is a
tie while training, the same tie should be there while predicting. But I
have seen that problem too. Or did I understand it incorrectly?
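To make the tie-breaking concrete, here is a minimal sketch in plain Python (in spirit only; this is not the actual scikit-learn implementation) of how fixing a seed makes a random tie-break reproducible:

```python
import random

def best_split(scores, seed=None):
    # Pick the index of the best score, breaking ties at random.
    # This mimics, in spirit only (not the scikit-learn internals),
    # how a tree learner with a fixed random_state breaks ties
    # reproducibly between equally good candidate splits.
    rng = random.Random(seed)
    best = max(scores)
    tied = [i for i, s in enumerate(scores) if s == best]
    return rng.choice(tied)

# Indices 1 and 2 tie for the best score.
scores = [0.8, 0.9, 0.9, 0.7]

# With a fixed seed the same tied candidate wins every time.
picks = {best_split(scores, seed=0) for _ in range(100)}
assert len(picks) == 1
```

With seed=None, different runs may pick different tied candidates, which is exactly the run-to-run variation described above.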

Also, I guess the effect of this random_state will depend on whether I have a
small max_features or a large one, right?
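On that point, a rough sketch (again plain Python, not the scikit-learn internals) of why a small max_features introduces run-to-run randomness that max_features = None does not:

```python
import random

def sample_features(n_features, max_features, rng):
    # Mimic the per-split feature subsampling a tree performs
    # when max_features < n_features.
    return sorted(rng.sample(range(n_features), max_features))

run_a, run_b = random.Random(1), random.Random(2)

# With max_features=6 out of 30, two differently seeded runs will
# almost surely consider different candidate features at a split ...
assert sample_features(30, 6, run_a) != sample_features(30, 6, run_b)

# ... while max_features=None (i.e. all 30) gives every run the same
# full feature set, so only random tie-breaking remains.
assert sample_features(30, 30, run_a) == list(range(30))
```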

Regards
Deb

On Tue, Sep 16, 2014 at 6:07 PM, Arnaud Joly <[email protected]> wrote:

> During the growth of the decision tree, the best split is searched for in a
> subset of max_features features sampled from among all features.
>
> Setting the random_state makes it possible to draw the same subsets of
> features each time. Note that if several candidate splits have the same
> score, ties are broken randomly. Setting the random_state also makes the
> tie-breaking deterministic.
>
> Best regards,
> Arnaud
>
>
>
> On 16 Sep 2014, at 13:03, Debanjan Bhattacharyya <[email protected]>
> wrote:
>
> Agreed, Gilles.
>
> Which is why I later changed to max_features = None. But 6 seemed a good
> value: we had 30 features, and sqrt(30) ~= sqrt(36) = 6.
> Generally speaking, with 100 estimators (this is from previous experience,
> and also the auto setting on your GBC) and 30 features, 6 should be a good
> starting point. But evidently it was not.
>
> I am interested to know exactly how max_features, or the random_state,
> affects the way Gradient Boosting is implemented.
> Could you please write a couple of lines about it? I would be grateful.
>
> Regards
> Deb
>
> On Tue, Sep 16, 2014 at 4:20 PM, Gilles Louppe <[email protected]> wrote:
>
>> Hi Deb,
>>
>> In your case, randomness comes from the max_features=6 setting, which
>> makes the model not very stable from one execution to another, since
>> the original dataset includes about 5x more input variables.
>>
>> Gilles
>>
>> On 16 September 2014 12:40, Debanjan Bhattacharyya <[email protected]>
>> wrote:
>> > Thanks Arnaud
>> >
>> > random_state is not listed as a parameter on the
>> > http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
>> > page, but it is listed as an argument in the constructor. It's probably
>> > my fault that I did not notice it as a passable parameter, but maybe
>> > the documentation could be changed.
>> >
>> > In hindsight, and as a general approach: if I am training without a
>> > random_state, why and when would the boosted models vary so much? (I
>> > have seen data sets where they don't.)
>> > And what would be the right approach to getting a stable CV? Not using a
>> > random_state and averaging several rounds of CV, or using different
>> > random_states and averaging several rounds of CV?
>> >
>> > What exactly goes on behind random_state in the Gradient Boosting
>> > implementation?
>> >
>> > Regards
>> > Deb
>> >
>> > On Tue, Sep 16, 2014 at 3:52 PM, Arnaud Joly <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >>
>> >> To get a reproducible model, you have to set the random_state.
>> >>
>> >> Best regards,
>> >> Arnaud
>> >>
>> >>
>> >> On 16 Sep 2014, at 12:08, Debanjan Bhattacharyya
>> >> <[email protected]> wrote:
>> >>
>> >> Hi, I recently participated in the ATLAS (Higgs Boson Machine
>> >> Learning) Challenge.
>> >>
>> >> One of the models I tried was GradientBoostingClassifier. I found it
>> >> extremely non-deterministic. So if I use
>> >>
>> >> est = GradientBoostingClassifier(n_estimators=100, max_depth=10,
>> >>                                  min_samples_leaf=20, max_features=6,
>> >>                                  verbose=1)
>> >>
>> >> and train several times on the same (full) training set, I end up with
>> >> models (significantly different in size, judging by the pickle output)
>> >> which predict differently on the same instance. The difference is on
>> >> the scale of 20 to 30% (so I have seen values varying between 0.7x and
>> >> 0.4x) on the same instance. Even the ordering of the top 20 features
>> >> (out of 30) differs quite significantly from model to model.
>> >>
>> >> Can someone tell me a bit more in detail about this non-determinism?
>> >>
>> >> The train data set can be downloaded from here
>> >> https://www.kaggle.com/c/higgs-boson/data
>> >>
>> >>
>> >> Thanks
>> >>
>> >> Regards
>> >>
>> >>
>> >>
>> ------------------------------------------------------------------------------
>> >> Want excitement?
>> >> Manually upgrade your production database.
>> >> When you want reliability, choose Perforce.
>> >> Perforce version control. Predictably reliable.
>> >>
>> >>
>> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
>> >> _______________________________________________
>> >> Scikit-learn-general mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>
>> >
>>
>
>