Yeah, you'll definitely have duplications; that’s why (if you have an
infinitely large n) only 0.632 * n of the samples are unique ;). Say your
dataset is

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

(where the numbers represent the indices of your data points); then a
bootstrap sample could be

[9, 1, 1, 0, 4, 4, 5, 7, 9, 9]

and your left-out (out-of-bag) sample is consequently [2, 3, 6, 8].

> On Oct 3, 2016, at 3:36 PM, Ibrahim Dalal via scikit-learn
> <scikit-learn@python.org> wrote:
>
> Hi,
>
> That helped a lot. Thank you very much. I have one more (silly?) doubt
> though.
>
> Won't an n-sized bootstrapped sample have repeated entries? Say we have an
> original dataset of size 100. A bootstrap sample (say, B) of size 100 is
> drawn from this set. Since about 37 of the original samples are left out
> (theoretically at least), some of the samples in B must be repeated?
>
> On Tue, Oct 4, 2016 at 12:50 AM, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
>
> Or maybe more intuitively, you can visualize this asymptotic behavior,
> e.g., via
>
> import matplotlib.pyplot as plt
>
> ns = list(range(5, 201, 5))
> vs = [1. - (1. - 1. / n)**n for n in ns]
>
> plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
> plt.xlabel('n')
> plt.ylabel('1 - (1 - 1/n)^n')
> plt.xlim([0, 210])
> plt.show()
>
> > On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.rasc...@gmail.com>
> > wrote:
> >
> > Say the probability that a given sample from a dataset of size n is *not*
> > drawn as a bootstrap sample is
> >
> > P(not_chosen) = (1 - 1/n)^n,
> >
> > since you have a 1/n chance of drawing a particular sample (bootstrapping
> > involves drawing with replacement), and you repeat that draw n times to
> > get an n-sized bootstrap sample.
> >
> > This is asymptotically 1/e approx. 0.368 (i.e., for very, very large n).
> >
> > Then, you can compute the probability of a sample being chosen as
> >
> > P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
> >
> > Best,
> > Sebastian
> >
> >> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn
> >> <scikit-learn@python.org> wrote:
> >>
> >> Hi,
> >>
> >> Thank you for the reply. Please bear with me for a while.
> >>
> >> Where did this number, 0.632, come from? I have no background in
> >> statistics (which appears to be the case here!). Or let me rephrase my
> >> query: what is this bootstrap sampling all about? I searched the web but
> >> didn't get satisfactory results.
> >>
> >> Thanks
> >>
> >> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka
> >> <se.rasc...@gmail.com> wrote:
> >>
> >>> From whatever little knowledge I gained last night about random
> >>> forests, each tree is trained with a sub-sample of the original dataset
> >>> (usually with replacement)?
> >>
> >> Yes, that should be correct!
> >>
> >>> Now, what I am not able to understand is: if the entire dataset is used
> >>> to train each of the trees, then how does the classifier estimate the
> >>> OOB error? None of the entries of the dataset is out-of-bag for any of
> >>> the trees. (Pardon me if all this sounds BS.)
> >>
> >> If you take an n-sized bootstrap sample, where n is the number of
> >> samples in your dataset, you asymptotically have 0.632 * n unique
> >> samples in your bootstrap set. In other words, 0.368 * n samples are not
> >> used for growing the respective tree (and are available to compute the
> >> OOB estimate). As far as I understand, the random forest OOB score is
> >> then computed as the average OOB of each tree (correct me if I am
> >> wrong!).
> >>
> >> Best,
> >> Sebastian
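To make the 0.632 / 0.368 split concrete, here is a minimal NumPy sketch
(NumPy, the seed, and n = 100 are arbitrary choices for illustration) that
draws one n-sized bootstrap sample from the indices 0..n-1, derives the
out-of-bag indices, and checks that roughly 0.632 * n of the drawn entries
are unique:

import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the run is reproducible

n = 100
indices = np.arange(n)

# An n-sized bootstrap sample: n draws *with replacement*.
boot = rng.choice(indices, size=n, replace=True)

# Out-of-bag (OOB) indices: the points never drawn for this sample.
oob = np.setdiff1d(indices, boot)

print('unique in bootstrap sample:', np.unique(boot).size)  # ~63 for n=100
print('left out (OOB):', oob.size)                          # ~37 for n=100
print('unique fraction:', np.unique(boot).size / float(n))  # ~0.632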
> >>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn
> >>> <scikit-learn@python.org> wrote:
> >>>
> >>> Dear Developers,
> >>>
> >>> From whatever little knowledge I gained last night about random
> >>> forests, each tree is trained with a sub-sample of the original dataset
> >>> (usually with replacement)?
> >>>
> >>> (Note: Please do correct me if I am not making any sense.)
> >>>
> >>> RandomForestClassifier has an option 'bootstrap'. The API states the
> >>> following:
> >>>
> >>> "The sub-sample size is always the same as the original input sample
> >>> size but the samples are drawn with replacement if bootstrap=True
> >>> (default)."
> >>>
> >>> Now, what I am not able to understand is: if the entire dataset is used
> >>> to train each of the trees, then how does the classifier estimate the
> >>> OOB error? None of the entries of the dataset is out-of-bag for any of
> >>> the trees. (Pardon me if all this sounds BS.)
> >>>
> >>> Help this mere mortal.
> >>>
> >>> Thanks
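To tie this back to the original question: scikit-learn exposes this
estimate directly through RandomForestClassifier's oob_score option. A short
sketch, using a make_classification toy dataset purely for illustration (the
parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative toy dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True (the default) draws an n-sized sample with replacement for
# each tree; oob_score=True then scores each training sample using only the
# trees that did not see it during fitting.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                oob_score=True, random_state=0)
forest.fit(X, y)

print('OOB accuracy estimate:', forest.oob_score_)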