The probability that a given sample from a dataset of size n is *not* drawn 
at all when forming an n-sized bootstrap sample is

P(not_chosen) = (1 - 1/n)^n

Each single draw picks that particular sample with probability 1/n (bootstrapping 
draws with replacement), so one draw misses it with probability 1 - 1/n; repeating 
the draw n times independently to build an n-sized bootstrap sample gives the 
(1 - 1/n)^n above.

For very large n, this converges to 1/e, i.e., approx. 0.368.
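
You can sanity-check the convergence numerically; here is a quick sketch in 
plain Python (the particular values of n are arbitrary):

    import math

    # (1 - 1/n)^n approaches 1/e as n grows
    for n in (10, 100, 1000, 100000):
        print(n, (1 - 1.0 / n) ** n)

    print("1/e =", math.exp(-1))  # approx. 0.36788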

Then, the probability that a given sample *is* chosen at least once is

P(chosen) = 1 - (1 - 1/n)^n approx. 0.632 (for large n)
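
The same numbers fall out if you simulate a bootstrap draw directly; a rough 
sketch using numpy (dataset size and seed are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    n = 100000
    # draw an n-sized bootstrap sample (with replacement) from indices 0..n-1
    bootstrap = rng.randint(0, n, size=n)
    in_bag = np.unique(bootstrap).size / float(n)
    print("in-bag (unique) fraction:", in_bag)   # approx. 0.632
    print("out-of-bag fraction:", 1.0 - in_bag)  # approx. 0.368

In scikit-learn you can get the OOB estimate directly by fitting, e.g., 
RandomForestClassifier(oob_score=True) and reading the fitted estimator's 
oob_score_ attribute.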

Best,
Sebastian

> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn 
> <scikit-learn@python.org> wrote:
> 
> Hi,
> 
> Thank you for the reply. Please bear with me for a while.
> 
> Where did this number, 0.632, come from? I have no background in statistics 
> (which appears to be the case here!). Or let me rephrase my query: what is 
> this bootstrap sampling all about? I searched the web but didn't find 
> satisfactory results.
> 
> 
> Thanks
> 
> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.rasc...@gmail.com> 
> wrote:
> > From whatever little knowledge I gained last night about Random Forests, 
> > each tree is trained with a sub-sample of the original dataset (usually 
> > drawn with replacement)?
> 
> Yes, that should be correct!
> 
> > Now, what I am not able to understand is - if the entire dataset is used to 
> > train each of the trees, then how does the classifier estimate the OOB 
> > error? None of the entries of the dataset is an OOB sample for any of the 
> > trees. (Pardon me if all this sounds BS)
> 
> If you take an n-sized bootstrap sample, where n is the number of samples in 
> your dataset, you have asymptotically 0.632 * n unique samples in your 
> bootstrap set. In other words, 0.368 * n samples are not used for growing 
> the respective tree (and are available to compute its OOB error). As far as I 
> understand, the random forest OOB score is then computed as the average OOB 
> of each tree (correct me if I am wrong!).
> 
> Best,
> Sebastian
> 
> > On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn 
> > <scikit-learn@python.org> wrote:
> >
> > Dear Developers,
> >
> > From whatever little knowledge I gained last night about Random Forests, 
> > each tree is trained with a sub-sample of the original dataset (usually 
> > drawn with replacement)?
> >
> > (Note: Please do correct me if I am not making any sense.)
> >
> > RandomForestClassifier has an option of 'bootstrap'. The API states the 
> > following
> >
> > The sub-sample size is always the same as the original input sample size 
> > but the samples are drawn with replacement if bootstrap=True (default).
> >
> > Now, what I am not able to understand is - if the entire dataset is used to 
> > train each of the trees, then how does the classifier estimate the OOB 
> > error? None of the entries of the dataset is an OOB sample for any of the 
> > trees. (Pardon me if all this sounds BS)
> >
> > Help this mere mortal.
> >
> > Thanks

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
