The probability that a given sample from a dataset of size n is *not* drawn
in a bootstrap sample is

    P(not_chosen) = (1 - 1/n)^n

Each draw picks a particular sample with probability 1/n, and because
bootstrapping draws with replacement, that probability is the same on every
draw. You repeat the draw n times to get an n-sized bootstrap sample, so the
chance of missing the sample on all n draws is (1 - 1/n)^n.

Asymptotically (i.e., for very, very large n), this is 1/e, approx. 0.368.

Then, you can compute the probability of a sample being chosen as

    P(chosen) = 1 - (1 - 1/n)^n approx. 0.632

Best,
Sebastian

> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn
> <scikit-learn@python.org> wrote:
>
> Hi,
>
> Thank you for the reply. Please bear with me for a while.
>
> From where did this number, 0.632, come? I have no background in statistics
> (which appears to be the case here!). Or let me rephrase my query: what is
> this bootstrap sampling all about? Searched the web, but didn't get
> satisfactory results.
>
>
> Thanks
>
> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
> > From whatever little knowledge I gained last night about Random Forests,
> > each tree is trained with a sub-sample of original dataset (usually with
> > replacement)?.
>
> Yes, that should be correct!
>
> > Now, what I am not able to understand is - if entire dataset is used to
> > train each of the trees, then how does the classifier estimates the OOB
> > error? None of the entries of the dataset is an oob for any of the trees.
> > (Pardon me if all this sounds BS)
>
> If you take an n-size bootstrap sample, where n is the number of samples in
> your dataset, you have asymptotically 0.632 * n unique samples in your
> bootstrap set. Or in other words 0.368 * n samples are not used for growing
> the respective tree (to compute the OOB). As far as I understand, the random
> forest OOB score is then computed as the average OOB of each tree (correct me
> if I am wrong!).
>
> Best,
> Sebastian
>
> > On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn
> > <scikit-learn@python.org> wrote:
> >
> > Dear Developers,
> >
> > From whatever little knowledge I gained last night about Random Forests,
> > each tree is trained with a sub-sample of original dataset (usually with
> > replacement)?.
> >
> > (Note: Please do correct me if I am not making any sense.)
> >
> > RandomForestClassifier has an option of 'bootstrap'. The API states the
> > following
> >
> > The sub-sample size is always the same as the original input sample size
> > but the samples are drawn with replacement if bootstrap=True (default).
> >
> > Now, what I am not able to understand is - if entire dataset is used to
> > train each of the trees, then how does the classifier estimates the OOB
> > error? None of the entries of the dataset is an oob for any of the trees.
> > (Pardon me if all this sounds BS)
> >
> > Help this mere mortal.
> >
> > Thanks

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
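
P.S. For anyone who wants to check the 0.632 figure from this thread numerically, here is a minimal sketch using only the Python standard library. The helper names (prob_not_chosen, simulated_unique_fraction) are made up for illustration and are not part of scikit-learn; the simulation is just one plain bootstrap draw, not how RandomForestClassifier samples internally.

```python
import math
import random


def prob_not_chosen(n):
    """Probability that one particular sample is missed on all n draws."""
    return (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, seed=0):
    """Draw one n-sized bootstrap sample (with replacement) and return the
    fraction of the n original indices that appear in it at least once."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n


if __name__ == "__main__":
    # The analytic value approaches 1/e as n grows.
    for n in (10, 100, 10_000):
        p_out = prob_not_chosen(n)
        print(n, round(p_out, 4), round(1.0 - p_out, 4))
    print("1/e =", round(math.exp(-1), 4))
    # One simulated bootstrap draw; should land near 0.632 for large n.
    print("simulated unique fraction:",
          round(simulated_unique_fraction(100_000), 4))
```

The samples that never make it into the drawn set are the "out-of-bag" ones for that draw, which is why roughly 0.368 * n samples remain available for the OOB estimate of each tree.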