Or maybe more intuitively, you can visualize this asymptotic behavior, e.g., via:
import matplotlib.pyplot as plt

ns = list(range(5, 201, 5))
vs = [1 - (1. - 1. / n) ** n for n in ns]

plt.plot(ns, vs, marker='o', markersize=6, alpha=0.5)
plt.xlabel('n')
plt.ylabel('1 - (1 - 1/n)^n')
plt.xlim([0, 210])
plt.show()

> On Oct 3, 2016, at 3:15 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> Say the probability that a given sample from a dataset of size n is *not*
> drawn as a bootstrap sample is
>
> P(not_chosen) = (1 - 1/n)^n
>
> since you have a 1/n chance to draw a particular sample (bootstrapping
> involves drawing with replacement), and the draw is repeated n times to get
> an n-sized bootstrap sample.
>
> This is asymptotically 1/e approx. 0.368 (i.e., for very, very large n).
>
> Then, you can compute the probability of a sample being chosen as
>
> P(chosen) = 1 - (1 - 1/n)^n approx. 0.632
>
> Best,
> Sebastian
>
>> On Oct 3, 2016, at 3:05 PM, Ibrahim Dalal via scikit-learn
>> <scikit-learn@python.org> wrote:
>>
>> Hi,
>>
>> Thank you for the reply. Please bear with me for a while.
>>
>> Where did this number, 0.632, come from? I have no background in statistics
>> (which appears to be the case here!). Or let me rephrase my query: what is
>> this bootstrap sampling all about? I searched the web but didn't get
>> satisfactory results.
>>
>> Thanks
>>
>> On Tue, Oct 4, 2016 at 12:02 AM, Sebastian Raschka <se.rasc...@gmail.com>
>> wrote:
>>
>>> From whatever little knowledge I gained last night about Random Forests,
>>> each tree is trained with a sub-sample of the original dataset (usually
>>> with replacement)?
>>
>> Yes, that should be correct!
>>
>>> Now, what I am not able to understand is - if the entire dataset is used
>>> to train each of the trees, then how does the classifier estimate the OOB
>>> error? None of the entries of the dataset is an OOB sample for any of the
>>> trees.
>>> (Pardon me if all this sounds BS)
>>
>> If you take an n-sized bootstrap sample, where n is the number of samples
>> in your dataset, you have asymptotically 0.632 * n unique samples in your
>> bootstrap set. In other words, 0.368 * n samples are not used for growing
>> the respective tree (and these are used to compute the OOB estimate). As
>> far as I understand, the random forest OOB score is then computed as the
>> average OOB of each tree (correct me if I am wrong!).
>>
>> Best,
>> Sebastian
>>
>>> On Oct 3, 2016, at 2:25 PM, Ibrahim Dalal via scikit-learn
>>> <scikit-learn@python.org> wrote:
>>>
>>> Dear Developers,
>>>
>>> From whatever little knowledge I gained last night about Random Forests,
>>> each tree is trained with a sub-sample of the original dataset (usually
>>> with replacement)?
>>>
>>> (Note: Please do correct me if I am not making any sense.)
>>>
>>> RandomForestClassifier has an option 'bootstrap'. The API states the
>>> following:
>>>
>>> The sub-sample size is always the same as the original input sample size
>>> but the samples are drawn with replacement if bootstrap=True (default).
>>>
>>> Now, what I am not able to understand is - if the entire dataset is used
>>> to train each of the trees, then how does the classifier estimate the OOB
>>> error? None of the entries of the dataset is an OOB sample for any of the
>>> trees. (Pardon me if all this sounds BS)
>>>
>>> Help this mere mortal.
>>>
>>> Thanks

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
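As a quick numerical sanity check of the 0.632 figure discussed in the thread (this snippet is an illustration, not part of the original exchange), one can both evaluate 1 - (1 - 1/n)^n directly and simulate a bootstrap draw to count unique samples:

```python
import math
import random

def p_chosen(n):
    # Probability that a given sample appears at least once
    # in an n-sized bootstrap sample drawn with replacement.
    return 1 - (1 - 1 / n) ** n

# The asymptotic limit is 1 - 1/e ~= 0.6321.
limit = 1 - math.exp(-1)
for n in (10, 100, 1000, 100000):
    print(n, p_chosen(n))

# Empirical check: draw one bootstrap sample of size n and
# measure the fraction of distinct original indices it contains.
random.seed(0)
n = 100000
sample = [random.randrange(n) for _ in range(n)]
frac_unique = len(set(sample)) / n
print(frac_unique)  # close to 0.632 for large n
```

For n = 100000 the analytic value and the simulated fraction of unique samples both agree with 1 - 1/e to within a few parts in a thousand, matching the asymptotic argument above.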