Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?
Ohh, I can see my mistake now, after reviewing the concept of bootstrapping and sampling with replacement. I was assuming that the "replacement" happened only after finishing each tree (i.e. if I was sampling 2/3 of the data, the very same data could be selected again for another tree, but no element would be repeated within a given tree). My apologies. Everything makes sense again.

On Sun, May 10, 2020, 19:42 Fernando Marcos Wittmann <fernando.wittm...@gmail.com> wrote:

> Okay, so it's sampling with replacement, with the same size as the original dataset. That means that some of the samples would be repeated within each tree.
>
> On Sun, May 10, 2020, 19:40 Fernando Marcos Wittmann <fernando.wittm...@gmail.com> wrote:
>
>> My question is why the full dataset is being used as default when building each tree. That's not random forest. The main point of RF is to build each tree with a subsample of the full dataset.
>>
>> On Sun, May 10, 2020, 09:50 Joel Nothman wrote:
>>
>>> A bootstrap is very commonly a random draw with replacement of equal size to the original sample.

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?
Okay, so it's sampling with replacement, with the same size as the original dataset. That means that some of the samples would be repeated within each tree.
Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?
My question is why the full dataset is being used as default when building each tree. That's not random forest. The main point of RF is to build each tree with a subsample of the full dataset.
Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?
A bootstrap is very commonly a random draw with replacement of equal size to the original sample.
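
To make the difference concrete, here is a minimal sketch using only the Python standard library. It is not scikit-learn's internal code; the dataset size, the seed, and the variable names are assumptions chosen for illustration. It compares a bootstrap of equal size with a 2/3 subsample drawn without replacement.

```python
import random

rng = random.Random(0)   # fixed seed, chosen arbitrarily for this sketch
n_samples = 10_000       # hypothetical dataset size

# Bootstrap: n_samples draws WITH replacement, so a row can repeat within one tree.
bootstrap_idx = rng.choices(range(n_samples), k=n_samples)

# Subsample WITHOUT replacement: 2/3 of the rows, no repeats within a tree.
subsample_idx = rng.sample(range(n_samples), k=2 * n_samples // 3)

# Fraction of the original rows that appear at least once in each draw.
bootstrap_unique = len(set(bootstrap_idx)) / n_samples
subsample_unique = len(set(subsample_idx)) / n_samples

print(f"bootstrap: {len(bootstrap_idx)} draws, {bootstrap_unique:.3f} of rows seen")
print(f"subsample: {len(subsample_idx)} draws, {subsample_unique:.3f} of rows seen")
```

For large datasets a bootstrap of equal size is expected to contain about 1 - 1/e (roughly 63%) of the distinct rows, so each tree still sees a different, incomplete view of the data even though it draws X.shape[0] samples.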
[scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?
When reading the documentation of Random Forest, I found the following:

```
max_samples : int or float, default=None
    If bootstrap is True, the number of samples to draw from X
    to train each base estimator.

    - *If None (default), then draw `X.shape[0]` samples.*
    - If int, then draw `max_samples` samples.
    - If float, then draw `max_samples * X.shape[0]` samples. Thus,
      `max_samples` should be in the interval `(0, 1)`.
```

Why is the whole dataset (i.e. X.shape[0] samples from X) used to build each tree? That would be equivalent to setting bootstrap to False, right? Wouldn't it be better practice to use 2/3 of the size of the dataset as the default?
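
The three documented cases can be sketched as follows. `n_draws_per_tree` is a hypothetical helper written for this example, not a scikit-learn function, and rounding the float case to the nearest integer is an assumption. The draws are made with replacement, which is why the default is not the same as training every tree on the full dataset.

```python
import random

def n_draws_per_tree(n_samples, max_samples=None):
    """Hypothetical helper mirroring the documented max_samples rules."""
    if max_samples is None:
        return n_samples                    # default: X.shape[0] draws
    if isinstance(max_samples, int):
        return max_samples                  # int: exactly max_samples draws
    # float: a fraction of X.shape[0]; the rounding rule is assumed here
    return round(max_samples * n_samples)

n_samples = 9_000          # hypothetical dataset size
rng = random.Random(0)     # fixed seed, chosen arbitrarily

for max_samples in (None, 6_000, 2 / 3):
    k = n_draws_per_tree(n_samples, max_samples)
    idx = rng.choices(range(n_samples), k=k)       # draws WITH replacement
    seen = len(set(idx)) / n_samples               # distinct rows in this draw
    print(f"max_samples={max_samples!r}: {k} draws, {seen:.3f} of rows seen")
```

With the default, roughly 63% of the distinct rows end up in each tree's sample; drawing only 2/3 as many samples with replacement would lower that to roughly 49%.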