Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?

2020-05-10 Thread Fernando Marcos Wittmann
Ohh, I can see my mistake now after reviewing the concept of bootstrapping
and sampling with replacement. I was assuming that the "replacement"
happened only after finishing each tree (i.e. if I was sampling 2/3 of the
data, the very same data could be selected again for each tree, but no
element would be repeated in a given tree). My apologies. Everything makes
sense again.
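
For reference, here is a minimal sketch of the distinction discussed in this thread, using only the Python standard library. The dataset size `n` and the seed are arbitrary choices for illustration, and this is not scikit-learn's implementation. Drawing `n` indices with replacement from `n` rows produces repeats within a single draw, so only part of the distinct rows appear in it.

```python
import random

n = 10_000                # hypothetical dataset size, chosen for illustration
rng = random.Random(0)    # fixed seed so the run is repeatable

# Bootstrap: n draws WITH replacement from n row indices.
bootstrap_idx = rng.choices(range(n), k=n)

# Subsample: 2/3 of the rows WITHOUT replacement (the scheme assumed at first).
subsample_idx = rng.sample(range(n), k=2 * n // 3)

# Fraction of the original rows that appear at least once in the bootstrap.
unique_fraction = len(set(bootstrap_idx)) / n

print(f"bootstrap: {len(bootstrap_idx)} draws, distinct fraction {unique_fraction:.3f}")
print(f"subsample: {len(subsample_idx)} draws, all distinct: "
      f"{len(set(subsample_idx)) == len(subsample_idx)}")
```

For a draw of this kind, the expected fraction of distinct rows approaches 1 - 1/e (about 0.632) as `n` grows, which is close to the 2/3 figure mentioned in the thread.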

On Sun, May 10, 2020, 19:42 Fernando Marcos Wittmann <
fernando.wittm...@gmail.com> wrote:

> Okay, so it's sampling with replacement, with the same size as the original
> dataset. That means that some of the samples would be repeated for each tree.
>
> On Sun, May 10, 2020, 19:40 Fernando Marcos Wittmann <
> fernando.wittm...@gmail.com> wrote:
>
>> My question is why the full dataset is used by default when building
>> each tree. That's not a random forest. The main point of RF is to build
>> each tree with a subsample of the full dataset.
>>
>> On Sun, May 10, 2020, 09:50 Joel Nothman wrote:
>>
>>> A bootstrap is very commonly a random draw with replacement of equal
>>> size to the original sample.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?

2020-05-10 Thread Fernando Marcos Wittmann
Okay, so it's sampling with replacement, with the same size as the original
dataset. That means that some of the samples would be repeated for each tree.

On Sun, May 10, 2020, 19:40 Fernando Marcos Wittmann <
fernando.wittm...@gmail.com> wrote:

> My question is why the full dataset is used by default when building each
> tree. That's not a random forest. The main point of RF is to build each
> tree with a subsample of the full dataset.
>
> On Sun, May 10, 2020, 09:50 Joel Nothman wrote:
>
>> A bootstrap is very commonly a random draw with replacement of equal size
>> to the original sample.


Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?

2020-05-10 Thread Fernando Marcos Wittmann
My question is why the full dataset is used by default when building each
tree. That's not a random forest. The main point of RF is to build each
tree with a subsample of the full dataset.

On Sun, May 10, 2020, 09:50 Joel Nothman wrote:

> A bootstrap is very commonly a random draw with replacement of equal size
> to the original sample.


Re: [scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?

2020-05-10 Thread Joel Nothman
A bootstrap is very commonly a random draw with replacement of equal size
to the original sample.


[scikit-learn] Why the default max_samples of Random Forest is X.shape[0]?

2020-05-08 Thread Fernando Marcos Wittmann
When reading the documentation of Random Forest, I found the following:
```
max_samples : int or float, default=None
    If bootstrap is True, the number of samples to draw from X to train
    each base estimator.

    - *If None (default), then draw `X.shape[0]` samples.*
    - If int, then draw `max_samples` samples.
    - If float, then draw `max_samples * X.shape[0]` samples. Thus,
      `max_samples` should be in the interval `(0, 1)`.
```

Why is the whole dataset (i.e. X.shape[0] samples from X) used to build
each tree? That would be equivalent to setting bootstrap to False, right?
Wouldn't it be better practice to use 2/3 of the size of the dataset as
the default?
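
The rule quoted from the documentation can be sketched as a small helper. The function name `resolve_n_draws` is hypothetical and the rounding used in the float case is an assumption. This is not scikit-learn's internal code, only an illustration of the documented behaviour of `max_samples`.

```python
def resolve_n_draws(n_rows, max_samples=None):
    """Rows drawn per tree when bootstrap is True, following the quoted docs.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    if max_samples is None:
        # Default: draw X.shape[0] rows.
        return n_rows
    if isinstance(max_samples, int):
        # Int: draw exactly max_samples rows.
        return max_samples
    # Float: draw a fraction of the rows (rounding here is an assumption).
    return max(1, round(max_samples * n_rows))


n_rows = 150  # hypothetical X.shape[0]
print(resolve_n_draws(n_rows))         # default
print(resolve_n_draws(n_rows, 50))     # int
print(resolve_n_draws(n_rows, 2 / 3))  # float
```

As the replies in this thread point out, the default is not the same as setting bootstrap to False: the `X.shape[0]` rows are drawn with replacement, so each tree sees a different sample containing repeated rows.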