Thanks, Javier,
however, `max_features` is n_features by default. But if you execute something like
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

for i in range(20):
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))
you will find that the tree produces different results across runs if you don't
fix the random seed. Related to what you said about the random feature selection
when `max_features` is not n_features, I suspect there is generally some sorting
of the features going on, and that the different trees are then due to
tie-breaking when two features have the same information gain?
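One way to probe that hypothesis (just a quick sketch on my end, assuming the
tie happens during split selection): duplicate a column so that two features
carry exactly the same information, and then look at which feature index ends
up at the root node via tree_.feature[0], first without a seed and then with a
fixed random_state:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# duplicate one column so two features tie exactly on every candidate split
X_dup = np.hstack([X, X[:, [2]]])

root_features = set()
for _ in range(20):
    tree = DecisionTreeClassifier()  # random_state=None
    tree.fit(X_dup, y)
    root_features.add(tree.tree_.feature[0])  # feature index used at the root
print(root_features)  # more than one index would point to tie-breaking

seeded = set()
for _ in range(20):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_dup, y)
    seeded.add(tree.tree_.feature[0])
print(seeded)  # should be a single index once the seed is fixed

If both the original and the duplicated column show up at the root across
unseeded runs, that would at least be consistent with tie-breaking rather
than, say, some random subselection of thresholds.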
Best,
Sebastian
> On Oct 27, 2018, at 6:16 PM, Javier López <[email protected]> wrote:
>
> Hi Sebastian,
>
> I think the random state is used to select the features that go into each
> split (look at the `max_features` parameter)
>
> Cheers,
> Javier
>
> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka
> <[email protected]> wrote:
> Hi all,
>
> when I was implementing a bagging classifier based on scikit-learn's
> DecisionTreeClassifier, I noticed that the results were not deterministic and
> found that this was due to the random_state in the DecisionTreeClassifier
> (which is set to None by default).
>
> I am wondering what exactly this random state is used for? I can imagine it
> being used for resolving ties if the information gain for multiple features
> is the same, or it could be that the feature splits of continuous features
> are different? (I thought the heuristic is to sort the feature values and to
> consider those values next to each other that are associated with examples
> having different class labels -- but is there maybe some random subselection
> involved?)
>
> If someone knows more about this, where the random_state is used, I'd be
> happy to hear it :)
>
> Also, we could then maybe add the info to the DecisionTreeClassifier's
> docstring, which is currently a bit too generic to be useful, I think:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>
>
> random_state : int, RandomState instance or None, optional (default=None)
> If int, random_state is the seed used by the random number generator;
> If RandomState instance, random_state is the random number generator;
> If None, the random number generator is the RandomState instance used
> by `np.random`.
>
>
> Best,
> Sebastian
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn