Thanks, Javier,
however, `max_features` is n_features by default. But if you execute something like
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

for i in range(20):
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))
you will find that the tree produces different results across runs if you don't
fix the random seed. Related to what you said about the random feature selection
when `max_features` is not n_features, I suspect there is generally some sorting
of the features going on, and that the different trees are then due to
tie-breaking when two features have the same information gain?
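One way to probe that hypothesis (just a quick sketch on my end, assuming the
tie happens during split selection): duplicate a column so that two features
carry exactly the same information, and then look at which feature index ends
up at the root node via tree_.feature[0], first without a seed and then with a
fixed random_state:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# duplicate one column so two features tie exactly on every candidate split
X_dup = np.hstack([X, X[:, [2]]])

root_features = set()
for _ in range(20):
    tree = DecisionTreeClassifier()  # random_state=None
    tree.fit(X_dup, y)
    root_features.add(tree.tree_.feature[0])  # feature index used at the root
print(root_features)  # more than one index would point to tie-breaking

seeded = set()
for _ in range(20):
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_dup, y)
    seeded.add(tree.tree_.feature[0])
print(seeded)  # should be a single index once the seed is fixed

If both the original and the duplicated column show up at the root across
unseeded runs, that would at least be consistent with tie-breaking rather
than, say, some random subselection of thresholds.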
Best,
Sebastian
> On Oct 27, 2018, at 6:16 PM, Javier López <[email protected]> wrote:
>
> Hi Sebastian,
>
> I think the random state is used to select the features that go into each
> split (look at the `max_features` parameter)
>
> Cheers,
> Javier
>
> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka
> <[email protected]> wrote:
> Hi all,
>
> when I was implementing a bagging classifier based on scikit-learn's
> DecisionTreeClassifier, I noticed that the results were not deterministic and
> found that this was due to the random_state in the DecisionTreeClassifier
> (which is set to None by default).
>
> I am wondering what exactly this random state is used for? I can imagine it
> being used for resolving ties if the information gain for multiple features
> is the same, or it could be that the feature splits of continuous features
> are different? (I thought the heuristic is to sort the feature values and to
> consider those values next to each other that are associated with examples
> having different class labels -- but is there maybe some random subselection
> involved?)
>
> If someone knows more about this, where the random_state is used, I'd be
> happy to hear it :)
>
> Also, we could then maybe add the info to the DecisionTreeClassifier's
> docstring, which is currently a bit too generic to be useful, I think:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>
>
> random_state : int, RandomState instance or None, optional (default=None)
> If int, random_state is the seed used by the random number generator;
> If RandomState instance, random_state is the random number generator;
> If None, the random number generator is the RandomState instance used
> by `np.random`.
>
>
> Best,
> Sebastian
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn