Dear Scikit-learn community,
I have been reading the examples at
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
about the feature importances (MDI and permutation-based) that can be assessed after
fitting a tree-based model (e.g. RandomForestClassifier).
However, I have noticed a discrepancy that I would like to mention. If a
one-hot-encoding step is used before model fitting, the `.feature_importances_`
attribute includes an importance for every level of the transformed
categorical features (e.g. for gender, we get two importances, one for 'Male'
and one for 'Female').
When I apply the `permutation_importance` function
(https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance),
however, the output corresponds to the original, non-transformed columns. To
illustrate this, I have included a toy example as a .py script below.
Best,
Makis
#!/usr/bin/env python
# coding: utf-8
# In[1]:
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# In[2]:
df = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female", "Female", "Female"],
        "group": [1, 1, 2, 2, 3, 3],
        "score": [10, 30, 20, 40, 50, 90],
    }
)
df
# In[5]:
# Make Pipeline
numeric_features = ['score']
categorical_features = ['gender']
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# make model
model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42, n_jobs=-1
)
pipe = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
# fit model
X = df.loc[:, df.columns != "group"]
y = df.group
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
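# In[ ]:
# Quick check (sketch): the fitted preprocessor step exposes the transformed column
# names. With this toy data it should return something like
# ['num__score', 'cat__gender_Female', 'cat__gender_Male']
# (exact prefixes depend on the scikit-learn version).
pipe[:-1].get_feature_names_out()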
# In[6]:
# Plot feature importance
plt.figure(figsize=(12, 10))
importances = pipe[-1].feature_importances_
feature_names = pipe[:-1].get_feature_names_out().tolist()
_ = plt.bar(range(len(feature_names)), importances)
_ = plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.title("Feature importances using MDI")
plt.ylabel("Mean decrease in impurity")
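# In[ ]:
# Sketch: pairing each MDI importance with its transformed column name makes the
# one-hot expansion explicit (one entry per encoded level, e.g. gender_Female and
# gender_Male).
pd.Series(importances, index=feature_names).sort_values(ascending=False)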
# In[16]:
# Try permutation importance (MDA, mean decrease in accuracy): https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
from sklearn.inspection import permutation_importance
result = permutation_importance(pipe, X_test, y_test, n_repeats=1, random_state=42, n_jobs=2)
# In[17]:
result.importances_mean.shape
# In[18]:
# Here, importances_mean has only 2 elements, corresponding to the original score and gender columns.
# BUT in the MDI case, pipe[-1].feature_importances_ has one entry per one-hot-encoded column.
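# In[ ]:
# Sketch of a possible workaround (assumes the ColumnTransformer output is dense,
# which it is for this toy data): transform X_test first and permute the
# classifier's inputs directly, so the permutation importances are reported per
# one-hot-encoded column rather than per original column.
X_test_trans = pipe[:-1].transform(X_test)
result_encoded = permutation_importance(
    pipe[-1], X_test_trans, y_test, n_repeats=1, random_state=42
)
# result_encoded.importances_mean now has one entry per transformed column, in the
# order given by pipe[:-1].get_feature_names_out()
result_encoded.importances_mean.shape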
# In[ ]: