Dear Scikit-learn community,
I have been reading the examples at
https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
about the feature importances (MDI and permutation-based) that can be assessed after
fitting a tree-based model (e.g. RandomForestClassifier).
However, I have noticed a discrepancy that I would like to mention. If a
one-hot-encoding step is used before model fitting, the `.feature_importances_`
attribute includes an importance for every level of the transformed
categorical features (e.g. for gender, we get two importances, one for 'Male'
and one for 'Female').
When I apply the `permutation_importance` function
(https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance),
however, the output corresponds to the original, non-transformed columns. To
illustrate this, I have included a toy example as a .py script below.
Best,
Makis
#!/usr/bin/env python
# coding: utf-8
# In[1]:
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
# In[2]:
df = pd.DataFrame(
    {
        "gender": ["Male", "Female", "Male", "Female", "Female", "Female"],
        "group": [1, 1, 2, 2, 3, 3],
        "score": [10, 30, 20, 40, 50, 90],
    }
)
df
# In[5]:
# Make Pipeline
numeric_features = ['score']
categorical_features = ['gender']
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
# make model
model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42, n_jobs=-1
)
pipe = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
# fit model
X = df.loc[:, df.columns != "group"]
y = df.group
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe.fit(X_train, y_train)
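# In[ ]:
# Quick check (sketch): the fitted preprocessor step exposes the transformed column
# names. With this toy data it should return something like
# ['num__score', 'cat__gender_Female', 'cat__gender_Male']
# (exact prefixes depend on the scikit-learn version).
pipe[:-1].get_feature_names_out()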
# In[6]:
# Plot feature importance
plt.figure(figsize=(12, 10))
importances = pipe[-1].feature_importances_
feature_names = pipe[:-1].get_feature_names_out().tolist()
_ = plt.bar(range(len(feature_names)), importances)
_ = plt.xticks(range(len(feature_names)), feature_names, rotation=90)
plt.title("Feature importances using MDI")
plt.ylabel("Mean decrease in impurity")
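# In[ ]:
# Sketch: pairing each MDI importance with its transformed column name makes the
# one-hot expansion explicit (one entry per encoded level, e.g. gender_Female and
# gender_Male).
pd.Series(importances, index=feature_names).sort_values(ascending=False)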
# In[16]:
# Try permutation importance (MDA, mean decrease in accuracy): https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#feature-importance-based-on-mean-decrease-in-impurity
from sklearn.inspection import permutation_importance
result = permutation_importance(pipe, X_test, y_test, n_repeats=1, random_state=42, n_jobs=2)
# In[17]:
result.importances_mean.shape
# In[18]:
# Here, importances_mean has only 2 elements, corresponding to the original score and gender columns.
# BUT in the MDI case, pipe[-1].feature_importances_ has one entry per one-hot-encoded column.
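# In[ ]:
# Sketch of a possible workaround (assumes the ColumnTransformer output is dense,
# which it is for this toy data): transform X_test first and permute the
# classifier's inputs directly, so the permutation importances are reported per
# one-hot-encoded column rather than per original column.
X_test_trans = pipe[:-1].transform(X_test)
result_encoded = permutation_importance(
    pipe[-1], X_test_trans, y_test, n_repeats=1, random_state=42
)
# result_encoded.importances_mean now has one entry per transformed column, in the
# order given by pipe[:-1].get_feature_names_out()
result_encoded.importances_mean.shape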
# In[ ]: