Hi Ben.
I'd recommend you check the code to see how the data is generated.
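For example, from an interactive session something like this (a minimal sketch using the standard-library inspect module) will print the source so you can see how the informative and noise features are drawn:

import inspect
from sklearn.datasets import make_classification

# Print the source of make_classification to see how the informative
# and noise features are constructed
print(inspect.getsource(make_classification))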

Best,
Andy

On 4/3/20 7:00 AM, Benoît Presles wrote:
Dear sklearn users,

I have just checked whether the generated features are independent by computing the covariance and correlation matrices, and it seems they are, so I really do not understand my results.
Any idea?
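For reference, a minimal sketch of such a check (using numpy's corrcoef on one dataset generated with the same settings as in the code below; near-zero off-diagonal correlations only show the features are linearly uncorrelated, not strictly independent):

import numpy as np
from sklearn.datasets import make_classification

# One example dataset, same settings as the simulation below
X, y = make_classification(n_samples=2500, n_features=40, n_informative=20,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=4, shuffle=False)

# Correlation matrix of the 40 features
corr = np.corrcoef(X, rowvar=False)

# Largest off-diagonal correlation in absolute value
off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
print('max |off-diagonal correlation| =', np.abs(off_diag).max())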

Thanks for your help,
Best regards,
Ben


On 31/03/2020 at 15:48, Benoît Presles wrote:
Dear sklearn users,

I did some supervised classification simulations with the make_classification function from sklearn, increasing the number of informative features from 1 out of 40 to 40 out of 40 (100%). I did not generate any repeated or redundant features. I fixed the number of classes to two and the number of clusters per class to one.

I split the dataset 100 times using the StratifiedShuffleSplit function into a training set and a test set (80% / 20%). For each split I performed a logistic regression and calculated the training and testing accuracies, then averaged the results over the 100 splits, leading to a mean training accuracy and a mean testing accuracy.

I was expecting the accuracy to increase with the number of informative features, for both the training and the test sets. On the contrary, I get the best training and test scores with a single informative feature. Why do I get these results?

Thanks for your help,
Best regards,
Ben

Below is the simulation code I wrote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4
n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])

mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])
for n_inf_value in n_inf:
    X, y = make_classification(n_samples=2500,
                               n_features=40,
                               n_informative=n_inf_value,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED,
                               shuffle=False)
    #
    print('Simulated data - number of informative features = ' + str(n_inf_value))
    #
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=RANDOM_SEED)
    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        lr = LogisticRegression(fit_intercept=True, max_iter=int(1e9), verbose=0,
                                random_state=RANDOM_SEED, solver='lbfgs', tol=1e-6, C=10)
        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array, accuracy_train_score)
        testing_score_array = np.append(testing_score_array, accuracy_test_score)
    mean_training_score_array = np.append(mean_training_score_array, np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array, np.average(testing_score_array))
#
print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))
#
plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn