Hi

I have recently started learning machine learning, and this is my first
question. It is about prediction accuracy.

I see a difference between the accuracy I compute by hand from
SGDClassifier's predictions and the scores returned by cross_val_score.

import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

mnist = fetch_openml('mnist_784', version=1, cache=True)
X, y = mnist['data'], mnist['target']
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffled_index = np.random.permutation(60000) # shuffle the 0 - 60000 range
X_train, y_train = X_train[shuffled_index], y_train[shuffled_index]

y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42, tol=1e-3, max_iter=1000)
sgd_clf.fit(X_train, y_train_5)

# Predicting for all 5s
print("####### PREDICTION STATS ##############")
y_train_5_pred = sgd_clf.predict(X_train)

print("Total y_train_5 [False|True both]]:", len(y_train_5))
print("Total y_train_5 [Only 5s]:", sum(y_train_5))

# some other digit may be predicted as 5 and some 5s may be predicted as not 5
print("Predicted 5s:", sum(y_train_5_pred))

correctly_predicted = sum(np.logical_and(y_train_5_pred, y_train_5))
print("Correct Predicted", correctly_predicted)
print("Accuracy:", correctly_predicted/sum(y_train_5) * 100)

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

*MY Output*

####### PREDICTION STATS ##############
Total y_train_5 [False|True both]]: 60000
Total y_train_5 [Only 5s]: 5421
Predicted 5s: 3863
Correct Predicted 3574
*Accuracy: 65.9287954251983*
array([*0.9323 , 0.96805, 0.9641* ])
#######################################

So there is a clear difference between the two numbers. Why?

My manual calculation says SGDClassifier is *~65.92%* accurate
cross_val_score reports *~95%*

Am I comparing them the wrong way, or am I missing something?
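
One possible explanation, offered as a sketch rather than a definitive
answer: the manual figure divides the correctly predicted 5s by the number
of actual 5s, which is usually called recall, whereas scoring='accuracy'
counts correct predictions over all 60000 samples, including the non-5s.
The snippet below recomputes both quantities from the counts in the output
above. The variable names are illustrative, and it assumes that "Correct
Predicted" means samples that are both predicted and labelled as 5.

```python
# Counts copied from the printed output above.
total = 60000         # all training samples
actual_5s = 5421      # samples whose true label is '5'
predicted_5s = 3863   # samples the classifier flagged as '5'
true_pos = 3574       # flagged as '5' and really a '5'

# Derive the rest of the confusion matrix from those four numbers.
false_pos = predicted_5s - true_pos           # non-5s flagged as '5'
false_neg = actual_5s - true_pos              # 5s that were missed
true_neg = total - actual_5s - false_pos      # non-5s correctly rejected

# The manual "Accuracy" in the script is this ratio (recall).
recall = true_pos / actual_5s
# Fraction of flagged samples that really are 5s.
precision = true_pos / predicted_5s
# What scoring='accuracy' measures: correct decisions over all samples.
accuracy = (true_pos + true_neg) / total

print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

With these counts, the accuracy over all samples comes out near 96.4%, in
line with the cross-validation scores, while the 65.9% figure matches
recall. Note also that cross_val_score evaluates on held-out folds, whereas
the manual figure is computed on the training data itself.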


Thanks

Rajnish
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
