Hi, I have recently started learning machine learning, and this is my first question. It is about prediction accuracy.
There is a difference between the prediction accuracy I compute from SGDClassifier's predictions and the scores returned by cross_val_score.

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import SGDClassifier

    mnist = fetch_openml('mnist_784', version=1, cache=True)
    X, y = mnist['data'], mnist['target']
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    shuffled_index = np.random.permutation(60000)  # shuffle the 0 - 60000 range
    X_train, y_train = X_train[shuffled_index], y_train[shuffled_index]

    y_train_5 = (y_train == '5')
    y_test_5 = (y_test == '5')

    sgd_clf = SGDClassifier(random_state=42, tol=1e-3, max_iter=1000)
    sgd_clf.fit(X_train, y_train_5)

    # Predicting for all 5s
    print("####### PREDICTION STATS ##############")
    y_train_5_pred = sgd_clf.predict(X_train)
    print("Total y_train_5 [False|True both]]:", len(y_train_5))
    print("Total y_train_5 [Only 5s]:", sum(y_train_5))
    # some other digit may be predicted as 5 and some 5s may be predicted as not 5
    print("Predicted 5s:", sum(y_train_5_pred))
    correctly_predicted = sum(np.logical_and(y_train_5_pred, y_train_5))
    print("Correct Predicted", correctly_predicted)
    print("Accuracy:", correctly_predicted/sum(y_train_5) * 100)

    from sklearn.model_selection import cross_val_score
    cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')

My output:

    ####### PREDICTION STATS ##############
    Total y_train_5 [False|True both]]: 60000
    Total y_train_5 [Only 5s]: 5421
    Predicted 5s: 3863
    Correct Predicted 3574
    Accuracy: 65.9287954251983
    array([0.9323 , 0.96805, 0.9641 ])
    #######################################

So, as per my observation, there is a difference. Why?

The SGDClassifier figure I computed is ~65.92%, while the cross_val_score results are ~95%.

Am I comparing them in the wrong way, or am I missing something?

Thanks,
Rajnish
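One thing that may help the comparison: the ratio printed as "Accuracy" in the script divides the correctly predicted 5s by the number of actual 5s, which is usually called recall, whereas scoring='accuracy' counts every correct prediction, including the correctly rejected non-5s. The sketch below is only arithmetic on the counts printed in the output above; the variable names are made up for illustration and are not part of scikit-learn.

```python
# Counts copied from the output printed in the post.
total = 60000            # all training samples
actual_fives = 5421      # samples whose true label is 5
predicted_fives = 3863   # samples the classifier labelled as 5
true_positives = 3574    # predicted 5 and actually 5

# Derive the remaining cells of the confusion matrix.
false_positives = predicted_fives - true_positives   # predicted 5, not a 5
false_negatives = actual_fives - true_positives      # a 5, predicted not-5
true_negatives = total - true_positives - false_positives - false_negatives

# The ratio computed in the post (correct 5s / actual 5s) is recall.
recall = true_positives / actual_fives
precision = true_positives / predicted_fives
# Accuracy counts correct predictions of both classes.
accuracy = (true_positives + true_negatives) / total

print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

With these counts the accuracy on the training set works out to 57864 / 60000, about 0.9644, which is in the same range as the cross_val_score values. It need not match them exactly, since cross_val_score refits the model and scores it on held-out folds, while the counts above come from predicting on the data the model was trained on.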