I¹m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, and when I plot an ROC curve for the folds, the results look very strange, as there are a handful of clustered points on the graph. Here is my cross validation code, I based it off of the samples on the scikit website:
skf = StratifiedKFold(y, n_folds=numfolds)
for train_index, test_index in skf:
#split the training and testing sets
X_train, X_test = X_scaled[train_index], X_scaled[test_index]
y_train, y_test = y[train_index], y[test_index]
#train on the subset for this fold
print 'Training on fold ' + str(fold)
classifier = svm.SVC(C=C_val, kernel='rbf', gamma=gamma_val,
probability=True)
probas_ = classifier.fit(X_train, y_train).predict_proba(X_test)
#Compute ROC curve and area the curve
fpr, tpr, thresholds = roc_curve(y_test, probas_[:, 1])
mean_tpr += interp(mean_fpr, fpr, tpr)
mean_tpr[0] = 0.0
roc_auc = auc(fpr, tpr)
I¹m just trying to figure out if there¹s something I¹m obviously missing
here, since I used this same training set and SVM parameters with libsvm and
got much better results. When I used libsvm and printed out the distances
from the hyperplane for the CV test instances and then plotted the ROC, it
came out much more like I expected, and a much better AUC. Any pointers
would be greatly appreciated!
Brett Meyer
smime.p7s
Description: S/MIME cryptographic signature
------------------------------------------------------------------------------ HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions Find What Matters Most in Your Big Data with HPCC Systems Open Source. Fast. Scalable. Simple. Ideal for Dirty Data. Leverages Graph Analysis for Fast Processing & Easy Data Exploration http://p.sf.net/sfu/hpccsystems
_______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
