On Wed, 11 Oct 2017, Anna Manelis wrote:

> Dear Experts,
>
> I have an unbalanced dataset (group1=16 subjects ('better'), group2=12
> subjects ('worse')).

is there any grouping of subjects, or does each subject have a dedicated
"chunk"?

> I would like to perform Monte-Carlo testing of the SMLR between-subject
> classification analysis (performed on cope images). I used the
> permutation_test.py script as an example and tried to adjust it for my
> unbalanced dataset.
> Any feedback on whether the script below makes sense is greatly
> appreciated.

well -- all the unbalanced cases are "tricky" to say the least ;)  I think
that, for peace of mind, you should also try your pipeline on totally random
data (ds.samples = np.random.normal(size=ds.shape)) to make sure that it is
"garbage in, garbage out", and nothing comes out "significant" (well, it
should -- but only in a "p" portion of the cases ;))
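e.g. something like this (just a sketch; ds here stands for your loaded
dataset, and cv for the final CrossValidation with the null_dist from your
script below):

  import numpy as np

  ds_random = ds.copy()
  # replace the data with pure noise; targets/chunks stay intact, so the
  # whole partitioning/permutation machinery is exercised exactly as before
  ds_random.samples = np.random.normal(size=ds_random.shape)

  err = cv(ds_random)
  print(np.asscalar(err.ca.null_prob))
  # across many such runs, p < 0.05 should show up in only about 5% of them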
> lm = 0.2 # The value was determined through the nested cross-validation
> procedure

if it was truly nested, you would most likely have ended up with a different
lm per each split.  or in other words: how was it done, and among which lm
values did you search?  Feels like some double dipping might be going on.
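to illustrate what I mean by "nested" here -- a minimal sketch (the grid of
candidate lm values and the errorfx are purely illustrative assumptions, and
ds stands for your full dataset): for every outer training set, an inner
cross-validation picks lm, so the outer test set never influences the choice:

  import numpy as np
  import mvpa2.suite as mv

  candidate_lms = [0.1, 0.2, 0.5, 1.0]    # hypothetical grid
  partitioner = mv.NFoldPartitioner()
  splitter = mv.Splitter('partitions')    # partitions: 1 -- train, 2 -- test

  outer_errors = []
  for outer in partitioner.generate(ds):
      ds_train, ds_test = list(splitter.generate(outer))
      # inner CV on the training portion only
      inner_errors = [
          np.asscalar(
              mv.CrossValidation(mv.SMLR(lm=lm), mv.NFoldPartitioner(),
                                 errorfx=mv.mean_mismatch_error,
                                 postproc=mv.mean_sample())(ds_train).samples)
          for lm in candidate_lms]
      best_lm = candidate_lms[int(np.argmin(inner_errors))]
      clf = mv.SMLR(lm=best_lm)           # note: lm may differ per outer split
      clf.train(ds_train)
      predictions = np.asarray(clf.predict(ds_test))
      outer_errors.append(np.mean(predictions != ds_test.targets))
  print(np.mean(outer_errors))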
> clf = SMLR(lm=lm, fit_all_weights=False, descr='SMLR(lm=%g)' % lm)
>
> ### how often do we want to shuffle the data
> repeater = Repeater(count=200)
>
> ### permute the training part of a dataset exactly ONCE
> permutator = AttributePermutator('targets', limit={'partitions': 1},
>                                  count=1)
>
> ### Make sure that each group ("better" and "worse") has the same number of
> ### samples
> par = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'),
>                  Sifter([('partitions', 2),
>                          ('targets', dict(uvalues=['better', 'worse'],
>                                           balanced=True))])])

so here you are selecting only those partitionings where the testing set has
a balanced # of samples... which means the training set will remain
unbalanced.  That would lead SMLR to tend to classify into "better".  What
does your cv.ca.stats look like -- is my fear confirmed?  here is an example:

In [51]: import mvpa2.suite as mv

In [52]: cv = mv.CrossValidation(mv.SMLR(), mv.NFoldPartitioner(),
   ....:                         errorfx=mv.mean_mismatch_error,
   ....:                         enable_ca=['stats'])

In [53]: ds = mv.normal_feature_dataset(perlabel=10, nlabels=2, nfeatures=4,
   ....:                                nonbogus_features=[0, 1], snr=0)

In [54]: res = cv(ds); print cv.ca.stats  # on random data -- near chance result
----------.
predictions\targets  L0    L1
            `------ ----  ----   P'   N'   FP   FN   PPV   NPV  TPR  SPC   FDR   MCC    F1  AUC
        L0            5     6    11    9    6    5   0.45  0.44  0.5  0.4  0.55  -0.1  0.48  0.3
        L1            5     4     9   11    5    6   0.44  0.45  0.4  0.5  0.56  -0.1  0.42  0.3
Per target:         ----  ----
        P            10    10
        N            10    10
        TP            5     4
        TN            4     5
Summary \ Means:    ----  ----   10   10   5.5  5.5  0.45  0.45 0.45 0.45  0.55  -0.1  0.45  0.3
       CHI^2        0.4  p=0.94
        ACC         0.45
        ACC%        45
  # of sets         5
     ACC(i) = 0.6-0.075*i p=0.32 r=-0.57 r^2=0.32

In [55]: ds_ = ds[[0, 2, 4, 6, 8, 10] + range(11, 20)]

In [56]: print ds_.summary()
Dataset: 15x4@float64, <sa: chunks,targets>, <fa: nonbogus_targets>, <a: bogus_features,nonbogus_features>
stats: mean=0.00169964 std=0.315902 var=0.0997943 min=-0.600959 max=0.737558

Counts of targets in each chunk:
 chunks\targets  L0  L1
                --- ---
       0         1   2
       1         1   2
       2         1   2
       3         1   2
       4         1   2

Summary for targets across chunks
 targets  mean  std  min  max  #chunks
   L0      1     0    1    1      5
   L1      2     0    2    2      5

Summary for chunks across targets
 chunks  mean  std  min  max  #targets
   0      1.5  0.5   1    2      2
   1      1.5  0.5   1    2      2
   2      1.5  0.5   1    2      2
   3      1.5  0.5   1    2      2
   4      1.5  0.5   1    2      2

Sequence statistics for 15 entries from set ['L0', 'L1']
Counter-balance table for orders up to 2:
 Targets/Order  O1    |  O2    |
      L0:       4 1   |  3 2   |
      L1:       0 9   |  0 8   |
Correlations: min=-0.5 max=0.7 mean=-0.071 sum(abs)=5.8

In [57]: res = cv(ds_); print cv.ca.stats  # on random and unbalanced data
----------.
predictions\targets  L0    L1
            `------ ----  ----   P'   N'   FP   FN   PPV   NPV  TPR  SPC   FDR   MCC    F1  AUC
        L0            1     3     4   11    3    4   0.25  0.64  0.2  0.7  0.75 -0.11  0.22  0.6
        L1            4     7    11    4    4    3   0.64  0.25  0.7  0.2  0.36 -0.11  0.67  0.6
Per target:         ----  ----
        P             5    10
        N            10     5
        TP            1     7
        TN            7     1
Summary \ Means:    ----  ----   7.5  7.5  3.5  3.5  0.44  0.44 0.45 0.45  0.56 -0.11  0.44  0.6
       CHI^2        3.4  p=0.33
        ACC         0.53
        ACC%        53.33
  # of sets         5
     ACC(i) = 0.67-0.067*i p=0.65 r=-0.28 r^2=0.08

although the overall accuracy might be low, you can see that the classifier
tends to misclassify into L1, so the overall mean accuracy is a "suboptimal"
measure in this case.  What we could estimate instead is the mean of the
per-class accuracies -- that one behaves better in the case of unbalanced
datasets.  So in this case, with mean_mismatch_error (mean_match_accuracy is
what gets reported in the ca.stats) we are getting

In [63]: (1+7)/(5.+10)
Out[63]: 0.5333333333333333

whereas for such cases we should rather use (and maybe in general, while also
taking care about chunks, since splits could be unbalanced differently; in
this example I just assume a single "measurement"):

In [62]: np.mean([1/5., 7/10.])
Out[62]: 0.44999999999999996

the numbers here are not really demonstrative though, so let's consider a
more extreme case, where all the samples get classified into the "majority"
class:

In [70]: cm = mv.ConfusionMatrix()

In [71]: cm.add(['L1']*5 + ['L2']*10, ['L2'] * 15)

In [72]: print cm
----------.
predictions\targets  L1    L2
            `------ ----  ----   P'   N'   FP   FN   PPV   NPV  TPR  SPC   FDR   MCC    F1
        L1            0     0     0   15    0    5   nan   0.67   0    1   nan    0     0
        L2            5    10    15    0    5    0   0.67  nan    1    0   0.33   0    0.8
Per target:         ----  ----
        P             5    10
        N            10     5
        TP            0    10
        TN           10     0
Summary \ Means:    ----  ----   7.5  7.5  2.5  2.5  nan   nan  0.5  0.5   nan    0    0.4
       CHI^2        15  p=0.0018
        ACC         0.67
        ACC%        66.67
  # of sets         1

woohoo -- 66% correct!  but if we compute the mean of the per-class
accuracies:

In [73]: np.mean([0/5., 10/10.])
Out[73]: 0.5

1. Unfortunately "lazy" us didn't implement such a dedicated error function
yet, BUT you already have that value as part of the confusion matrix stats:

In [84]: print cm.stats['mean(TPR)']
0.5

which is the mean of the TPRs per each label, which in turn is what we are
interested in here.

2. I would not say that your above approach (though see the comment below
about postproc) is without ground.  Since you ensure that the test set always
has one sample of each class, the training set is always consistently
unbalanced, and your null distribution should therefore account for the
"tend to classify into the majority class" effect, so I expect the p-value
for the true classification performance to be "legit".  What does your chance
distribution look like?  (distr_est.ca.dist_samples)

meanwhile I will prep the "mean_tpr" function (which should match
mean_match_accuracy in the balanced case) and a "mean_fnr" (which should
match mean_mismatch_error in the balanced case) (if I didn't mix anything up)
to use in such cases.
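in the meantime, something along these lines should work as a stand-in (just
a sketch of the idea, not (yet) the PyMVPA implementation; it follows the
errorfx convention of taking (predicted, targets) sequences):

  import numpy as np

  def mean_tpr(predicted, targets):
      # mean of the per-label accuracies (true positive rates), which is
      # insensitive to class disbalance, unlike the plain mean accuracy
      predicted = np.asarray(predicted)
      targets = np.asarray(targets)
      return np.mean([np.mean(predicted[targets == t] == t)
                      for t in np.unique(targets)])

  def mean_fnr(predicted, targets):
      # its complement, to use as an error function
      return 1 - mean_tpr(predicted, targets)

e.g. for the degenerate case above, mean_tpr(['L2'] * 15, ['L1']*5 +
['L2']*10) gives 0.5 instead of the flattering 0.67.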
> ### I believe this will apply permutator on each fold created by par.
> par_perm = ChainNode([par, permutator], space=par.get_space())
>
> # CV with null-distribution estimation that permutes the training data for
> # each fold independently
> null_cv = CrossValidation(
>     clf,
>     par_perm,
>     errorfx=mean_mismatch_error)

such a null_cv, and the cv below, will provide an err per each split (do a
"print err" to see that).  You should add postproc=mean_sample() to both of
them.

> # Monte Carlo distribution estimator
> distr_est = MCNullDist(repeater, tail='left', measure=null_cv,
>                        enable_ca=['dist_samples'])
>
> # actual CV with H0 distribution estimation
> cv = CrossValidation(clf, par, errorfx=mean_mismatch_error,
>                      null_dist=distr_est, enable_ca=['stats'])
>
> err = cv(ds_mni_bw)
> p = err.ca.null_prob
> np.asscalar(p)

--
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik
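P.S. for clarity, those two constructors with the suggested postproc added
(everything else exactly as in your script):

  null_cv = CrossValidation(clf, par_perm,
                            errorfx=mean_mismatch_error,
                            postproc=mean_sample())

  cv = CrossValidation(clf, par,
                       errorfx=mean_mismatch_error,
                       postproc=mean_sample(),
                       null_dist=distr_est, enable_ca=['stats'])

with that, each invocation yields a single mean error across the splits, and
the null distribution is built from one value per permutation.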