Dear Yaroslav, dear all, hooray for simulations! :-)
I was not aware how profoundly classification performance is affected when the groups are not perfectly balanced. My tests using the 'npart' partitioner on clean and noisy test data showed the expected result (accuracy of 0.5 for non-signal voxels, 1 for the others). Cool!

Still, two questions remain:

a) Can I inspect what the individual partitions look like (i.e. which subject is additionally removed to make the groups balanced)?

b) How do I deal with groups that have a larger imbalance? I have already tried this with my dummy data: if I feed a dataset with imbalanced group sizes into the classification with the 'npart' partitioner, the result is random classification at all voxels. In my original data I have more participants in the second group than in the first, so for each partition I would need to restrict the size of the second group to the size of the first. My idea was to take everyone from group 1 and randomly pick the same number of participants from group 2. What's the best way to implement this?

Thanks a lot!
Ulrike

----- Original Message -----
From: "Yaroslav Halchenko" <[email protected]>
To: "pkg-exppsy-pymvpa" <[email protected]>
Sent: Tuesday, 17 November, 2015 16:18:20
Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel

On Tue, 17 Nov 2015, Ulrike Kuhl wrote:

> Here you go:

> print DS_clean.summary()
> Dataset: 20x3375@float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: mapper>
> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1
> Counts of targets in each chunk:
>   chunks\targets  0   1
>                  --- ---
>         0         1   0
>         1         1   0
>         2         1   0
>         3         1   0
>         4         1   0
>         5         1   0
>         6         1   0
>         7         1   0
>         8         1   0
>         9         1   0
>        10         0   1
>        11         0   1
>        12         0   1
>        13         0   1
>        14         0   1
>        15         0   1
>        16         0   1
>        17         0   1
>        18         0   1
>        19         0   1

So the problem is that each chunk holds only one sample, and overall you have only 20 samples to train on. Whenever you NFold-partition it, you end up with 10 training samples of one target and 9 of the other.
If there is a clear signal, the error is minimized by correct labeling. If there is no signal, the error is minimized by always predicting the class with the majority of training samples (10 vs 9), which then always misclassifies the held-out sample, since it belongs to the opposite class.

The fun part: had you just run it on real data, you most probably would have got a strong negative bias but possibly still some reasonable around-chance performances... and then would have scratched your head a lot. So, simulations rule! ;)

The simplest way to handle it: guarantee a balanced number of samples from both categories in the training (and thus testing) splits. There are two ways to do it:

1. Simplest but more ad hoc. Regroup the chunks so that each chunk brings together two samples, one from each class; you end up with 10 chunks and thus 10 splits if using NFold(1).

2. Create a partitioner which selects all possible combinations from the two classes, i.e. 10*10=100 splits. Two ways to do it:

   a. With the existing codebase, something like this should work:

        # assumes the usual PyMVPA import, which provides the names below
        from mvpa2.suite import *

        npart = ChainNode([
            NFoldPartitioner(len(ds.sa['targets'].unique), attr='chunks'),
            ## so it should select only those splits where we took 1 from
            ## each of the targets categories leaving things in balance
            Sifter([('partitions', 2),
                    ('targets', {'uvalues': ds.sa['targets'].unique,
                                 'balanced': True})]),
        ], space='partitions')

      In your case this does NFold(2) across chunks, i.e. it selects every combination of two chunks, but then uses only those which have balanced targets (Sifter removes the others).

   b. WiP https://github.com/PyMVPA/PyMVPA/pull/386 to simplify this, so it would look like:

        factpart = FactorialPartitioner(
            NFoldPartitioner(attr='chunks'),
            attr='targets')

N.B. Matteo -- one more testcase to test! ;)

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa

-- 
Max Planck Institute for Human Cognitive and Brain Sciences
Department of Neuropsychology (A219)
Stephanstraße 1a
04103 Leipzig

Phone: +49 (0) 341 9940 2625
Mail: [email protected]
Internet: http://www.cbs.mpg.de/staff/kuhl-12160
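
For illustration, the negative bias described in the quoted reply can be reproduced without PyMVPA. The following is only a minimal sketch in plain Python, not PyMVPA code: it assumes a "classifier" that has no signal to work with and therefore always predicts the majority class of its training labels. All function names (majority_label, loo_majority_accuracy, balanced_leave_pair_out, balanced_subsample) are hypothetical helpers made up for this example. It compares leave-one-out on 10 vs 10 samples with the 10*10 balanced splits, and also sketches the "take everyone from group 1 and randomly pick the same number from group 2" idea for a larger imbalance.

```python
import random
from itertools import product

# Toy labels mirroring the dataset summary: 10 samples of target 0 and
# 10 of target 1, one sample per chunk (chunk id == sample index).
targets = [0] * 10 + [1] * 10


def majority_label(labels):
    """Most frequent label; a tie resolves to the smaller label."""
    return max(sorted(set(labels)), key=labels.count)


def loo_majority_accuracy(labels):
    """Leave-one-out accuracy of a predictor that ignores the features and
    always returns the majority class of its training labels."""
    hits = 0
    for i, held_out in enumerate(labels):
        train = labels[:i] + labels[i + 1:]
        hits += majority_label(train) == held_out
    return hits / len(labels)


def balanced_leave_pair_out(labels):
    """Yield (train_idx, test_idx); each test set holds one sample per class,
    so the training set stays balanced."""
    idx0 = [i for i, t in enumerate(labels) if t == 0]
    idx1 = [i for i, t in enumerate(labels) if t == 1]
    for a, b in product(idx0, idx1):
        test = (a, b)
        train = [i for i in range(len(labels)) if i not in test]
        yield train, test


def balanced_subsample(labels, rng):
    """Keep all samples of the smaller class and an equally sized random draw
    from every larger class; returns sorted sample indices."""
    by_class = {}
    for i, t in enumerate(labels):
        by_class.setdefault(t, []).append(i)
    n = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n))
    return sorted(keep)


# Leave-one-out: the held-out sample always belongs to the training minority.
loo_accuracy = loo_majority_accuracy(targets)

# Balanced splits: 9 vs 9 in training, one sample of each class in testing.
splits = list(balanced_leave_pair_out(targets))
pair_hits = sum(majority_label([targets[i] for i in train]) == targets[j]
                for train, test in splits for j in test)
pair_accuracy = pair_hits / (2 * len(splits))

print(loo_accuracy)    # 0.0
print(len(splits))     # 100
print(pair_accuracy)   # 0.5

# Larger imbalance (8 vs 13, made-up sizes): one random balanced subset.
imbalanced = [0] * 8 + [1] * 13
kept = balanced_subsample(imbalanced, random.Random(0))
```

With no signal, leave-one-out ends up at 0% rather than at chance, while the balanced splits bring the same predictor back to 50%. For the imbalanced case, balanced_subsample would presumably be called repeatedly with different seeds, and the partitioning then run on each balanced subset; whether that is the best way to do it inside PyMVPA is exactly the open question (b) above.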

