Hi! I was re-checking some code I wrote a while ago when I stumbled upon something strange: my balancer does not seem to produce balanced groups for training and testing.
I have a toy dataset, DS_noisy (you can find it here: https://bigmail.cbs.mpg.de/i/d218f77e0f8ce3cbd5c301484ece951a.hdf5).

DS_noisy.summary() gives:

    Dataset: 11x3472@float32, <sa: chunks,subject,targets>, <fa: modality,modality_index,voxel_indices>, <a: imghdr,mapper>
    stats: mean=-7.70839e-07 std=0.999997 var=0.999994 min=-1.75513 max=3.98463

    Counts of targets in each chunk:
      chunks\targets  0   1
                     --- ---
            0         1   0
            1         1   0
            2         1   0
            3         1   0
            4         1   0
            5         1   0
            6         1   0
            7         0   1
            8         0   1
            9         0   1
           10         0   1

    Summary for targets across chunks
      targets  mean   std   min  max  #chunks
        0      0.636  0.481   0    1     7
        1      0.364  0.481   0    1     4

    Summary for chunks across targets
      chunks  mean  std  min  max  #targets
        0     0.5   0.5   0    1      1
        1     0.5   0.5   0    1      1
        2     0.5   0.5   0    1      1
        3     0.5   0.5   0    1      1
        4     0.5   0.5   0    1      1
        5     0.5   0.5   0    1      1
        6     0.5   0.5   0    1      1
        7     0.5   0.5   0    1      1
        8     0.5   0.5   0    1      1
        9     0.5   0.5   0    1      1
       10     0.5   0.5   0    1      1

    Sequence statistics for 11 entries from set [0, 1]
    Counter-balance table for orders up to 2:
    Targets/Order  O1     |  O2     |
          0:        6  1  |   5  2  |
          1:        0  3  |   0  2  |
    Correlations: min=-0.57 max=0.61 mean=-0.1 sum(abs)=4.3

As you can see, the number of participants per group is imbalanced. For a classification using an SVM with a searchlight I want to partition the data into balanced groups.
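To make explicit what I mean by "balanced", here is a minimal sketch in plain Python (no PyMVPA involved) of the splits I would expect. The helper name expected_splits and the fixed random seed are made up for illustration, and the chunk-to-target mapping is copied by hand from the summary above. Each split should test on one chunk per target and train on an equal number of chunks per target, which for this dataset gives 7 x 4 = 28 splits with 3 + 3 training chunks each.

```python
from __future__ import print_function  # keeps print() usable on Python 2 as well

import itertools
import random

# Chunk -> target mapping copied from the summary above
# (chunks 0-6 have target 0, chunks 7-10 have target 1).
targets = {c: 0 for c in range(7)}
targets.update({c: 1 for c in range(7, 11)})


def expected_splits(targets, seed=0):
    """Yield (testing, training) lists of (chunk, target) pairs.

    Hypothetical helper, not part of PyMVPA: testing holds one chunk per
    target, training is randomly subsampled to equal counts per target.
    """
    rng = random.Random(seed)
    by_target = {}
    for chunk, target in sorted(targets.items()):
        by_target.setdefault(target, []).append(chunk)
    labels = sorted(by_target)
    # One test chunk per target: every combination across the targets.
    for test_chunks in itertools.product(*(by_target[t] for t in labels)):
        remaining = {t: [c for c in by_target[t] if c not in test_chunks]
                     for t in labels}
        # Subsample every target down to the size of the smallest one.
        n = min(len(chunks) for chunks in remaining.values())
        training = []
        for t in labels:
            training.extend(sorted(rng.sample(remaining[t], n)))
        testing = [(c, targets[c]) for c in test_chunks]
        yield testing, [(c, targets[c]) for c in training]


splits = list(expected_splits(targets))
for testing, training in splits[:2]:
    print('Testing:', testing)
    print('Training:', training)
```

Which training chunks get dropped depends on the seed, so only the per-target counts matter here, not the specific chunks.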
For this, I use the following partitioner:

    npart = ChainNode([
        NFoldPartitioner(len(DS_noisy.sa['targets'].unique), attr='chunks'),
        Sifter([('partitions', 2),
                ('targets', {'uvalues': DS_noisy.sa['targets'].unique,
                             'balanced': True})]),
        Balancer(attr='targets', count=1, limit='partitions', apply_selection=True)
    ], space='partitions')

Finally, I look at the partitions to check if everything is fine:

    for ds_ in npart.generate(DS_noisy):
        print('A new split:')
        print('Testing:')
        testing = DS_noisy[ds_.sa.partitions == 2]
        print list(zip(testing.sa.chunks, testing.sa.targets))
        print('Training:')
        training = DS_noisy[ds_.sa.partitions == 1]
        print list(zip(training.sa.chunks, training.sa.targets))

The output baffles me:

    A new split:
    Testing:
    [(0, 0), (4, 0)]
    Training:
    [(1, 0), (2, 0), (3, 0), (5, 0), (6, 0), (7, 1)]
    A new split:
    Testing:
    [(0, 0), (5, 0)]
    Training:
    [(1, 0), (2, 0), (3, 0), (4, 0), (6, 0), (7, 1)]
    A new split:
    Testing:
    [(0, 0), (6, 0)]
    Training:
    [(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (7, 1)]
    ...

Problem: The targets in the training data are far from being balanced (in fact, it only ever uses the (7, 1) combination, the remainder are always target 0 items). Also: shouldn't the two test items always be one target 0 and one target 1 item?

Am I doing something wrong or am I looking at it the wrong way? Any help is appreciated!

Best,
Ulrike

--
Max Planck Institute for Human Cognitive and Brain Sciences
Department of Neuropsychology (A219)
Stephanstraße 1a
04103 Leipzig
Phone: +49 (0) 341 9940 2625
Mail: [email protected]
Internet: http://www.cbs.mpg.de/staff/kuhl-12160

