Re: [pymvpa] Dataset with multidimensional feature vector per voxel

Ulrike Kuhl Thu, 19 Nov 2015 01:41:16 -0800

Dear Yaroslav, dear all,

hooray for simulations! :-)

I was not aware of the profound effect on classification performance if the 
groups are not perfectly balanced.
My tests using the 'npart' partitioner on clean and noisy test data showed the 
expected result (accuracy of 0.5 for non-signal voxels, 1 for the others). Cool!

Still, two questions remain:
a) Can I inspect what the individual partitions look like (i.e. which subject is 
additionally removed to make the groups balanced)?

b) How do I deal with groups that have a larger imbalance? I've tried with my 
dummy data already: If I feed a dataset with imbalanced group sizes into the 
classification with 'npart' partitioner the result is random classification at 
all voxels. 
In my original data I have more participants in the second group than in the 
first, so I would need to restrict the size of the second group given the size 
of the first for each partition. My idea was to take everyone from group 1 and 
randomly pick the same number of participants from group 2 - what's the best 
way to implement this? 
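
To make the idea concrete, here is a minimal plain-Python sketch of what I 
have in mind. The subject ids and the helper name pick_balanced are made up 
for illustration, and none of this is PyMVPA API:

```python
import random

def pick_balanced(group1, group2, rng):
    """Keep all of group1 and draw len(group1) members of group2 at random.

    Hypothetical helper: assumes group2 is at least as large as group1.
    """
    return list(group1), sorted(rng.sample(list(group2), len(group1)))

rng = random.Random(0)          # fixed seed so the draw is reproducible
group1 = list(range(0, 10))     # made-up ids: 10 subjects in group 1
group2 = list(range(10, 25))    # made-up ids: 15 subjects in group 2

kept1, kept2 = pick_balanced(group1, group2, rng)
print(len(kept1), len(kept2))
```

Repeating the draw with different seeds would give different subsets of 
group 2, so the classification could be rerun over several such draws.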

Thanks a lot!
Ulrike

----- Original Message -----
From: "Yaroslav Halchenko" <[email protected]>
To: "pkg-exppsy-pymvpa" <[email protected]>
Sent: Tuesday, 17 November, 2015 16:18:20
Subject: Re: [pymvpa] Dataset with multidimensional feature vector per voxel

On Tue, 17 Nov 2015, Ulrike Kuhl wrote:

> Here you go:

> print DS_clean.summary()

> Dataset: 20x3375@float32, <sa: chunks,subject,targets>, <fa: 
> modality,modality_index,voxel_indices>, <a: mapper>
> stats: mean=0.006 std=0.0704271 var=0.00495998 min=0 max=1

> Counts of targets in each chunk:
>   chunks\targets  0   1
>                  --- ---
>         0         1   0
>         1         1   0
>         2         1   0
>         3         1   0
>         4         1   0
>         5         1   0
>         6         1   0
>         7         1   0
>         8         1   0
>         9         1   0
>        10         0   1
>        11         0   1
>        12         0   1
>        13         0   1
>        14         0   1
>        15         0   1
>        16         0   1
>        17         0   1
>        18         0   1
>        19         0   1

So the problem is that each chunk holds only one sample, and overall you
have only 20 samples to train on.  Whenever you NFold partition it, you
end up training on 10 samples of one target and 9 of the other.  If there
is a clear signal, the error is minimized by labeling correctly.  If there
is no signal, the error is minimized by always saying "class with the
majority of samples (10 vs 9)", which then always misclassifies the
held-out sample, since it is of the opposite class.
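
A quick plain-Python sketch of this effect.  The majority-vote function is
only an assumed stand-in for a classifier that finds no signal; it is not
PyMVPA code:

```python
from collections import Counter

targets = [0] * 10 + [1] * 10  # 20 samples, one per chunk, as in DS_clean

def majority_label(labels):
    """Return the most frequent label (stand-in for a no-signal classifier)."""
    return Counter(labels).most_common(1)[0][0]

correct = 0
for held_out in range(len(targets)):
    # leave one sample out: 10 of the other class and 9 of its own remain
    training = targets[:held_out] + targets[held_out + 1:]
    prediction = majority_label(training)
    correct += int(prediction == targets[held_out])

accuracy = correct / float(len(targets))
print(accuracy)  # 0.0: the held-out sample is always in the minority class
```

So with no signal the result is not chance (0.5) but systematically below
it.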

The fun part: had you just run it on real data, most probably you would have
got some strong negative bias but possibly still some reasonable around-chance
performances... and then you would have scratched your head a lot.  So --
simulations rule! ;)

The simplest way to handle it: guarantee a balanced number of samples from
both categories in the training (and thus testing) splits.

There are two ways then to do it:

1. simplest but more ad hoc.  Regroup the samples so that each chunk pairs
one sample from each class; you end up with 10 chunks and thus 10 splits
if using NFold(1)

2. create a partitioner which would select all possible combinations
from the two classes, i.e. have 10*10=100 splits.

  Two ways to do it

  a. with the existing codebase something like this should work

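    from mvpa2.base.node import ChainNode
    from mvpa2.generators.partition import NFoldPartitioner
    from mvpa2.generators.base import Sifter

    # ds is your dataset (e.g. DS_clean)
    npart = ChainNode([
        # leave out as many chunks as there are target categories (2 here)
        NFoldPartitioner(len(ds.sa['targets'].unique),
                         attr='chunks'),
        ## so it should select only those splits where we took 1 from
        ## each of the targets categories, leaving things in balance
        Sifter([('partitions', 2),
                ('targets',
                 { 'uvalues': ds.sa['targets'].unique,
                   'balanced': True})
                ]),
    ], space='partitions')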

 In your case this will do NFold(2) across chunks, thus selecting every
combination of two chunks, but then keep only those splits (Sifter removes
the others) which have balanced targets.

 b. WiP
  https://github.com/PyMVPA/PyMVPA/pull/386
  to simplify this so it would look like

   factpart = FactorialPartitioner(
       NFoldPartitioner(attr='chunks'),
       attr='targets'
   )
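
To see which splits option (a) should keep, the selection logic can be
sketched in plain Python.  This only mimics the idea (every pair of chunks,
filtered to pairs with one sample per target) and does not use the PyMVPA
classes; the variable names are made up:

```python
from itertools import combinations

chunks = list(range(20))
targets = [0] * 10 + [1] * 10   # chunk i holds one sample with targets[i]

all_pairs = list(combinations(chunks, 2))   # what NFold(2) would generate
balanced_splits = []
for test_chunks in all_pairs:
    # keep the split only if the two held-out samples differ in target
    if sorted(targets[c] for c in test_chunks) == [0, 1]:
        train_chunks = [c for c in chunks if c not in test_chunks]
        balanced_splits.append((train_chunks, list(test_chunks)))

print(len(all_pairs), len(balanced_splits))
```

Of the 190 possible pairs, 100 are balanced, and each of those leaves 9
samples per target for training.  Printing balanced_splits shows exactly
which chunks (subjects) are held out in every split.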

N.B. Matteo -- one more testcase to test! ;)

-- 
Yaroslav O. Halchenko
Center for Open Neuroscience     http://centerforopenneuroscience.org
Dartmouth College, 419 Moore Hall, Hinman Box 6207, Hanover, NH 03755
Phone: +1 (603) 646-9834                       Fax: +1 (603) 646-1419
WWW:   http://www.linkedin.com/in/yarik        

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
[email protected]
http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa
-- 
Max Planck Institute for Human Cognitive and Brain Sciences 
Department of Neuropsychology (A219) 
Stephanstraße 1a 
04103 Leipzig 

Phone: +49 (0) 341 9940 2625 
Mail: [email protected] 
Internet: http://www.cbs.mpg.de/staff/kuhl-12160
