> This should run 5 evaluations, using 1/5 of the available data each
> time to test the classifier. Correct?
correct in that it should generate for you 5 partitions, where in the
first one you would obtain the first nsamples/5 samples (and the
corresponding "chunks", unique per sample in your case)

> Now, for this to work properly, it requires that targets are properly
> randomly distributed in the dataset...

well... theoretically speaking, if you have lots of samples, you might
get away with doing classical leave-one-out cross-validation.  That
would be implemented by using NFoldPartitioner on your dataset (i.e.
without NGroupPartitioner).  But it would take a while to do such a
cross-validation -- might not be desirable unless coded explicitly for
it (e.g. for SVMs, either using CachedKernel to avoid recomputation of
kernels, or even more trickery...)

> for instance if the last 1/5 of the samples only contain target 2,
> then it won't work...

yeap -- that is the catch ;)  you could use NFoldPartitioner(cvtype=2),
which would combine all possible combinations of 2 chunks, with a
consecutive Sifter (recently introduced) to keep only those partitions
which carry labels from both classes, but, once again, it would be A
LOT to cross-validate (roughly (nsamples/2)^2 partitions), so I guess
not a solution for you either

> What do you suggest to solve this problem?

If you have some certainty that samples are independent, then to get a
reasonable generalization estimate, just assign np.arange(nsamples/2)
(assuming initially balanced classes) as chunks to the samples of each
condition.  Then in each chunk you would be guaranteed to have a pair
of conditions ;)  And then you are welcome to use NGroupPartitioner to
bring the number of partitions down to some more cost-effective number,
e.g. 10.

> I have tried to use a ChainNode, chaining the NGroupPartitioner and a
> Balancer but it didn't work,

if I see it right -- it should have worked, unless you had a really
degenerate case, e.g. one of the partitions contained samples of only 1
category.

> apparently due to a bug in Balancer (see another mail on that one).
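To make the chunk-assignment suggestion concrete, here is a minimal
pure-numpy sketch (no PyMVPA needed; the sample count and interleaved
targets are made-up placeholders).  Each sample of each condition gets
its own chunk id via np.arange(nsamples/2), so every chunk holds one
sample per condition; the chunks are then grouped into 10 partitions,
mimicking what NGroupPartitioner would produce:

```python
import numpy as np

# Hypothetical setup: 100 independent samples, 2 conditions interleaved.
nsamples = 100
targets = np.tile([0, 1], nsamples // 2)

# Assign chunks so each chunk holds exactly one sample per condition:
# within each condition, samples get chunk ids 0 .. nsamples/2 - 1.
chunks = np.zeros(nsamples, dtype=int)
for t in np.unique(targets):
    chunks[targets == t] = np.arange(nsamples // 2)

# Group the nsamples/2 chunks into 10 partitions (roughly what
# NGroupPartitioner would do) and verify each carries both targets.
ngroups = 10
groups = chunks * ngroups // (nsamples // 2)
for g in range(ngroups):
    assert set(targets[groups == g]) == {0, 1}
```

Since every partition is now guaranteed to contain both classes, a
plain cross-validation over these 10 groups cannot hit the degenerate
single-class case described above.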
oops -- need to check emails then...

> My main question though is: it seems weird to add chunks attribute
> like this. Is it the correct way?

well... if you consider your samples independent from each other, then
yes -- it is reasonable to assign each sample to a separate chunk.

> Btw, is there a way to pick at random 80% of the data (with equal
> number of samples for each target) for training and the remaining 20%
> for testing, and repeat this as many times as I want to obtain a
> consistent result?

although I think we haven't tried it, this should do:

  CrossValidation(clf,
                  Balancer(amount=0.8, limit=None, attr='targets',
                           count=3),
                  splitter=Splitter('balanced_set', [True, False]))

i.e. cross-validation taking 3 of those splits (raise 3 to the number
you like).  What we do here: we say to balance targets, take 80% and
mark them True, and the other 20% False.  Then we proceed to the
cross-validation.  That thing uses an actual Splitter which splits the
dataset into training/testing parts.  Usually such a splitter is not
specified, and is constructed by CrossValidation assuming operation on
partitions labeled 0, 1 (and possibly 2), usually provided by
Partitioners.  But now we want to split based on balanced_set -- and we
can do that, instructing it to take the 80% marked True for training,
and the rest (False) for testing.  limit=None is there to say not to
limit subsampling to any attribute (commonly chunks), so in this case
you don't even need to have chunks at all.

is that what you needed?

--
=------------------------------------------------------------------=
Keep in touch                                     www.onerussian.com
Yaroslav Halchenko                 www.ohloh.net/accounts/yarikoptic

_______________________________________________
Pkg-ExpPsy-PyMVPA mailing list
Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/pkg-exppsy-pymvpa
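P.S. for the archives: the balanced 80/20 scheme above can be
approximated in plain numpy if one just wants to see what
Balancer(amount=0.8, ...) plus Splitter('balanced_set', [True, False])
effectively does.  This is only an illustrative sketch -- the sample
count, seed, and the helper name balanced_split are made up, and the
real PyMVPA nodes carry the selection in a 'balanced_set' attribute
instead of returning a mask:

```python
import numpy as np

rng = np.random.RandomState(0)   # hypothetical seed, for reproducibility
targets = np.tile([0, 1], 50)    # 100 samples, balanced classes

def balanced_split(targets, amount=0.8, rng=rng):
    """Mark ~amount of each class True (training), the rest False."""
    selected = np.zeros(len(targets), dtype=bool)
    for t in np.unique(targets):
        idx = np.flatnonzero(targets == t)
        rng.shuffle(idx)
        selected[idx[:int(round(amount * len(idx)))]] = True
    return selected

for _ in range(3):               # count=3 repetitions
    train = balanced_split(targets)
    test = ~train
    # each class contributes 80% to training and 20% to testing
    assert all((targets[train] == t).sum() == 40 for t in (0, 1))
    assert all((targets[test] == t).sum() == 10 for t in (0, 1))
```

Because the selection is redrawn at random each repetition, averaging
the 3 (or more) test errors gives the repeated balanced-split estimate
the question asked about.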