Hi Marco, some ideas in random order
- what version of sklearn/joblib are you using? I would make sure to use the latest version (0.11), perhaps not importing it from sklearn (unless you have the latest sklearn version, 0.19.1) - are you running the code in a jupyter notebook? There might be some issues with that (see https://github.com/joblib/joblib/issues/174). As a test you might try to convert your notebook to a script and then run it On Thu, Nov 23, 2017 at 6:07 AM, marco tettamanti <mrcttt...@gmail.com> wrote: > Dear Matteo (and others), > sorry, I am again asking for your help! > > I have experimented with the analysis of my dataset using an adaptation of > your joblib-based gist. > As I wrote before, it works perfectly, but not with some classifiers: SVM > classifiers always cause the code to terminate with an error. > > If I set: > myclassif=clfswh['!gnpp','!skl','svm'] #Note that 'gnnp' and > 'skl' were excluded for independent reasons > the code runs through without errors. > > However, with: > myclassif=clfswh['!gnpp','!skl'] > I get the following error: > MaybeEncodingError: Error sending result: > '[TransferMeasure(measure=SVM(svm_impl='C_SVC', kernel=LinearLSKernel(), > weight=[], probability=1, > weight_label=[]), splitter=Splitter(space='partitions'), > postproc=BinaryFxNode(space='targets'), enable_ca=['stats'])]'. Reason: > 'TypeError("can't > pickle SwigPyObject objects",)' > > After googling for what may cause this particular error, I have found that > the situation improves slightly (i.e. more splits executed, sometimes even > all splits) by importing the following: > import os > from sklearn.externals.joblib import Parallel, delayed > from sklearn.externals.joblib.parallel import parallel_backend > and then specifying just before 'Parallel(n_jobs=2)': > with parallel_backend('threading'): > However, also in this case, the code invariably terminates with a long > error message (I only report an extract, but in case I can send the whole > error message): > <type 'str'>: (<type 'exceptions.UnicodeEncodeError'>, > UnicodeEncodeError('ascii', > u'JoblibAttributeError\n____________________________________ > _______________________________________\nMultiprocessing > exception:\n................................................ > ...........................\n/usr/lib/python2.7/runpy.py in > _run_module_as_main(mod_name=\'ipykernel_launcher\', > alter_argv=1)\n 169 pkg_name = mod_name.rpartition(\'.\')[0]\n > 170 > main_globals = sys.modules["__main__"].__dict__\n 171 if > alter_argv:\n 172 sys.argv[0] = fname\n 173 return > _run_code(code, > main_globals, None,\n--> 174 > > > I think I have sort of understood that the problem is due to some failure > in pickling the parallelized jobs, but I have no clues if and how it can be > solved. > Do you have any suggestions? > > Thank you and very best wishes, > Marco > > p.s. This is again the full code: > > > ########## * ########## > ########## > > PyMVPA: > Version: 2.6.3 > Hash: 9c07e8827819aaa79ff15d2db10c420a876d7785 > Path: /usr/lib/python2.7/dist-packages/mvpa2/__init__.pyc > Version control (GIT): > GIT information could not be obtained due > "/usr/lib/python2.7/dist-packages/mvpa2/.. > is not under GIT" > SYSTEM: > OS: posix Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2 > (2017-10-15) > > > print fds.summary() > Dataset: 36x534@float32, <sa: chunks,targets,time_coords,time_indices>, > <fa: voxel_indices>, <a: imgaffine,imghdr,imgtype, > mapper,voxel_dim,voxel_eldim> > stats: mean=0.548448 std=1.40906 var=1.98546 min=-5.41163 max=9.88639 > No details due to large number of targets or chunks. Increase maxc and > maxt if desired > Summary for targets across chunks > targets mean std min max #chunks > C 0.5 0.5 0 1 18 > D 0.5 0.5 0 1 18 > > > #Evaluate prevalent best classifier with nested crossvalidation > verbose.level = 5 > > partitionerCD = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'), > Sifter([('partitions', 2), ('targets', ['C', 'D'])])], space='partitions') > # training partitions > for fds_ in partitionerCD.generate(fds): > training = fds[fds_.sa.partitions == 1] > #print list(zip(training.sa.chunks, training.sa.targets)) > # testing partitions > for fds_ in partitionerCD.generate(fds): > testing = fds[fds_.sa.partitions == 2] > #print list(zip(testing.sa.chunks, testing.sa.targets)) > > #Helper function (partitionerCD recursively acting on dstrain, rather than > on fds): > def select_best_clf(dstrain_, clfs): > """Select best model according to CVTE > Helper function which we will use twice -- once for proper nested > cross-validation, and once to see how big an optimistic bias due > to model selection could be if we simply provide an entire dataset. > Parameters > ---------- > dstrain_ : Dataset > clfs : list of Classifiers > Which classifiers to explore > Returns > ------- > best_clf, best_error > """ > best_error = None > for clf in clfs: > cv = CrossValidation(clf, partitionerCD) > # unfortunately we don't have ability to reassign clf atm > # cv.transerror.clf = clf > try: > error = np.mean(cv(dstrain_)) > except LearnerError, e: > # skip the classifier if data was not appropriate and it > # failed to learn/predict at all > continue > if best_error is None or error < best_error: > best_clf = clf > best_error = error > verbose(4, "Classifier %s cv error=%.2f" % (clf.descr, error)) > verbose(3, "Selected the best out of %i classifiers %s with error %.2f" > % (len(clfs), best_clf.descr, best_error)) > return best_clf, best_error > > # This function will run all classifiers for one single partitions > myclassif=clfswh['!gnpp','!skl'][5:6] #Testing a single SVM classifier > def _run_one_partition(isplit, partitions, classifiers=myclassif): #see ยงยง > verbose(2, "Processing split #%i" % isplit) > dstrain, dstest = list(splitter.generate(partitions)) > best_clf, best_error = select_best_clf(dstrain, classifiers) > # now that we have the best classifier, lets assess its transfer > # to the testing dataset while training on entire training > tm = TransferMeasure(best_clf, splitter,postproc= > BinaryFxNode(mean_mismatch_error,space='targets'), enable_ca=['stats']) > tm(partitions) > return tm > > #import os > #from sklearn.externals.joblib import Parallel, delayed > #from sklearn.externals.joblib.parallel import parallel_backend > > # Parallel estimate error using nested CV for model selection > confusion = ConfusionMatrix() > verbose(1, "Estimating error using nested CV for model selection") > partitioner = partitionerCD > splitter = Splitter('partitions') > # Here we are using joblib Parallel to parallelize each partition > # Set n_jobs to the number of available cores (or how many you want to use) > #with parallel_backend('threading'): > # tms = Parallel(n_jobs=2)(delayed(_run_one_partition)(isplit, > partitions) > tms = Parallel(n_jobs=2)(delayed(_run_one_partition)(isplit, partitions) > for isplit, partitions in enumerate(partitionerCD. > generate(fds))) > # Parallel retuns a list with the results of each parallel loop, so we > need to > # unravel it to get the confusion matrix > best_clfs = {} > for tm in tms: > confusion += tm.ca.stats > best_clfs[tm.measure.descr] = best_clfs.get(tm.measure.descr, 0) + 1 > > ########## > ########## * ########## > > > > > > > > > On 13/11/2017 09:12, marco tettamanti wrote: > > Dear Matteo, > grazie mille, this is precisely the kind of thing I was looking for: it > works like charm! > Ciao, > Marco > > On 11/11/2017 21:44, Matteo Visconti di Oleggio Castello wrote: > > Hi Marco, > > in your case, I would then recommend looking into joblib to parallelize > your for loops (https://pythonhosted.org/joblib/parallel.html). > > As an example, here's a gist containing part of the PyMVPA's nested_cv > example where I parallelized the loop across partitions. I feel this is > what you might want to do in your case, since you have a lot more folds. > > Here's the gist: > https://gist.github.com/mvdoc/0c2574079dfde78ea649e7dc0a3feab0 > > > On 10/11/2017 21:13, marco tettamanti wrote: > > Dear Matteo, > thank you for the willingness to look into my code. > > This is taken almost verbatim from http://dev.pymvpa.org/ > examples/nested_cv.html, except for the leave-one-pair-out partitioning, > and a slight reduction in the number of classifiers (in the original > example, they are around 45). > > Any help or suggestion would be greatly appreciated! > All the best, > Marco > > > ########## * ########## > ########## > > PyMVPA: > Version: 2.6.3 > Hash: 9c07e8827819aaa79ff15d2db10c420a876d7785 > Path: /usr/lib/python2.7/dist-packages/mvpa2/__init__.pyc > Version control (GIT): > GIT information could not be obtained due > "/usr/lib/python2.7/dist-packages/mvpa2/.. > is not under GIT" > SYSTEM: > OS: posix Linux 4.13.0-1-amd64 #1 SMP Debian 4.13.4-2 > (2017-10-15) > > > print fds.summary() > Dataset: 36x534@float32, <sa: chunks,targets,time_coords,time_indices>, > <fa: voxel_indices>, <a: imgaffine,imghdr,imgtype, > mapper,voxel_dim,voxel_eldim> > stats: mean=0.548448 std=1.40906 var=1.98546 min=-5.41163 max=9.88639 > No details due to large number of targets or chunks. Increase maxc and > maxt if desired > Summary for targets across chunks > targets mean std min max #chunks > C 0.5 0.5 0 1 18 > D 0.5 0.5 0 1 18 > > > #Evaluate prevalent best classifier with nested crossvalidation > verbose.level = 5 > > partitionerCD = ChainNode([NFoldPartitioner(cvtype=2, attr='chunks'), > Sifter([('partitions', 2), ('targets', ['C', 'D'])])], space='partitions') > # training partitions > for fds_ in partitionerCD.generate(fds): > training = fds[fds_.sa.partitions == 1] > #print list(zip(training.sa.chunks, training.sa.targets)) > # testing partitions > for fds_ in partitionerCD.generate(fds): > testing = fds[fds_.sa.partitions == 2] > #print list(zip(testing.sa.chunks, testing.sa.targets)) > > #Helper function (partitionerCD recursively acting on dstrain, rather than > on fds): > def select_best_clf(dstrain_, clfs): > """Select best model according to CVTE > Helper function which we will use twice -- once for proper nested > cross-validation, and once to see how big an optimistic bias due > to model selection could be if we simply provide an entire dataset. > Parameters > ---------- > dstrain_ : Dataset > clfs : list of Classifiers > Which classifiers to explore > Returns > ------- > best_clf, best_error > """ > best_error = None > for clf in clfs: > cv = CrossValidation(clf, partitionerCD) > # unfortunately we don't have ability to reassign clf atm > # cv.transerror.clf = clf > try: > error = np.mean(cv(dstrain_)) > except LearnerError, e: > # skip the classifier if data was not appropriate and it > # failed to learn/predict at all > continue > if best_error is None or error < best_error: > best_clf = clf > best_error = error > verbose(4, "Classifier %s cv error=%.2f" % (clf.descr, error)) > verbose(3, "Selected the best out of %i classifiers %s with error %.2f" > % (len(clfs), best_clf.descr, best_error)) > return best_clf, best_error > > #Estimate error using nested CV for model selection: > best_clfs = {} > confusion = ConfusionMatrix() > verbose(1, "Estimating error using nested CV for model selection") > partitioner = partitionerCD > splitter = Splitter('partitions') > for isplit, partitions in enumerate(partitionerCD.generate(fds)): > verbose(2, "Processing split #%i" % isplit) > dstrain, dstest = list(splitter.generate(partitions)) > best_clf, best_error = select_best_clf(dstrain, clfswh['!gnpp','!skl']) > best_clfs[best_clf.descr] = best_clfs.get(best_clf.descr, 0) + 1 > # now that we have the best classifier, lets assess its transfer > # to the testing dataset while training on entire training > tm = TransferMeasure(best_clf, splitter, > postproc=BinaryFxNode(mean_mismatch_error, > space='targets'), enable_ca=['stats']) > tm(partitions) > confusion += tm.ca.stats > > ########## > ########## * ########## > > > > > > > On 10/11/2017 15:43, Matteo Visconti di Oleggio Castello wrote: > > What do you mean with "cycling over approx 40 different classifiers"? Are > you testing different classifiers? If that's the case, a possibility is to > create a script that takes as argument the type of classifiers and runs the > classification across all folds. In that way you can submit 40 jobs and > parallelize across classifiers. > > If that's not the case, because the folds are independent and deterministic > I would create a script that performs the classification on blocks of folds > (say fold 1 to 30, 31, to 60, etc...), and then submit different jobs, so > to parallelize there. > > I think that if you send a snippet of the code you're using it can be more > evident which are good points for parallelization. > > > On 10/11/2017 09:57, marco tettamanti wrote: > > Dear Matteo and Nick, > thank you for your responses. > I take the occasion to ask some follow-up questions, because I am struggling > to > make pymvpa2 computations faster and more efficient. > > I often find myself in the situation of giving up with a particular analysis, > because it is going to take far more time that I can bear (weeks, months!). > This > happens particularly with searchlight permutation testing (gnbsearchlight is > much faster, but does not support pprocess), and nested cross-validation. > As for the latter, for example, I recently wanted to run nested > cross-validation > in a sample of 18 patients and 18 controls (1 image x subject), training the > classifiers to discriminate patients from controls in a leave-one-pair-out > partitioning scheme. This yields 18*18=324 folds. For a small ROI of 36 > voxels, > cycling over approx 40 different classifiers takes about 2 hours for each fold > on a decent PowerEdge T430 Dell server with 128GB RAM. This means approx. 27 > days for all 324 folds! > The same server is equipped with 32 CPUs. With full parallelization, the same > analysis may be completed in less than one day. This is the reason of my > interest and questions about parallelization. > > Is there anything that you experts do in such situations to speed up or make > the > computation more efficient? > > Thank you again and best wishes, > Marco > > > > On 10/11/2017 10:07, Nick Oosterhof wrote: > > There have been some plans / minor attempts for using parallelisation more > parallel, but as far as I know we only support pprocces, and only for (1) > searchlight; (2) surface-based voxel selection; and (3) hyperalignment. I > do remember that parallelisation of other functions was challenging due to > some getting the conditional attributes set right, but this is long time > ago. > > > On 09/11/2017 18:35, Matteo Visconti di Oleggio Castello wrote: > > Hi Marco, > AFAIK, there is no support for parallelization at the level of > cross-validation. Usually for a small ROI (such a searchlight) and with > standard CV schemes, the process is quite fast, and the bottleneck is > really the number of searchlights to be computed (for which parallelization > exists). > > In my experience, we tend to parallelize at the level of individual > participants; for example we might set up a searchlight analysis with > however n_procs you can have, and then submit one such job for every > participant to a cluster (using either torque or condor). > > HTH, > Matteo > > On 09/11/2017 10:08, marco tettamanti wrote: > > Dear all, > forgive me if this has already been asked in the past, but I was wondering > whether there has been any development meanwhile. > > Are there any chances that one can generally apply parallel computing > (multiple > CPUs or clusters) with pymvpa2, in addition to what is already implemented for > searchlight (pprocess)? That is, also for general cross-validation, nested > cross-validation, permutation testing, RFE, etc.? > > Has anyone had succesful experience with parallelization schemes such as > ipyparallel, condor or else? > > Thank you and best wishes! > Marco > > > > -- > Marco Tettamanti, Ph.D. > Nuclear Medicine Department & Division of Neuroscience > IRCCS San Raffaele Scientific Institute > Via Olgettina 58 > I-20132 Milano, Italy > Phone ++39-02-26434888 <+39%2002%202643%204888> > Fax ++39-02-26434892 <+39%2002%202643%204892> > Email: tettamanti.ma...@hsr.it > Skype: mtettamantihttp://scholar.google.it/citations?user=x4qQl4AAAAAJ > > > > > _______________________________________________ > Pkg-ExpPsy-PyMVPA mailing list > Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org > http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa > -- Matteo Visconti di Oleggio Castello Ph.D. Candidate in Cognitive Neuroscience Dartmouth College +1 (603) 646-8665 mvdoc.me || github.com/mvdoc || linkedin.com/in/matteovisconti
_______________________________________________ Pkg-ExpPsy-PyMVPA mailing list Pkg-ExpPsy-PyMVPA@lists.alioth.debian.org http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/pkg-exppsy-pymvpa