For the record, I have attempted this, but got only a marginal speed-up (about 130% CPU usage, for any number of threads above 2). The procedure I used was to extract the fingerprint pointers into a std::vector, create a second std::vector for the results, release the GIL to do the bulk Tanimoto calculation, then re-acquire the GIL to copy the results from the std::vector into the python::list for output. I guess the extra overhead of creating and populating the additional std::vectors destroyed any potential speed-up. This was on a vector of 200K fingerprints, which suggests that the Tanimoto calculation itself is a small part of the overall time. It doesn't seem worth pursuing further.
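For anyone wanting to repeat the timing, here is a minimal sketch of the threaded set-up from the original question, with a single shared fingerprint list split into chunks across a ThreadPoolExecutor. The names `tanimoto`, `bulk_tanimoto` and `threaded_bulk_tanimoto` are hypothetical, and the fingerprints are plain Python ints used as bitsets so that the example runs without RDKit. With RDKit one would presumably call DataStructs.BulkTanimotoSimilarity(query, chunk) in place of `bulk_tanimoto`. Being pure Python, this stand-in holds the GIL throughout, so it shows the structure only and will not show any speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bitsets)."""
    union = bin(a | b).count("1")
    if union == 0:
        # Convention chosen for this sketch only: two empty fingerprints score 0.0.
        return 0.0
    return bin(a & b).count("1") / union


def bulk_tanimoto(query, fps):
    # Hypothetical pure-Python stand-in for a bulk similarity call.
    return [tanimoto(query, fp) for fp in fps]


def threaded_bulk_tanimoto(query, fps, n_threads=4):
    # Split the shared fingerprint list into contiguous chunks, roughly one per thread.
    chunk = max(1, -(-len(fps) // n_threads))  # ceiling division
    chunks = [fps[i:i + chunk] for i in range(0, len(fps), chunk)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() yields results in chunk order, so scores line up with fps.
        parts = pool.map(lambda c: bulk_tanimoto(query, c), chunks)
        return [score for part in parts for score in part]


if __name__ == "__main__":
    fingerprints = [0b1111, 0b0011, 0b1100, 0b0000, 0b1010]
    print(threaded_bulk_tanimoto(0b0011, fingerprints, n_threads=2))
```

Only one copy of the fingerprint list is needed here, since the list slices copy references rather than the fingerprints themselves. That is the memory advantage over ProcessPoolExecutor mentioned in the original question.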
Dave

On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove <davidacosgrov...@gmail.com> wrote:

> Hi Greg,
> Thanks for the pointer. I’ll take a look. If it could go in the next patch
> release that would be really useful.
> Dave
>
> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com> wrote:
>
>> Hi Dave,
>>
>> We have multiple examples of this in the code, here’s one:
>>
>> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>>
>> I’m not sure how this would interact with the call to python::extract
>> that’s in the bulk functions though
>>
>> It might be better to handle the multithreading on the C++ side by adding
>> an optional nThreads argument to the bulk similarity functions. (Though
>> this would have to wait for the next release since it’s a feature addition…
>> we can declare releasing the GIL as a bug fix)
>>
>> -greg
>>
>> On Sat, 22 Oct 2022 at 09:48, David Cosgrove <davidacosgrov...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm doing a lot of Tanimoto similarity calculations on large datasets
>>> using BulkTanimotoSimilarity. It is an obvious candidate for
>>> parallelisation, so I am using concurrent.futures to do so. If I use
>>> ProcessPoolExecutor, I get good speed-up but each process needs a copy of
>>> the fingerprint set and for the sizes I'm dealing with that uses too much
>>> memory. With ThreadPoolExecutor I only need 1 copy of the fingerprints,
>>> but the GIL means it only runs on 1 thread at a time so there's no gain.
>>> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the
>>> GIL whilst it's doing the calculation, and recapture it afterwards? I
>>> understand things like numpy do this for some of their functions. I'm
>>> happy to attempt it myself if someone who knows about these things can
>>> advise that it could be done, it would help, and they could provide a few
>>> pointers.
>>>
>>> Thanks,
>>> Dave
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss