Thanks for the reference. That sort of bounds screening would probably work well in the C++ layer for the bulk similarity functions. My initial experiments, without bounds screening, found that doing individual similarity calculations in Python was a lot slower than calling the bulk function, because each call from Python into the C++ layer carries quite a large overhead. It might be worth doing the screening in Python, and I will try it, but I suspect that the per-call overhead will cancel out any gain from the screening.
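For anyone following along, here is a minimal pure-Python sketch of the kind of bit-count bound screening under discussion. It assumes the bound Tanimoto(A, B) <= min(|A|, |B|) / max(|A|, |B|), where |A| is the number of bits set. Plain Python ints stand in for fingerprints, and the names popcount, build_index and screened_search are made up for illustration; none of this is RDKit API, and it says nothing about how the bound would be wired into the C++ bulk functions.

```python
from bisect import bisect_left, bisect_right


def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Exact Tanimoto similarity between two int-encoded fingerprints."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def build_index(fps):
    """Sort fingerprint positions by bit count so a count window can be bisected."""
    order = sorted(range(len(fps)), key=lambda i: popcount(fps[i]))
    counts = [popcount(fps[i]) for i in order]
    return order, counts


def screened_search(query, fps, index, threshold):
    """Return (position, similarity) pairs with similarity >= threshold."""
    order, counts = index
    q = popcount(query)
    # Only bit counts in [q * threshold, q / threshold] can reach the
    # threshold. The small slack guards against floating-point rounding;
    # the exact check below makes a slightly wide window harmless.
    lo = bisect_left(counts, q * threshold - 1e-9)
    hi = bisect_right(counts, q / threshold + 1e-9) if threshold > 0 else len(counts)
    hits = []
    for i in order[lo:hi]:
        sim = tanimoto(query, fps[i])
        if sim >= threshold:
            hits.append((i, sim))
    return hits
```

The index is built once per fingerprint set and reused for every query, so the only per-query cost of the screen is two bisections. Whether that pays off in practice depends on the threshold and on how widely the bit counts are spread.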
On Wed, 26 Oct 2022 at 04:17, S Joshua Swamidass <swamid...@gmail.com> wrote:
>
> I wonder if there is a way to make use of PyTorch or tensorflow to do this
> on a GPU. That’s where some big speed ups might be found.
>
> Also, consider using these bounds. They do make a big difference in many
> cases.
>
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/
>
>
> On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger <mli...@ligand.eu>
> wrote:
>
>> On 24/10/2022 19:47, David Cosgrove wrote:
>> > For the record, I have attempted this, but got only a marginal
>> > speed-up (130% of CPU used, with any number of threads above 2). The
>> > procedure I used was to extract the fingerprint pointers into a
>> > std::vector, create a std::vector for the results, unlock the GIL to
>> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
>> > results from the std::vector into the python:list for output. I guess
>> > the extra overhead to create and populate the additional std::vectors
>> > destroyed any potential speedup. This was on a vector of 200K
>> > fingerprints, which suggests that the Tanimoto calculation is a small
>> > part of the overall time. It doesn't seem worth pursuing further.
>>
>> There is probably code on github doing this in parallel already.
>> Think about it: any clustering algorithm using a distance matrix.
>> I guess many people want to initialize the Gram matrix in parallel.
>>
>> I wouldn't be surprised if, for example, chemfp has such code.
>>
>> > Dave
>> >
>> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
>> > <davidacosgrov...@gmail.com> wrote:
>> >
>> >> Hi Greg,
>> >> Thanks for the pointer. I’ll take a look. If it could go in the
>> >> next patch release that would be really useful.
>> >> Dave
>> >>
>> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com>
>> >> wrote:
>> >>
>> >> Hi Dave,
>> >>
>> >> We have multiple examples of this in the code, here’s one:
>> >>
>> >> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>> >>
>> >> I’m not sure how this would interact with the call to
>> >> Python::extract that’s in the bulk functions though
>> >>
>> >> It might be better to handle the multithreading on the C++ side by
>> >> adding an optional nThreads argument to the bulk similarity
>> >> functions. (Though this would have to wait for the next release
>> >> since it’s a feature addition… we can declare releasing the GIL
>> >> as a bug fix)
>> >>
>> >> -greg
>> >>
>> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
>> >> <davidacosgrov...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing a lot of tanimoto similarity calculations on large
>> >> datasets using BulkTanimotoSimilarity. It is an obvious candidate
>> >> for parallelisation, so I am using concurrent.futures to do so. If
>> >> I use ProcessPoolExecutor, I get good speed-up but each process
>> >> needs a copy of the fingerprint set and for the sizes I'm dealing
>> >> with that uses too much memory. With ThreadPoolExecutor I only need
>> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
>> >> thread at a time so there's no gain. Would it be possible to amend
>> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
>> >> calculation, and recapture it afterwards? I understand things like
>> >> numpy do this for some of their functions. I'm happy to attempt it
>> >> myself if someone who knows about these things can advise that it
>> >> could be done, it would help, and they could provide a few pointers.
>> >>
>> >> Thanks,
>> >> Dave
>> >>
>> >> --
>> >>
>> >> David Cosgrove
>> >> Freelance computational chemistry and chemoinformatics developer
>> >> http://cozchemix.co.uk
>> >>
>> >> _______________________________________________
>> >> Rdkit-discuss mailing list
>> >> Rdkit-discuss@lists.sourceforge.net
>> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
--
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss