Re: [Rdkit-discuss] GIL Lock in BulkTanimotoSimilarity

David Cosgrove Wed, 26 Oct 2022 14:47:51 -0700

Thanks for the reference. That sort of bounds screening would probably work
well in the C++ layer for the bulk similarity functions.  My initial
experiments without bounds screening found that doing individual similarity
calculations in Python was a lot slower than the bulk function because
moving from the Python to the C++ layer has quite a large overhead. It
might be worth doing the screening in Python and I will try it but I
suspect that the overhead will cancel out any gain brought by the screening
process.
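
A minimal sketch of what that Python-side screening might look like,
assuming the simplest popcount bound from the paper referenced below: two
fingerprints with a and b bits set cannot have a Tanimoto similarity above
min(a, b) / max(a, b), so any pair whose bound is under the threshold can
be skipped without computing the similarity. The fingerprints here are
plain Python ints used as bit sets, and the names popcount, tanimoto and
screened_search are hypothetical stand-ins, not RDKit functions. Treating
two empty fingerprints as similarity 0.0 is also just an assumption.

```python
def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two int-encoded fingerprints."""
    union = popcount(a | b)
    # Assumption: two empty fingerprints are given similarity 0.0.
    return popcount(a & b) / union if union else 0.0


def screened_search(query: int, fps: list, threshold: float):
    """Return (hits, skipped): hits is a list of (index, similarity) with
    similarity >= threshold; skipped counts the fingerprints rejected by
    the popcount bound alone."""
    q_bits = popcount(query)
    hits = []
    skipped = 0
    for idx, fp in enumerate(fps):
        f_bits = popcount(fp)
        lo, hi = sorted((q_bits, f_bits))
        # Upper bound on the Tanimoto similarity from bit counts only.
        bound = lo / hi if hi else 0.0
        if bound < threshold:
            skipped += 1
            continue
        sim = tanimoto(query, fp)
        if sim >= threshold:
            hits.append((idx, sim))
    return hits, skipped


hits, skipped = screened_search(0b1111, [0b1111, 0b0001, 0b0111, 0b11111111], 0.7)
```

With RDKit bit vectors the bit count would presumably come from
GetNumOnBits(), and each call that survives the screen would still pay the
Python-to-C++ crossing cost mentioned above, which is why the gain is
uncertain.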


On Wed, 26 Oct 2022 at 04:17, S Joshua Swamidass <[email protected]>
wrote:

>
> I wonder if there is a way to make use of PyTorch or TensorFlow to do this
> on a GPU. That’s where some big speed-ups might be found.
>
> Also, consider using these bounds. They do make a big difference in many
> cases.
>
> https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2527184/
>
>
> On Tue, Oct 25, 2022 at 8:57 PM Francois Berenger <[email protected]>
> wrote:
>
>> On 24/10/2022 19:47, David Cosgrove wrote:
>> > For the record, I have attempted this, but got only a marginal
>> > speed-up (130% of CPU used, with any number of threads above 2).  The
>> > procedure I used was to extract the fingerprint pointers into a
>> > std::vector, create a std::vector for the results, unlock the GIL to
>> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
>> > results from the std::vector into the python::list for output.  I guess
>> > the extra overhead to create and populate the additional std::vectors
>> > destroyed any potential speedup.  This was on a vector of 200K
>> > fingerprints, which suggests that the Tanimoto calculation is a small
>> > part of the overall time.  It doesn't seem worth pursuing further.
>>
>> There is probably code on github doing this in parallel already.
>> Think about it: any clustering algorithm using a distance matrix.
>> I guess many people want to initialize the Gram matrix in parallel.
>>
>> I wouldn't be surprised if, for example, chemfp has such code.
>>
>> > Dave
>> >
>> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
>> > <[email protected]> wrote:
>> >
>> >> Hi Greg,
>> >> Thanks for the pointer. I’ll take a look. If it could go in the
>> >> next patch release that would be really useful.
>> >> Dave
>> >>
>> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <[email protected]>
>> >> wrote:
>> >>
>> >> Hi Dave,
>> >>
>> >> We have multiple examples of this in the code, here’s one:
>> >>
>> >> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>> >>
>> >> I’m not sure how this would interact with the call to
>> >> python::extract that’s in the bulk functions, though.
>> >>
>> >> It might be better to handle the multithreading on the C++ side by
>> >> adding an optional nThreads argument to the bulk similarity
>> >> functions. (Though this would have to wait for the next release
>> >> since it’s a feature addition… we can declare releasing the GIL
>> >> as a bug fix)
>> >>
>> >> -greg
>> >>
>> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
>> >> <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I'm doing a lot of tanimoto similarity calculations on large
>> >> datasets using BulkTanimotoSimilarity.  It is an obvious candidate
>> >> for parallelisation, so I am using concurrent.futures to do so.  If
>> >> I use ProcessPoolExecutor, I get good speed-up but each process
>> >> needs a copy of the fingerprint set and for the sizes I'm dealing
>> >> with that uses too much memory.  With ThreadPoolExecutor I only need
>> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
>> >> thread at a time so there's no gain.  Would it be possible to amend
>> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
>> >> calculation, and recapture it afterwards?  I understand things like
>> >> numpy do this for some of their functions.  I'm happy to attempt it
>> >> myself if someone who knows about these things can advise that it
>> >> could be done, it would help, and they could provide a few pointers.
>> >>
>> >> Thanks,
>> >> Dave
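
The thread-pool arrangement described in the question above can be sketched
roughly as follows: one shared list of fingerprints, split into index
ranges, with each range handed to a worker thread. This is only an
illustration under stated assumptions. bulk_tanimoto is a hypothetical
pure-Python stand-in operating on int-encoded fingerprints; with RDKit the
worker would call DataStructs.BulkTanimotoSimilarity(query, chunk) instead.
With the stand-in the threads still run one at a time because of the GIL,
so any real speed-up depends on the C++ function releasing it.

```python
from concurrent.futures import ThreadPoolExecutor


def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as a Python int."""
    return bin(fp).count("1")


def bulk_tanimoto(query: int, fps: list) -> list:
    """Hypothetical stand-in for a bulk similarity call (not RDKit)."""
    sims = []
    for fp in fps:
        union = popcount(query | fp)
        sims.append(popcount(query & fp) / union if union else 0.0)
    return sims


def threaded_bulk_tanimoto(query: int, fps: list, n_threads: int = 4) -> list:
    """Split fps into contiguous chunks and score each chunk in a thread.

    Slicing copies only the list of references, so every thread works on
    the same underlying fingerprint objects."""
    chunk = max(1, -(-len(fps) // n_threads))  # ceiling division
    ranges = [(s, min(s + chunk, len(fps))) for s in range(0, len(fps), chunk)]

    def work(bounds):
        start, stop = bounds
        return bulk_tanimoto(query, fps[start:stop])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map yields results in submission order, so the output
        # lines up with the input fingerprints.
        return [sim for part in pool.map(work, ranges) for sim in part]


fps = [0b1111, 0b0001, 0b0111, 0b11111111, 0]
sims = threaded_bulk_tanimoto(0b1111, fps, n_threads=2)
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor in a layout like this
is what forces each worker to receive its own copy of the fingerprints,
which is the memory cost the question is trying to avoid.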
>> >>
>> >> --
>> >>
>> >> David Cosgrove
>> >> Freelance computational chemistry and chemoinformatics developer
>> >> http://cozchemix.co.uk
>> >>
>> >> _______________________________________________
>> >> Rdkit-discuss mailing list
>> >> [email protected]
>> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
