For the record, I have attempted this, but got only a marginal speed-up
(130% CPU usage with any number of threads above 2).  The procedure I
used was to extract the fingerprint pointers into a std::vector, create a
std::vector for the results, release the GIL to do the bulk Tanimoto
calculation, then reacquire the GIL to copy the results from the std::vector
into the python::list for output.  I guess the extra overhead of creating and
populating the additional std::vectors cancelled out any potential speed-up.
This was on a vector of 200K fingerprints, which suggests that the Tanimoto
calculation itself is a small part of the overall time.  It doesn't seem
worth pursuing further.
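
For anyone who wants to reproduce the threaded set-up being discussed, here
is a minimal sketch of the calling pattern.  It is a pure-Python stand-in,
not RDKit code: fingerprints are plain ints used as bit vectors, and
tanimoto, bulk_tanimoto and threaded_bulk_tanimoto are names made up for
this example.  With RDKit, the call to bulk_tanimoto would presumably be
replaced by DataStructs.BulkTanimotoSimilarity.  Because this version is
pure Python, it holds the GIL throughout, so it shows the structure only
and will not demonstrate any speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def tanimoto(a, b):
    """Tanimoto similarity of two bit vectors stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        # Two empty vectors: returning 0.0 is a choice made for this sketch.
        return 0.0
    return bin(a & b).count("1") / union


def bulk_tanimoto(query, fps):
    """Similarity of one query against every fingerprint (hypothetical helper)."""
    return [tanimoto(query, fp) for fp in fps]


def threaded_bulk_tanimoto(queries, fps, n_threads=2):
    """Run one bulk calculation per query on a thread pool.

    All threads share the single fps list, so only one copy of the
    fingerprints is held in memory.  Results come back in query order.
    """
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda q: bulk_tanimoto(q, fps), queries))


if __name__ == "__main__":
    fps = [0b1011, 0b0110, 0b1111]
    for row in threaded_bulk_tanimoto([0b1011, 0b0001], fps):
        print(row)
```

Whether threads help in the real case depends on how much of each call runs
with the GIL released, which is the part that turned out to be small here.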

Dave


On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove <davidacosgrov...@gmail.com>
wrote:

> Hi Greg,
> Thanks for the pointer. I’ll take a look. If it could go in the next patch
> release that would be really useful.
> Dave
>
>
> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com> wrote:
>
>>
>> Hi Dave,
>>
>> We have multiple examples of this in the code, here’s one:
>>
>> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
>>
>> I’m not sure how this would interact with the call to python::extract
>> that’s in the bulk functions, though.
>>
>> It might be better to handle the multithreading on the C++ side by adding
>> an optional nThreads argument to the bulk similarity functions. (Though
>> this would have to wait for the next release since it’s a feature addition…
>> we can declare releasing the GIL as a bug fix.)
>>
>> -greg
>>
>>
>> On Sat, 22 Oct 2022 at 09:48, David Cosgrove <davidacosgrov...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm doing a lot of Tanimoto similarity calculations on large datasets
>>> using BulkTanimotoSimilarity.  It is an obvious candidate for
>>> parallelisation, so I am using concurrent.futures to do so.  If I use
>>> ProcessPoolExecutor, I get good speed-up but each process needs a copy of
>>> the fingerprint set and for the sizes I'm dealing with that uses too much
>>> memory.  With ThreadPoolExecutor I only need 1 copy of the fingerprints,
>>> but the GIL means it only runs on 1 thread at a time so there's no gain.
>>> Would it be possible to amend the C++ BulkTanimotoSimilarity to free the
>>> GIL whilst it's doing the calculation, and recapture it afterwards?  I
>>> understand things like numpy do this for some of their functions.  I'm
>>> happy to attempt it myself if someone who knows about these things can
>>> advise that it could be done, it would help, and they could provide a few
>>> pointers.
>>>
>>> Thanks,
>>> Dave
>>>
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
