I would be very surprised if speed of fingerprint similarity was the
limiting factor on a distance- matrix-based clustering method. Normally
they are constrained by memory requirements.  In this case I am using the
MaxMin picker in RDKit to generate the cluster “centroids” and am wanting
to fill the clusters with the neighbours to each “centroid”. That is
obviously an embarrassingly parallel calculation, but the size of my
dataset means that using standard Python multi-processing takes up too much
memory because each parallel process needs its own copy of the fingerprint
set. I was wondering if releasing the GIL would make it possible to do it
in multithreaded mode with a single shared fingerprint set but the answer
was that it made little improvement on a single-thread run.


On Wed, 26 Oct 2022 at 02:32, Francois Berenger <mli...@ligand.eu> wrote:

> On 24/10/2022 19:47, David Cosgrove wrote:
> > For the record, I have attempted this, but got only a marginal
> > speed-up (130% of CPU used, with any number of threads above 2).  The
> > procedure I used was to extract the fingerprint pointers into a
> > std::vector, create a std::vector for the results, unlock the GIL to
> > do the bulk tanimoto calculation, then re-lock the GIL to copy the
> > results from the std::vector into the python:list for output.  I guess
> > the extra overhead to create and populate the additional std::vectors
> > destroyed any potential speedup.  This was on a vector of 200K
> > fingerprints, which suggests that the Tanimoto calculation is a small
> > part of the overall time.  It doesn't seem worth pursuing further.
>
> There is probably code on github doing this in parallel already.
> Think about it: any clustering algorithm using a distance matrix.
> I guess many people want to initialize the Gram matrix in parallel.
>
> I wouldn't be surprised if, for example, chemfp has such code.
>
> > Dave
> >
> > On Sat, Oct 22, 2022 at 11:28 AM David Cosgrove
> > <davidacosgrov...@gmail.com> wrote:
> >
> >> Hi Greg,
> >> Thanks for the pointer. I’ll take a look. If it could go in the
> >> next patch release that would be really useful.
> >> Dave
> >>
> >> On Sat, 22 Oct 2022 at 10:52, Greg Landrum <greg.land...@gmail.com>
> >> wrote:
> >>
> >> Hi Dave,
> >>
> >> We have multiple examples of this in the code, here’s one:
> >>
> >>
> >
> https://github.com/rdkit/rdkit/blob/b208da471f8edc88e07c77ed7d7868649ac75100/Code/GraphMol/ForceFieldHelpers/Wrap/rdForceFields.cpp#L40
> >>
> >> I’m not sure how this would interact with the call to
> >> Python::extract that’s in the bulk functions though
> >>
> >> It might be better to handle the multithreading on the C++ side by
> >> adding an optional nThreads argument to  the bulk similarity
> >> functions. (Though this would have to wait for the next release
> >> since it’s a feature addition… we can declare releasing the GIL
> >> as a bug fix)
> >>
> >> -greg
> >>
> >> On Sat, 22 Oct 2022 at 09:48, David Cosgrove
> >> <davidacosgrov...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm doing a lot of tanimoto similarity calculations on large
> >> datasets using BulkTanimotoSimilarity.  It is an obvious candidate
> >> for parallelisation, so I am using concurrent.futures to do so.  If
> >> I use ProcessPoolExectuor, I get good speed-up but each process
> >> needs a copy of the fingerprint set and for the sizes I'm dealing
> >> with that uses too much memory.  With ThreadPoolExecutor I only need
> >> 1 copy of the fingerprints, but the GIL means it only runs on 1
> >> thread at a time so there's no gain.  Would it be possible to amend
> >> the C++ BulkTanimotoSimilarity to free the GIL whilst it's doing the
> >> calculation, and recapture it afterwards?  I understand things like
> >> numpy do this for some of their functions.  I'm happy to attempt it
> >> myself if someone who knows about these things can advise that it
> >> could be done, it would help, and they could provide a few pointers.
> >>
> >> Thanks,
> >> Dave
> >>
> >> --
> >>
> >> David Cosgrove
> >> Freelance computational chemistry and chemoinformatics developer
> >> http://cozchemix.co.uk
> >>
> >> _______________________________________________
> >> Rdkit-discuss mailing list
> >> Rdkit-discuss@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >  --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> >
> > --
> >
> > David Cosgrove
> > Freelance computational chemistry and chemoinformatics developer
> > http://cozchemix.co.uk
> > _______________________________________________
> > Rdkit-discuss mailing list
> > Rdkit-discuss@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to