If you don't mind writing some extra code, we've had good success with a
Monte Carlo implementation of a maximin diversity picker called BigPicker,
described in Blomberg et al, JCAMD, 23, 513-525 (2009).  With this
implementation, you only need to keep the subset distance matrix in memory.
At each step, one of the 2 molecules involved in the shortest subset
distance is swapped out, and a randomly chosen molecule from the pool
replaces it.  The relevant row/column of the subset distance matrix is
updated and the new minimum interdistance found.  A Monte Carlo criterion
is used to decide whether to accept the swap or not.  As the name suggests,
it can be used on very large datasets. Indeed, in our implementation we
allowed for the case where the subset was too large for the subset distance
matrix to be held in memory and the minimum distance was calculated from
fingerprints on the the fly at each step.  That really was slow, but if
it's the only way of solving the problem...  It's worth recognising that
this sort of algorithm spends a lot of time mucking about improving the
interdistance in the 4th or 5th decimal place. It's not clear that a subset
with a minimum interdistance of 0.41567 is definitively better than one of
0.41568, so a fairly loose convergence criterion is usually ok.  In our
experience a larger number of shorter runs, to avoid convergence on a bad
local minimum, is more reliable.

Having said all that, I'd be inclined to agree with Greg that if you're
only picking 200 compounds from 26000 you're probably going to do just as
well with a pin. You could be slightly cleverer by only accepting the next
random selection if it's above a threshold distance from anything you've
already selected to avoid the pathological case he describes.

Dave



On Thu, Jul 17, 2014 at 5:19 AM, Greg Landrum <greg.land...@gmail.com>
wrote:

> one other short thing.
> If this is the code you are using for the distance matrix:
>
> On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy <mla...@gmail.com> wrote:
>
>>
>> dm=[]
>> for i,fp in enumerate(zims_fps[:26000]):     # only 1000 in the demo (in
>> the interest of time)
>>
>> dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[1+1:26000],returnDistance=True))
>> dm = array(dm)
>>
>
> Then at least part of the problem is that you are generating the full
> matrix. I think you intend to have:
>
> dm.extend(DataStructs.BulkTanimotoSimilarity(fp,zims_fps[i+1:26000],returnDistance=True))
> in there.
>
> That typo was in the original notebook that you used; I'm going to have to
> figure out how to fix that.
>
> -greg
>
>
>
> ------------------------------------------------------------------------------
> Want fast and easy access to all the code in your enterprise? Index and
> search up to 200,000 lines of code with a free copy of Black Duck
> Code Sight - the same software that powers the world's largest code
> search on Ohloh, the Black Duck Open Hub! Try it now.
> http://p.sf.net/sfu/bds
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
------------------------------------------------------------------------------
Want fast and easy access to all the code in your enterprise? Index and
search up to 200,000 lines of code with a free copy of Black Duck
Code Sight - the same software that powers the world's largest code
search on Ohloh, the Black Duck Open Hub! Try it now.
http://p.sf.net/sfu/bds
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to