FWIW: I checked in a lazy bit vector picker this weekend. This still uses
the MaxMin algorithm, but does not require either a python callback
function or pre-computation of the distance matrix.
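For anyone curious what the "lazy" part buys you, here is a minimal pure-Python sketch of MaxMin picking that computes distances on the fly and keeps only one float per molecule in memory. It is not the RDKit implementation. Fingerprints are assumed to be plain Python ints used as bitsets, and `tanimoto_distance` and `lazy_maxmin_pick` are made-up names for illustration. In RDKit itself the picker should be reachable as `MaxMinPicker().LazyBitVectorPick(fps, poolSize, pickSize, seed=...)`, but please check the documentation for your version.

```python
import random


def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return 1.0 - bin(a & b).count("1") / union


def lazy_maxmin_pick(fps, n_pick, seed=42):
    """Greedy MaxMin selection without a precomputed distance matrix.

    Keeps, for every item, only its distance to the nearest item picked
    so far, so memory is O(n) instead of O(n^2).
    """
    rng = random.Random(seed)
    first = rng.randrange(len(fps))
    picks = [first]
    # min_dist[i] = distance from item i to its nearest pick so far
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    min_dist[first] = -1.0  # negative marks "already picked"
    while len(picks) < min(n_pick, len(fps)):
        # next pick: the item whose nearest pick is farthest away
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picks.append(nxt)
        for i, fp in enumerate(fps):
            if min_dist[i] >= 0.0:
                d = tanimoto_distance(fp, fps[nxt])
                if d < min_dist[i]:
                    min_dist[i] = d
        min_dist[nxt] = -1.0
    return picks


# Demo on random 64-bit "fingerprints" (made-up data, not real molecules).
demo_rng = random.Random(0)
demo_fps = [demo_rng.getrandbits(64) for _ in range(1000)]
demo_picks = lazy_maxmin_pick(demo_fps, 10)
print(demo_picks)
```

The cost is one pass over the pool per pick, so roughly n × k distance evaluations for k picks from n molecules, in exchange for never holding the n × n matrix.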
There's some sample code for using it (along with timing data) here:
If you don't mind writing some extra code, we've had good success with a
Monte Carlo implementation of a maximin diversity picker called BigPicker,
described in Blomberg et al., JCAMD, 23, 513-525 (2009). With this
implementation, you only need to keep the subset distance matrix in memory.
At each
Hi Greg,
Thanks! My Python is really rusty at the moment, so I am always unsure if
I am just not going through the steps in the most efficient manner, or if
the path that I am following is far from ideal. Granted, I would prefer to
write this in Java, but I should probably move back towards C++
Hi Dave,
That's interesting, and I'll look into it. As I wrote to Greg, I am an
aficionado of the Gobbi-Lee method described here (since we are sharing
our favourite methods):
http://pubs.acs.org/doi/abs/10.1021/ci025554v
While the set I am looking at at the moment contains only 26K molecules, I
Hi all,
I have been playing with the diversity selection in RDKit. I am running
through a set of ~26,000 molecules to pick a set of 200 diverse molecules.
I saw some examples of how to do this in Python (my variant of their script
below), but the memory consumption is massive. I burned through
Matthew,
Two lines of shameless self-promotion:
This is exactly the kind of problem for Diversity Genie -
http://www.diversitygenie.com/
It is using RDKit library underneath, but wraps it in a simple, easy to use
GUI front-end.
Best regards,
Igor
On Wed, Jul 16, 2014 at 6:18 PM, Matthew Lardy
Hi Igor,
Thanks! Maybe I am a throwback, but I prefer the command line to a GUI.
Still I'll give it a whirl! :)
If you are handling millions of molecules without issue, then my Python
skills are really, really rusty. Or, I shouldn't be using Python to
handle this much data. :)
Thanks for
Try using parentheses instead of square brackets. This turns the list
comprehensions into generator expressions
(https://wiki.python.org/moin/Generators), which take up almost no memory.
Haven’t tested it, but here’s how it would impact your code:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import
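To illustrate the parentheses-versus-brackets point independently of RDKit, here is a small stdlib-only sketch. The variable names are made up for the example. It shows that the generator object stays tiny while the list grows with the number of items, and also the catch: a generator can only be consumed once, so anything downstream that needs every item at the same time will still end up holding them all.

```python
import sys

n = 100000

# Square brackets: the whole list is built and held in memory.
squares_list = [i * i for i in range(n)]

# Parentheses: items are produced one at a time, on demand.
squares_gen = (i * i for i in range(n))

list_bytes = sys.getsizeof(squares_list)  # container only, not the ints inside
gen_bytes = sys.getsizeof(squares_gen)
print("list container bytes:", list_bytes)
print("generator bytes:", gen_bytes)

# Consuming the generator works like iterating the list...
total = sum(squares_gen)

# ...but it is now exhausted; a second pass yields nothing.
second_pass = sum(squares_gen)
```

This is consistent with memory dropping while the file is read and then climbing again later: the generator saves memory only until something materializes the full result.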
Hi Markus,
It looks like the memory consumption (initially) drops. Still it gets out
of control, likely after the file is read.
Here is the file info:
-rw-rw-r--. 1 mlardy mlardy 1.6M Jul 16 16:40 a.sdf.gz
Looking into Patrick's suggestion, I got the first error:
NameError: name 'array' is not
On Thu, Jul 17, 2014 at 1:58 AM, Matthew Lardy mla...@gmail.com wrote:
> It looks like the memory consumption (initially) drops. Still it gets out
> of control, likely after the file is read.
That is, most likely, due to the fact that the distance matrix itself is huge.
Still, 26K molecules should
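A quick back-of-the-envelope check of how big that matrix gets, as a sketch. The helper `n_pairs` is a made-up name, and the per-entry sizes are whatever the running interpreter reports (typically 8 bytes for a packed double, and a pointer plus a float object for a plain Python list on 64-bit CPython). For 26,000 molecules the lower triangle holds 337,987,000 distances, which is roughly 2.7 GB as packed 8-byte doubles and about four times that as a list of Python floats.

```python
import struct
import sys
from array import array


def n_pairs(n):
    """Number of unique i < j pairs among n items (lower triangle, no diagonal)."""
    return n * (n - 1) // 2


n_mols = 26000
pairs = n_pairs(n_mols)

# Packed C doubles, e.g. array('d') or a float64 array: itemsize bytes each.
packed_bytes = pairs * array("d").itemsize

# Plain Python list of floats: one pointer slot in the list plus one
# float object per entry (list over-allocation is ignored here).
per_entry = struct.calcsize("P") + sys.getsizeof(1.0)
list_bytes = pairs * per_entry

print("pairs:", pairs)
print("packed doubles: %.1f GB" % (packed_bytes / 1e9))
print("list of floats: %.1f GB" % (list_bytes / 1e9))
```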
one other short thing.
If this is the code you are using for the distance matrix:
On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy mla...@gmail.com wrote:
dm=[]
for i,fp in enumerate(zims_fps[:26000]): # only 1000 in the demo (in the interest of time)