Re: [Rdkit-discuss] MaxMin Picker and Python

2014-08-11 Thread Greg Landrum
FWIW: I checked in a lazy bit vector picker this weekend. This still uses the MaxMin algorithm, but does not require either a python callback function or pre-computation of the distance matrix. There's some sample code for using it (along with timing data) here:
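To illustrate what "no pre-computed distance matrix" means, here is a minimal pure-Python sketch of a MaxMin pick that computes distances only when they are needed. It is not RDKit's implementation: the names `tanimoto_distance` and `lazy_maxmin_pick` are hypothetical, and fingerprints are assumed to be stored as plain Python integers used as bit sets.

```python
import random


def tanimoto_distance(a: int, b: int) -> float:
    """1 - Tanimoto similarity for fingerprints stored as ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return 1.0 - bin(a & b).count("1") / union


def lazy_maxmin_pick(fps, n_pick, seed=0):
    """Return indices of up to n_pick diverse items from fps.

    Only one value per item is stored (its distance to the nearest
    already-picked item), so memory is O(n) rather than O(n^2).
    """
    rng = random.Random(seed)
    first = rng.randrange(len(fps))
    picked = [first]
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    min_dist[first] = -1.0  # mark as picked so it is never chosen again
    while len(picked) < min(n_pick, len(fps)):
        # Next pick: the item farthest from everything picked so far.
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        picked.append(nxt)
        min_dist[nxt] = -1.0
        # Distances to the new pick are computed on the fly, then discarded.
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


fps = [0b1111, 0b1110, 0b0001, 0b110000]
picks = lazy_maxmin_pick(fps, 3, seed=1)
```

The trade-off is that each distance may be recomputed several times, which costs CPU time in exchange for the much smaller memory footprint.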

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-17 Thread David Cosgrove
If you don't mind writing some extra code, we've had good success with a Monte Carlo implementation of a maximin diversity picker called BigPicker, described in Blomberg et al, JCAMD, 23, 513-525 (2009). With this implementation, you only need to keep the subset distance matrix in memory. At each
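As a rough illustration of keeping only the subset distance matrix in memory, here is a generic stochastic-swap sketch. It is not the BigPicker algorithm from the paper: the function name `mc_maximin`, the acceptance rule (keep a swap only if the minimum pairwise distance does not get worse) and the step count are all assumptions made for the example.

```python
import random


def mc_maximin(points, k, dist, n_steps=2000, seed=0):
    """Search for k points whose smallest pairwise distance is large (k >= 2)."""
    rng = random.Random(seed)
    subset = rng.sample(range(len(points)), k)
    # Only the k x k distance matrix of the current subset is stored.
    sub_dm = [[dist(points[a], points[b]) for b in subset] for a in subset]

    def score(dm):
        # Smallest off-diagonal entry (lower triangle only).
        return min(dm[i][j] for i in range(k) for j in range(i))

    best = score(sub_dm)
    for _ in range(n_steps):
        pos = rng.randrange(k)             # subset slot to replace
        cand = rng.randrange(len(points))  # proposed replacement
        if cand in subset:
            continue
        new_row = [dist(points[cand], points[s]) for s in subset]
        new_row[pos] = 0.0                 # distance of the candidate to itself
        trial = [row[:] for row in sub_dm]
        trial[pos] = new_row
        for i in range(k):
            trial[i][pos] = new_row[i]
        s = score(trial)
        if s >= best:                      # accept only non-worsening swaps
            subset[pos] = cand
            sub_dm, best = trial, s
    return subset, best


points = list(range(10))
subset, best = mc_maximin(points, 3, lambda a, b: abs(a - b), n_steps=500, seed=3)
```

Each step needs only k new distances, so memory stays at k x k no matter how large the full pool is.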

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-17 Thread Matthew Lardy
Hi Greg, Thanks! My Python is really rusty at the moment, so I am always unsure if I am just not going through the steps in the most efficient manner, or if the path that I am following is far from ideal. Granted I would prefer to write this in Java, but I should probably move back towards C++

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-17 Thread Matthew Lardy
Hi Dave, That's interesting, and I'll look into it. As I wrote to Greg, I am an aficionado of the Gobbi Lee method described here (since we are sharing our favourite methods): http://pubs.acs.org/doi/abs/10.1021/ci025554v While the set I am looking at at the moment contains only 26K molecules, I

[Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Matthew Lardy
Hi all, I have been playing with the diversity selection in RDKit. I am running through a set of ~26,000 molecules to pick a set of 200 diverse molecules. I saw some examples of how to do this in Python (my variant of their script below), but the memory consumption is massive. I burned through

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Igor Filippov
Matthew, Two lines of shameless self-promotion: This is exactly the kind of problem for Diversity Genie - http://www.diversitygenie.com/ It is using RDKit library underneath, but wraps it in a simple, easy to use GUI front-end. Best regards, Igor On Wed, Jul 16, 2014 at 6:18 PM, Matthew Lardy

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Matthew Lardy
Hi Igor, Thanks! Maybe I am a throwback, but I prefer the command line to a GUI. Still, I'll give it a whirl! :) If you are handling millions of molecules without issue, then my Python skills are really, really rusty. Or I shouldn't be using Python to handle this much data. :) Thanks for

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Patrick Fuller
Try using parentheses instead of square brackets. This turns the list comprehensions into generator expressions (https://wiki.python.org/moin/Generators), which take up almost no memory. I haven't tested it, but here's how it would change your code:
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import
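A small standalone sketch of the difference, using made-up data rather than the fingerprints from the thread:

```python
import sys

values = range(100_000)

# Square brackets: every result is built and held in memory at once.
as_list = [v * v for v in values]

# Parentheses: results are produced one at a time, on demand.
as_gen = (v * v for v in values)

list_bytes = sys.getsizeof(as_list)  # grows with the number of elements
gen_bytes = sys.getsizeof(as_gen)    # small and constant

total = sum(as_gen)     # consumes the generator
leftover = sum(as_gen)  # the generator is now exhausted
```

Two caveats: a generator can be iterated only once and supports neither `len()` nor indexing, so it saves memory only if the consumer does not turn it back into a full list or matrix.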

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Matthew Lardy
Hi Markus, It looks like the memory consumption (initially) drops. Still it gets out of control, likely after the file is read. Here is the file info:
-rw-rw-r--. 1 mlardy mlardy 1.6M Jul 16 16:40 a.sdf.gz
Looking into Patrick's suggestion, I got the first error:
NameError: name 'array' is not

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Greg Landrum
On Thu, Jul 17, 2014 at 1:58 AM, Matthew Lardy mla...@gmail.com wrote:
> It looks like the memory consumption (initially) drops. Still it gets out of control, likely after the file is read.
That is, most likely, due to the fact that the distance matrix itself is huge. Still, 26K molecules should
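For a sense of scale, the size of a lower-triangle distance matrix can be estimated with a few lines of arithmetic. The helper name `condensed_matrix_bytes` is hypothetical, and 8 bytes per value assumes the distances are stored as packed doubles.

```python
def condensed_matrix_bytes(n, bytes_per_value=8):
    """Number of pairwise distances for n items, and their raw storage size."""
    pairs = n * (n - 1) // 2
    return pairs, pairs * bytes_per_value


pairs, raw_bytes = condensed_matrix_bytes(26_000)
print(f"{pairs:,} distances, {raw_bytes / 1024**3:.1f} GiB as packed doubles")
```

For 26,000 molecules this gives 337,987,000 distances, roughly 2.5 GiB as packed doubles. A plain Python list of float objects typically needs several times more, since each entry also carries per-object overhead.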

Re: [Rdkit-discuss] MaxMin Picker and Python

2014-07-16 Thread Greg Landrum
One other short thing. If this is the code you are using for the distance matrix:
On Thu, Jul 17, 2014 at 12:18 AM, Matthew Lardy mla...@gmail.com wrote:
> dm=[]
> for i,fp in enumerate(zims_fps[:26000]): # only 1000 in the demo (in the interest of time)