Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Andrew Dalke
On Jun 5, 2017, at 11:02, Michał Nowotka wrote:
> Is there anyone who has actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share code and
> some performance statistics?
Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M comp

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread David Cosgrove
Hi, I have used this algorithm for many years to cluster sets of several million compounds. Indeed, I am old enough to know it as the Taylor algorithm. It is slow but reliable. A crucial setting is the similarity threshold for the clusters, which dictates the size of the neighbour lists and
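To illustrate how the similarity threshold drives the neighbour lists and hence the clusters, here is a minimal pure-Python sketch of a Taylor-Butina style clustering. It is not the code used by anyone in this thread. It assumes fingerprints are given as sets of on-bit indices, and the function names (tanimoto, neighbour_lists, taylor_butina) are made up for this example. It is also a greedy variant that recounts neighbours among the still-unassigned compounds in every round, and the all-pairs neighbour search is O(n^2), so it is only suitable for small sets.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def neighbour_lists(fps: List[Set[int]], threshold: float) -> List[Set[int]]:
    """For each fingerprint, collect the indices of all others whose
    similarity is at least `threshold` (naive all-pairs comparison)."""
    nbrs: List[Set[int]] = [set() for _ in fps]
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            if tanimoto(fps[i], fps[j]) >= threshold:
                nbrs[i].add(j)
                nbrs[j].add(i)
    return nbrs


def taylor_butina(nbrs: List[Set[int]]) -> List[List[int]]:
    """Greedy clustering: repeatedly pick the unassigned compound with the
    most unassigned neighbours as centroid (ties go to the lowest index),
    and assign it together with those neighbours to a new cluster."""
    unassigned = set(range(len(nbrs)))
    clusters: List[List[int]] = []
    while unassigned:
        centroid = max(unassigned,
                       key=lambda i: (len(nbrs[i] & unassigned), -i))
        members = [centroid] + sorted(nbrs[centroid] & unassigned)
        clusters.append(members)
        unassigned.difference_update(members)
    return clusters


# Toy fingerprints (hypothetical data, not real molecules).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
       {10, 11, 12}, {10, 11, 13}, {20}]

loose = taylor_butina(neighbour_lists(fps, threshold=0.5))
tight = taylor_butina(neighbour_lists(fps, threshold=0.7))
print(loose)  # [[0, 1, 2], [3, 4], [5]]
print(tight)  # [[0], [1], [2], [3], [4], [5]]
```

Raising the threshold from 0.5 to 0.7 empties every neighbour list in this toy set, so all compounds become singletons. A lower threshold has the opposite effect: longer neighbour lists, more memory, and fewer, larger clusters.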

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Nils Weskamp
Hi Michal, I have done this a couple of times for compound sets up to 10M+ using a simplified variant of the Taylor-Butina algorithm. The overall run time was in the range of hours to a few days (which could probably be optimized, but was fast enough for me). As you correctly mentioned, getting t

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Chris Swain
Hi, I’m just starting, but I can add another example. I tried the clustering as described for the Butina clustering (http://www.rdkit.org/docs/Cookbook.html) using a Jupyter Notebook. It worked fine on data sets < 10,000 molecules, but the kernel crashed when I tr

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Michał Nowotka
Hi, Is there anyone who has actually done this: clustered >2M compounds using any well-known clustering algorithm and is willing to share code and some performance statistics? It's easy to get a sparse distance matrix using chemfp. But if you take this matrix and feed it into any scipy.cluster you
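For a sense of scale, here is a back-of-the-envelope estimate of what a dense distance matrix would cost for 2M compounds. This may or may not be the problem the message goes on to describe. The sketch assumes the condensed (upper-triangle) layout that scipy.cluster.hierarchy.linkage accepts and 8-byte floats. The helper name condensed_matrix_bytes is made up for this example.

```python
def condensed_matrix_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to store all n*(n-1)/2 pairwise distances densely."""
    return n * (n - 1) // 2 * bytes_per_value


n = 2_000_000
size = condensed_matrix_bytes(n)
print(f"{n:,} compounds -> {size / 1e12:.1f} TB")  # 2,000,000 compounds -> 16.0 TB
```

At roughly 16 TB the dense form is far beyond ordinary RAM, which is why a sparse, thresholded representation is attractive at this size.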

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Gonzalo Colmenarejo
Hi Chris, as far as I know, Butina's sphere exclusion algorithm is the fastest for very large datasets. But if you have 4 million compounds, using RDKit directly can result in very long runs, even after parallelization. For that number of molecules I think there are faster things, like chemfp (se