On Jun 5, 2017, at 11:02, Michał Nowotka <mmm...@gmail.com> wrote:
> Is there anyone who has actually done this: clustered >2M compounds using
> any well-known clustering algorithm, and is willing to share the code and
> some performance statistics?

Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M compound 
data sets using the Taylor-Butina algorithm.
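
Roughly, the first stage looks like this (a minimal sketch using the chemfp
Python API; the file name and the 0.8 threshold are just placeholders):

import chemfp
from chemfp import search

# Load an FPS file into an in-memory fingerprint arena.
arena = chemfp.load_fingerprints("pubchem_million.fps")

# All-against-all Tanimoto search above the clustering threshold. This
# produces the sparse neighbor lists that Taylor-Butina works from.
results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.8)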

> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster method, you won't get
> any results in a reasonable time.


On Jun 5, 2017, at 10:19, Gonzalo Colmenarejo <colmenarejo.gonz...@gmail.com> 
wrote:
> I think there are faster things, like chemfp (see for instance 
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>  



I've fleshed out that algorithm into a command-line program that can be used
for benchmarking. It's available from
http://dalkescientific.com/writings/taylor_butina.py .

If anyone uses it for benchmarking, or has improvements, let me know. If I get 
useful feedback about it, I'll include it in an upcoming chemfp 1.3 release.
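
For reference, the clustering step itself is along these lines (a sketch based
on the Taylor-Butina example in the chemfp documentation, not the script
verbatim; the script adds profiling, output, and command-line handling):

def taylor_butina_cluster(results):
    # 'results' is a chemfp SearchResults from a symmetric threshold search.
    # Rank every fingerprint by its number of neighbors, largest first, so
    # the densest neighborhoods become cluster centroids first.
    ranked = sorted(((len(indices), i, indices)
                     for (i, indices) in enumerate(results.iter_indices())),
                    reverse=True)

    clusters = []
    singletons = []
    seen = set()
    for (size, centroid, members) in ranked:
        if centroid in seen:
            continue
        seen.add(centroid)
        if size == 0:
            singletons.append(centroid)     # no neighbors at all
            continue
        # Only neighbors not already assigned to an earlier cluster join.
        unassigned = [idx for idx in members if idx not in seen]
        if not unassigned:
            singletons.append(centroid)     # all neighbors already claimed
            continue
        clusters.append((centroid, unassigned))
        seen.update(unassigned)
    return clusters, singletons

A single sort plus one sweep over the hit lists is why this stage takes
seconds while the O(N**2) similarity search takes half an hour.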



On my 7-year-old desktop (3.2 GHz Intel Core i3), with 4 OpenMP threads (the
default), I estimate it would take a bit over an hour to cluster 1M
fingerprints, with 2048 bits/fingerprint, using a threshold of 0.8.

I just ran it now with the following:

% time python taylor_butina.py --profile --threshold 0.8 pubchem_million.fps -o 
pubchem_million.clusters

The profile report says:

#fingerprints: 1000000 #bits/fp: 881 threshold: 0.8 #matches: 667808946
Load time: 4.9 sec memory: 288 MB
Similarity time: 2037.3 sec memory: 8.00 GB
Clustering time: 17.5 sec memory: -405549056 B
Total time: 2061.3 sec
7044.489u 323.528s 34:26.26 356.5%      0+0k 5+13io 233pf+0w

The "load" lines says the data needed 288 MB for 1M 881 bit fingerprints. This 
scales linearly in the number of bits and the number of records. The load time 
is small compared to the similarity time.
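
A quick sanity check on that number (plain Python; chemfp may align or pad
each fingerprint, so treat the overhead estimate as approximate):

num_fps = 1000000
num_bits = 881
fp_bytes = (num_bits + 7) // 8     # 111 bytes of fingerprint data per record
print(num_fps * fp_bytes / 1e6)    # ~111 MB; the rest of the 288 MB is ids
                                   # and per-record bookkeeping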

The "similarity" search took about 8GB for a sparse matrix with 667,808,946 
terms, or about 13 bytes per term (each hit requires an int (4 bytes) and a 
double (8 bytes), plus overhead.) This does not depend on the number of bits in 
the fingerprint, only the number of matches in the matrix.
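
The per-term figure is easy to check (a sketch; I'm not certain whether the
reported 8.00 GB is GB or GiB, so the exact overhead fraction is approximate):

num_matches = 667808946
payload = num_matches * (4 + 8)    # a 4-byte int and an 8-byte double per hit
print(payload / 2.0**30)           # ~7.5 GiB of payload; the remainder of the
                                   # ~8 GB is container overhead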

The "similarity" time is linear in the number of bits, so 2048 bits would be 
about 2.3x slower. It's quadratic in the number of fingerprints, so 2M 
fingerprints would take about 2 hours.
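
Back-of-the-envelope, from the numbers in the measured run above:

sim_time = 2037.3                  # seconds for 1M x 881-bit fingerprints

# Linear in the number of bits:
print(2048 / 881.0)                # ~2.3x, so 1M x 2048-bit fingerprints is
                                   # ~4700 s -- "a bit over an hour"

# Quadratic in the number of fingerprints:
print(sim_time * (2e6 / 1e6)**2)   # ~8150 s for 2M x 881-bit fingerprints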

It took 18 seconds to do the Taylor-Butina clustering, given the sparse matrix. 
On a multi-core machine, if you watch the CPU usage you'll easily see when it 
goes from the multi-threaded similarity search code (in C using OpenMP) to the 
single-threaded clustering code.

The "clustering" line says it took -405549056 bytes. That should be -387 MB. I 
expected this value to be positive. I'm not sure what's happening there to give 
a negative number, but that's not an important term.


Cheers,
                                Andrew
                                da...@dalkescientific.com


