Hi RDKitlers,
Recently I had to cluster a large number of compounds and took a shot at
Butina clustering. I modified RDKit's included Butina clustering to
use the Bulk*Similarity functions and to use multiple Python processes
to calculate the distance matrix more quickly. Although these are very naive
improvements, I'm able to read in ChEMBL, generate the Morgan fingerprints,
and cluster 1.3M compounds within 2 hours using 32 cores and 8 GB of RAM. I
thought these modifications might come in handy for other people doing the
same kind of clustering directly within RDKit, and I would like to share the
code. Would the RDKit Contrib folder be the right place for this kind of
work?
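
To illustrate the general idea (this is not the modified code itself), here is
a minimal pure-Python sketch of Butina clustering driven by row-wise "bulk"
similarity calls. Everything in it is an assumption made for illustration:
fingerprints are plain Python sets of on-bit indices standing in for RDKit bit
vectors, and the helper names (tanimoto, bulk_tanimoto, neighbor_lists, butina)
are hypothetical. With RDKit, the bulk step would be a call such as
DataStructs.BulkTanimotoSimilarity. The sketch also keeps only the neighbors
within the cutoff instead of a full distance matrix, which is one possible way
to limit memory and not necessarily what the modified code does.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def bulk_tanimoto(query, others):
    """Stand-in for a bulk call: one query fingerprint against many others."""
    return [tanimoto(query, fp) for fp in others]


def neighbor_lists(fps, cutoff):
    """For every fingerprint, collect the indices of all others whose
    Tanimoto distance (1 - similarity) is at most `cutoff`."""
    neighbors = [[] for _ in fps]
    for i in range(1, len(fps)):
        # One bulk call per row of the lower triangle. Rows do not depend
        # on each other, so they could be handed to separate processes.
        sims = bulk_tanimoto(fps[i], fps[:i])
        for j, sim in enumerate(sims):
            if 1.0 - sim <= cutoff:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors


def butina(fps, cutoff=0.35):
    """Greedy Butina clustering: visit points from most to fewest neighbors;
    each unassigned point becomes a centroid and claims its unassigned
    neighbors. Returns a list of tuples with the centroid first."""
    neighbors = neighbor_lists(fps, cutoff)
    # Ties are broken by index here; this is an arbitrary choice.
    order = sorted(range(len(fps)), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * len(fps)
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        members = [centroid] + [j for j in neighbors[centroid] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(tuple(members))
    return clusters


# Toy fingerprints (made up for this example).
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3, 4, 5},
    {10, 11, 12},
    {10, 11, 12, 13},
    {20, 21},
]
clusters = butina(fps, cutoff=0.35)
print(clusters)
```

Since each row of the lower triangle is computed independently, the loop in
neighbor_lists is the natural place to split the work across processes, with
the per-row results merged afterwards.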
Cheers,
Thomas