Thanks, Andrew! Yes, I was also thinking about using scikit-learn. I guess I will need a sparse matrix data structure and a function that defines the connectivity. I hope memory won't be a problem. Most agglomerative clustering algorithms have O(N^2) time complexity. Will that be a problem for 1M molecules?
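
To put rough numbers on the memory question, here is a back-of-the-envelope sketch in Python. The function names, the 4-byte value and index sizes, and the figure of 100 retained neighbors per molecule are all assumptions chosen for illustration, not measurements from any real library.

```python
def dense_matrix_bytes(n, bytes_per_value=4):
    """Bytes for the condensed (upper-triangle) distance matrix of n items."""
    return n * (n - 1) // 2 * bytes_per_value


def sparse_matrix_bytes(n, avg_neighbors, bytes_per_value=4, bytes_per_index=4):
    """Rough bytes for a CSR matrix keeping avg_neighbors entries per row."""
    nnz = n * avg_neighbors
    # CSR stores one value and one column index per entry,
    # plus n + 1 row pointers.
    return nnz * (bytes_per_value + bytes_per_index) + (n + 1) * bytes_per_index


n = 1_000_000
# Every pair stored once as a 32-bit float: about 2 TB.
print(dense_matrix_bytes(n) / 1e12, "TB")
# Assumed 100 neighbors above the threshold per molecule: under 1 GB.
print(sparse_matrix_bytes(n, 100) / 1e9, "GB")
```

The sparse estimate scales with the number of retained pairs, so it depends heavily on the threshold and on how similar the molecules in the library are.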
Best,
Jing

On Sun, Aug 23, 2015 at 3:13 AM, Andrew Dalke <da...@dalkescientific.com> wrote:

> On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
> > If I want to cluster more than 1M molecules by ECFP4. How could I do
> > it? If I calculate the distance between every pair of molecules, the
> > size of distance matrix will be too big. Does RDKit support any
> > heuristic clustering algorithm without calculating the distance
> > matrix of the whole library?
>
> You should look to a third-party package, like scikit-learn from
> http://scikit-learn.org/ , for clustering. That has a very extensive
> set of clustering algorithms, including k-means at
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .
>
> Though you may be interested in the note on that page: "For large scale
> learning (say n_samples > 10k) MiniBatchKMeans is probably much faster
> than the default batch implementation."
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
>
> My memory is that k-means depends on a Euclidean distance between the
> records, which is different from the usual Tanimoto (or metric-like
> 1-Tanimoto) in cheminformatics.
>
> If you would rather use Tanimoto, then perhaps try a method like
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html ?
>
> If you go that route, and you want to build the full 1M x 1M distance
> matrix, the usual approach is to ignore similarities below a given
> threshold T (e.g., T<0.8). This can be thought of as either setting
> those entries to 0.0, or specifying an "ignore" flag. In either case,
> the result can be stored in a sparse matrix, which is efficient at
> storing only the data of interest.
>
> Using my package, chemfp, from http://chemfp.com/ you can compute a
> sparse matrix for 1M x 1M fingerprints in about an hour using a laptop
> or desktop.
>
> The question would then be how to adapt the sparse output format from
> chemfp to the sparse input format for your clustering method of choice.
>
> Best regards,
>
> Andrew
> da...@dalkescientific.com
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
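
As a sketch of the adaptation step mentioned in the quoted message, the Python below assumes the thresholded similarities are already available as (i, j, similarity) triples, each pair listed once. That triple layout, the helper name similarity_matrix, and the toy values are hypothetical; chemfp's actual output format is not shown in this thread. The triples are loaded into a scipy.sparse CSR matrix, and the connected components of that matrix are used as clusters. This is single-linkage clustering cut at the threshold, swapped in here because it needs only the sparse matrix; it is not scikit-learn's AgglomerativeClustering.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def similarity_matrix(triples, n):
    """Build a symmetric n x n CSR matrix from (i, j, similarity) triples.

    Each unordered pair must appear only once, with i != j.
    """
    rows, cols, vals = zip(*triples)
    upper = coo_matrix(
        (np.array(vals, dtype=np.float32), (np.array(rows), np.array(cols))),
        shape=(n, n),
    )
    # Mirror the entries so that both (i, j) and (j, i) are stored.
    return (upper + upper.T).tocsr()


# Hypothetical toy data: only pairs with similarity >= 0.8 were kept.
triples = [(0, 1, 0.92), (1, 2, 0.85), (3, 4, 0.81)]
sim = similarity_matrix(triples, n=6)

# Molecules linked by a chain of retained pairs share a cluster label;
# a molecule with no retained neighbor becomes a singleton.
n_clusters, labels = connected_components(sim, directed=False)
print(n_clusters, labels)
```

Single linkage tends to chain loosely related molecules together, so the resulting clusters may be much larger than those from other linkage criteria at the same threshold.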