You could consider using FAISS. An example of clustering 2.1M compounds is described at http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:

> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
> Using BulkTanimoto is not possible due to the resources that are required. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each of these has its own limitations: stating the number of
> clusters for k-medoids and not obtaining centroids for DBSCAN.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l .
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
> Regards,
> Tristan
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

--
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>