Hi,

I am attempting to cluster a database of roughly 4M small molecules and have hit several snags. Computing the full pairwise similarity matrix with RDKit's BulkTanimotoSimilarity is not feasible because of the resources it would require. I am now working with FPSim2 and chemfp to obtain a sparse distance matrix instead. However, I am finding it very challenging to identify an appropriate clustering algorithm. I have considered both k-medoids and DBSCAN, and each has its own limitation: k-medoids requires the number of clusters to be stated up front, and DBSCAN does not return centroids.
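To make the DBSCAN limitation concrete: one possible workaround is to pick a medoid for each cluster after the fact, directly from the sparse matrix. The following is only a minimal sketch in plain Python, not a tested pipeline. The function name `cluster_medoids`, the dict-of-pairs layout, and the toy data are all hypothetical. It assumes the labels follow the DBSCAN convention of -1 for noise, that each unordered pair is stored once, and that any pair missing from the sparse matrix can be treated as distance 1.0 (the maximum Tanimoto distance).

```python
from collections import defaultdict


def cluster_medoids(labels, sparse_dist, missing=1.0):
    """Return {cluster_label: index of medoid} for each non-noise cluster.

    labels      -- one cluster label per molecule, -1 meaning noise
    sparse_dist -- {(i, j): distance}, each unordered pair stored once
    missing     -- distance assumed for pairs absent from sparse_dist
    """
    # Group molecule indices by cluster, skipping noise points.
    members = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab != -1:
            members[lab].append(idx)

    # Accumulate the stored within-cluster distances for every point.
    stored_sum = defaultdict(float)
    stored_cnt = defaultdict(int)
    for (i, j), d in sparse_dist.items():
        if i != j and labels[i] == labels[j] and labels[i] != -1:
            stored_sum[i] += d
            stored_sum[j] += d
            stored_cnt[i] += 1
            stored_cnt[j] += 1

    # The medoid minimises the total distance to the other cluster members;
    # pairs that were not stored are charged the `missing` distance.
    medoids = {}
    for lab, idxs in members.items():
        n_other = len(idxs) - 1

        def total(i):
            return stored_sum[i] + missing * (n_other - stored_cnt[i])

        medoids[lab] = min(idxs, key=total)
    return medoids


# Toy example: six molecules, two clusters, one noise point.
labels = [0, 0, 0, 1, 1, -1]
sparse_dist = {(0, 1): 0.2, (1, 2): 0.3, (3, 4): 0.1, (2, 5): 0.4}
medoids = cluster_medoids(labels, sparse_dist)
print(medoids)
```

This only needs one pass over the stored pairs, so it should scale with the number of non-zero entries rather than with the square of the dataset size, but it does not address how to choose the DBSCAN parameters in the first place.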
I was wondering whether there is an implementation of stochastic cluster analysis, as described in https://doi.org/10.1021/ci970056l . Any suggestions on the best method for clustering large datasets, ideally with code examples, would be greatly appreciated. I am new to the subject and would be grateful for any help.

Regards,
Tristan