Thank you both for the feedback. My primary aim is to run a ligand-based virtual screening (LBVS) experiment, specifically a similarity search, using a set of actives and the dataset of cluster representatives.
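
To make the intended experiment concrete, here is a minimal sketch of a similarity search of that kind. It is only an illustration under stated assumptions: fingerprints are represented as plain sets of on-bit indices, the names `tanimoto` and `rank_by_max_similarity` are hypothetical helpers, the toy fingerprints are made up, and scoring each representative by its best match to any active (MAX fusion) is one possible choice rather than a settled protocol. With real data the fingerprints would come from a toolkit such as RDKit, and the search itself from chemfp or FPSim2.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is modelled as the set of its on-bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared on-bits divided by the union of on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_max_similarity(
    actives: List[Fingerprint],
    representatives: Dict[str, Fingerprint],
) -> List[Tuple[str, float]]:
    """Score each representative by its best Tanimoto to any active,
    then sort from most to least similar (MAX fusion)."""
    scores = [
        (name, max(tanimoto(fp, active) for active in actives))
        for name, fp in representatives.items()
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores


# Toy data, purely illustrative.
actives = [frozenset({1, 2, 3, 4}), frozenset({10, 11, 12})]
representatives = {
    "rep_a": frozenset({1, 2, 3, 5}),  # shares 3 of 5 bits with the first active
    "rep_b": frozenset({10, 11, 12}),  # identical to the second active
    "rep_c": frozenset({20, 21}),      # shares nothing with either active
}

ranking = rank_by_max_similarity(actives, representatives)
for name, score in ranking:
    print(f"{name}\t{score:.2f}")
```

The top of the ranked list would then be the candidates to follow up, for example by expanding each high-scoring representative back into the members of its cluster.
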

On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:

> For me, a lot of this depends on what you intend to do with the
> clustering. If you want to pick a "representative" subset from a larger
> dataset, k-means may do the trick. As Rajarshi mentioned, Practical
> Cheminformatics has a k-means implementation that runs with FAISS.
> Depending on your goal, choosing a subset with a diversity picker may fit
> the bill. One annoying aspect of diversity pickers is that the initial
> selections tend to consist of strange molecules.
>
> @Tristen can you provide more information on what you want to do with the
> clustering results?
>
> Pat
>
> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com>
> wrote:
>
>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>> described at
>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>
>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>> tristan.camilleri...@um.edu.mt> wrote:
>>
>>> Hi,
>>>
>>> I am attempting to cluster a database of circa 4M small molecules and I
>>> have hit several snags.
>>> Using BulkTanimoto is not possible due to resources that are required. I
>>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>>> matrix). However, I am finding it very challenging to identify an
>>> appropriate clustering algorithm. I have considered both k-medoids and
>>> DBSCAN. Each of these has its own limitations, stating the number of
>>> clusters for k-medoids and not obtaining centroids for DBSCAN.
>>>
>>> I was wondering whether there is an implementation of the stochastic
>>> clustering analysis for clustering purposes, described in
>>> https://doi.org/10.1021/ci970056l .
>>>
>>> Any suggestions on the best method for clustering large datasets, with
>>> code suggestions, would be greatly appreciated. I am new to the subject and
>>> would appreciate any help.
>>>
>>> Regards,
>>> Tristan
>>>
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>> --
>> Rajarshi Guha | http://blog.rguha.net | @rguha
>> <https://twitter.com/rguha>
>>
>