Thanks for the feedback. I do not have an explicit need to perform clustering; it is more that I want to learn how to do it.
Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters <wpwalt...@gmail.com> wrote:

> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2. Do you need to do the clustering?
>
> Here are a couple of relevant blog posts.
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
>
> Pat
>
> On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Thank you both for the feedback.
>>
>> My primary aim is to run an LBVS experiment (similarity search) using a
>> set of actives and the dataset of cluster representatives.
>>
>> On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:
>>
>>> For me, a lot of this depends on what you intend to do with the
>>> clustering. If you want to pick a "representative" subset from a larger
>>> dataset, k-means may do the trick. As Rajarshi mentioned, Practical
>>> Cheminformatics has a k-means implementation that runs with FAISS.
>>> Depending on your goal, choosing a subset with a diversity picker may fit
>>> the bill. One annoying aspect of diversity pickers is that the initial
>>> selections tend to consist of strange molecules.
>>>
>>> @Tristan can you provide more information on what you want to do with
>>> the clustering results?
>>>
>>> Pat
>>>
>>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com>
>>> wrote:
>>>
>>>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>>>> described at
>>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>>
>>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>>>> tristan.camilleri...@um.edu.mt> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am attempting to cluster a database of circa 4M small molecules and
>>>>> I have hit several snags.
>>>>> Using BulkTanimoto is not possible due to the resources that are
>>>>> required. I am now working with fpsim2 and chemfp to get a distance
>>>>> matrix (sparse matrix). However, I am finding it very challenging to
>>>>> identify an appropriate clustering algorithm. I have considered both
>>>>> k-medoids and DBSCAN. Each of these has its own limitations: stating
>>>>> the number of clusters for k-medoids and not obtaining centroids for
>>>>> DBSCAN.
>>>>>
>>>>> I was wondering whether there is an implementation of the stochastic
>>>>> clustering analysis for clustering purposes, described in
>>>>> https://doi.org/10.1021/ci970056l .
>>>>>
>>>>> Any suggestions on the best method for clustering large datasets, with
>>>>> code suggestions, would be greatly appreciated. I am new to the subject
>>>>> and would appreciate any help.
>>>>>
>>>>> Regards,
>>>>> Tristan
>>>>>
>>>>
>>>> --
>>>> Rajarshi Guha | http://blog.rguha.net | @rguha
>>>> <https://twitter.com/rguha>
>>>>
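
For anyone following the thread who wants to see the "diversity picker" idea in code, here is a minimal pure-Python sketch. It is not the stochastic cluster analysis from the cited paper, and it is not taken from any of the linked blog posts. It assumes a MaxMin-style picker, which is only one kind of diversity picker (the thread does not say which one was meant), and it assumes fingerprints are stored as plain Python integers used as bit sets. The function names (tanimoto, maxmin_pick, assign_to_representatives) are made up for illustration. Each pick is the fingerprint farthest from everything already picked, and every fingerprint is then assigned to its most similar pick, so each group has an explicit representative.

```python
from typing import Dict, List


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # assumption: two empty fingerprints count as identical
    return bin(a & b).count("1") / union


def maxmin_pick(fps: List[int], n_pick: int, first: int = 0) -> List[int]:
    """Pick up to n_pick indices, each one farthest from all earlier picks."""
    if not fps or n_pick <= 0:
        return []
    picked = [first]
    # min_dist[i] is the distance from fps[i] to its nearest picked fingerprint
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # everything left duplicates an existing pick
        picked.append(nxt)
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


def assign_to_representatives(fps: List[int], reps: List[int]) -> Dict[int, List[int]]:
    """Group every fingerprint index under its most similar representative."""
    clusters: Dict[int, List[int]] = {r: [] for r in reps}
    for i, fp in enumerate(fps):
        best = max(reps, key=lambda r: tanimoto(fp, fps[r]))
        clusters[best].append(i)
    return clusters


# Toy 8-bit "fingerprints": three share the low bits, two share the high bits.
toy_fps = [0b00001111, 0b00000111, 0b00011111, 0b11110000, 0b11100000]
toy_reps = maxmin_pick(toy_fps, n_pick=2)
toy_clusters = assign_to_representatives(toy_fps, toy_reps)
print(toy_reps, toy_clusters)
```

The second pick is always the fingerprint least similar to the first, which is the behaviour behind the remark above that early selections tend to be unusual molecules. The loop costs roughly N x n_pick similarity calculations, so pure Python like this is only for learning on small sets. For millions of compounds the same idea would need an optimised implementation (RDKit ships a MaxMinPicker, for example) or the FAISS k-means route already suggested.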
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss