Dear Giovanni,

Many thanks for the very insightful feedback. I had skipped Greg Landrum’s post on sphere exclusion clustering; I will certainly give it a try.
Regards,
Tristan

On Mon, 2 May 2022 at 09:45, Giovanni Tricarico <giovanni.tricar...@glpg.com> wrote:

> Hello Tristan,
>
> I imagine you have seen Greg Landrum’s post on sphere exclusion
> clustering:
> https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
>
> The initial MaxMin selection is very fast, especially Roger Sayle’s
> implementation, and not memory-hungry, and that gives you maximally diverse
> cluster centres (or ‘representatives’ as you call them).
>
> [Yes, it’s true that the centres can be ‘weird’, but if they *are* in
> your set, maybe the issue is in the composition of the set itself, not in
> what you do with it; maybe you should remove singletons (initially picked
> molecules that have Tanimoto distance 1 or very close to 1 from all other
> molecules) upfront].
>
> Unless I am missing something, it sounds surprising that you cannot use
> BulkTanimotoSimilarity to do the clustering.
>
> The size of the similarity matrix is not (4M)^2, but only ~4M *
> len(picks); so unless you picked almost everything…(?)
>
> In fact to counteract even this potential issue, we made our own simple
> variation of the above code that only processes one molecule at a time, and
> applied it without issues to a 1M set; surely you should be able to make a
> temporary list of floating point numbers of length len(picks).
>
> On the other hand, like others also commented, it’s not clear why you want
> to do the clustering in the first place.
>
> Searching by fingerprint similarity in a set of 4 M molecules should not
> be that slow.
>
> *As for:*
>
> *“I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix)”*
>
> IMO you do not necessarily need to have a full N^2 distance matrix to do
> what you describe, as indeed shown by Greg’s post.
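The one-molecule-at-a-time assignment described above might look roughly like the sketch below. It is plain Python so that it runs without dependencies, and it is not the code from Greg's post nor Giovanni's variation: fingerprints are modelled as sets of 'on' bit indices, and `tanimoto`, `assign_to_centres`, the `min_sim` cutoff and the toy data are all illustrative assumptions. With RDKit fingerprints the inner list would presumably come from a single `DataStructs.BulkTanimotoSimilarity(fp, centre_fps)` call. The point is that only a temporary list of length len(picks) exists at any time; with a hypothetical 10,000 picks, even the full 4M * len(picks) table would hold 4e10 values, against 1.6e13 for (4M)^2.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Toy stand-in for a fingerprint: the set of 'on' bit indices (an assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two bit sets (taken as 0.0 if both are empty)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def assign_to_centres(
    fps: Sequence[Fingerprint],
    picks: Sequence[int],
    min_sim: float = 0.35,  # hypothetical cutoff, not taken from the thread
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Assign every molecule to its most similar picked centre.

    Molecules whose best similarity is below min_sim are returned separately.
    """
    if not picks:
        raise ValueError("picks must contain at least one centre index")
    centre_fps = [fps[i] for i in picks]
    clusters: Dict[int, List[int]] = {i: [] for i in picks}
    unassigned: List[int] = []
    for idx, fp in enumerate(fps):
        # Temporary list of len(picks) floats; it is discarded on the next
        # iteration, so no N x len(picks) matrix is ever held in memory.
        sims = [tanimoto(fp, c) for c in centre_fps]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= min_sim:
            clusters[picks[best]].append(idx)
        else:
            unassigned.append(idx)
    return clusters, unassigned


# Toy data: two obvious groups plus one molecule unlike everything else.
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({20, 21}),
]
toy_clusters, toy_unassigned = assign_to_centres(toy_fps, picks=[0, 2], min_sim=0.4)
print(toy_clusters, toy_unassigned)
```

Keeping the unassigned molecules separate is one way to surface the singletons mentioned above instead of forcing them into a cluster.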
>
> And more importantly, a distance matrix for a molecular set is rarely
> ‘sparse’ (where ‘sparse’ in this context means mostly full of 0’s, and with
> only few non-0 elements), because that only happens when most molecules
> have identical FP’s, which should be very rare, unless your set is very
> redundant.
>
> In this sense, I would suggest an alternative clustering approach (if you
> are convinced that you *do* want to do clustering).
>
> Make a sparse feature matrix (binary matrix encoding the ‘on’ FP bits of
> your molecules vs their index).
>
> From that, make not a *distance*, but a (Tanimoto) *similarity* matrix,
> with a threshold below which it is capped to 0 (I can only suggest R
> package proxyC for this
> https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is
> a python implementation, I don’t know).
>
> That matrix can indeed be very sparse: only molecules that have a
> similarity higher than your threshold have a value, the rest is all 0 and
> is not stored. Obviously you should not use too low a threshold, otherwise
> you are back to a dense (4M)^2 matrix, which I assume you cannot make or
> store.
>
> From this (which is essentially an *adjacency* matrix - telling you which
> molecules are or are not ‘linked’) you can do graph representations and
> even clustering, e.g. using igraph (again, I can only suggest an R package
> for that https://igraph.org/r/, https://kateto.net/netscix2016.html).
>
> Hope this helps.
>
> Giovanni
>
> *From:* Tristan Camilleri <tristan.camilleri...@um.edu.mt>
> *Sent:* 02 May 2022 07:03
> *To:* Patrick Walters <wpwalt...@gmail.com>
> *Cc:* RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> *Subject:* Re: [Rdkit-discuss] Clustering
>
> Thanks for the feedback.
> Rather than an explicit need to perform
> clustering, it is more for me to learn how to do it.
>
> Any pointers to this effect would be greatly appreciated.
>
> Tristan
>
> On Sun, 1 May 2022 at 18:18, Patrick Walters <wpwalt...@gmail.com> wrote:
>
> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2. Do you need to do the clustering?
>
> Here are a couple of relevant blog posts.
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
>
> Pat
>
> On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:
>
> Thank you both for the feedback.
>
> My primary aim is to run an LBVS experiment (similarity search) using a
> set of actives and the dataset of cluster representatives.
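A toy sketch of the similarity search step stated above, assuming each cluster representative is scored by its maximum Tanimoto similarity to any active. That scoring rule is chosen here purely for illustration, since the thread does not say how the actives are to be combined. Fingerprints are again modelled as sets of 'on' bit indices, and `tanimoto` and `rank_by_max_similarity` are hypothetical helper names; at the 4M scale a dedicated tool such as chemfp or FPSim2, as suggested above, would be the practical choice.

```python
from typing import FrozenSet, List, Sequence, Tuple

# Toy stand-in for a fingerprint: the set of 'on' bit indices (an assumption).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity of two bit sets (taken as 0.0 if both are empty)."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def rank_by_max_similarity(
    actives: Sequence[Fingerprint],
    representatives: Sequence[Fingerprint],
    top_n: int,
) -> List[Tuple[int, float]]:
    """Return the top_n (representative index, score) pairs, best first.

    The score of a representative is its highest similarity to any active.
    """
    if not actives:
        raise ValueError("at least one active is required")
    scored = [
        (idx, max(tanimoto(rep, act) for act in actives))
        for idx, rep in enumerate(representatives)
    ]
    # Python's sort is stable, so ties keep their original index order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy data: two actives and three cluster representatives.
toy_actives = [frozenset({1, 2, 3}), frozenset({7, 8})]
toy_representatives = [
    frozenset({1, 2, 4}),
    frozenset({7, 8, 9}),
    frozenset({30, 31}),
]
toy_hits = rank_by_max_similarity(toy_actives, toy_representatives, top_n=2)
print(toy_hits)
```

The members of the clusters behind the top-ranked representatives could then be searched in a second pass, which is one way to use representatives as a pre-filter.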
>
> On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:
>
> For me, a lot of this depends on what you intend to do with the
> clustering. If you want to pick a "representative" subset from a larger
> dataset, k-means may do the trick. As Rajarshi mentioned, Practical
> Cheminformatics has a k-means implementation that runs with FAISS.
> Depending on your goal, choosing a subset with a diversity picker may fit
> the bill. One annoying aspect of diversity pickers is that the initial
> selections tend to consist of strange molecules.
>
> @Tristan can you provide more information on what you want to do with the
> clustering results?
>
> Pat
>
> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com> wrote:
>
> You could consider using FAISS. An example of clustering 2.1M compounds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:
>
> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
>
> Using BulkTanimoto is not possible due to the resources that are required. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN.
> Each of these has its own limitations: having to state the number of
> clusters for k-medoids, and not obtaining centroids for DBSCAN.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> <https://doi.org/10.1021/ci970056l>.
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
> Regards,
>
> Tristan
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> --
> Rajarshi Guha | http://blog.rguha.net | @rguha