Hello Tristan,

I imagine you have seen Greg Landrum's post on sphere exclusion clustering:
https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html

The initial MaxMin selection is very fast (especially Roger Sayle's implementation) and not memory-hungry, and it gives you maximally diverse cluster centres (or 'representatives', as you call them). [Yes, it's true that the centres can be 'weird', but if they are in your set, maybe the issue lies in the composition of the set itself, not in what you do with it; maybe you should remove singletons upfront, i.e. initially picked molecules that have a Tanimoto distance of 1, or very close to 1, from all other molecules.]

Unless I am missing something, it is surprising that you cannot use BulkTanimotoSimilarity to do the clustering. The size of the similarity matrix is not (4M)^2, but only ~4M * len(picks); so unless you picked almost everything...(?) In fact, to counteract even this potential issue, we made our own simple variation of the above code that only processes one molecule at a time, and applied it without issues to a 1M set; surely you should be able to make a temporary list of floating point numbers of length len(picks).
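To illustrate the one-molecule-at-a-time idea, here is a minimal, dependency-free sketch. It is not Greg's code and not the variation we used: fingerprints are represented as plain Python sets of 'on' bits, the function names (tanimoto, maxmin_pick, assign_to_centres) are hypothetical, and the 0.65 distance threshold is an arbitrary value chosen for the toy data. With real RDKit fingerprints you would use the RDKit picker and BulkTanimotoSimilarity against the picks instead.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fps, threshold, first=0):
    """Greedy MaxMin selection: keep adding the molecule farthest from the
    current picks until no molecule is at distance >= threshold from all of them."""
    picks = [first]
    # min_dist[i] = Tanimoto distance from molecule i to its nearest pick so far
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while True:
        cand = max(range(len(fps)), key=min_dist.__getitem__)
        # Stop when every molecule is close enough to a pick (or nothing is left)
        if min_dist[cand] < threshold or min_dist[cand] == 0.0:
            return picks
        picks.append(cand)
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fp, fps[cand])
            if d < min_dist[i]:
                min_dist[i] = d


def assign_to_centres(fps, picks):
    """Assign each molecule to its most similar centre, one molecule at a time.
    Only a temporary list of len(picks) floats exists at any moment."""
    clusters = {p: [] for p in picks}
    for i, fp in enumerate(fps):
        sims = [tanimoto(fp, fps[p]) for p in picks]
        best = max(range(len(picks)), key=sims.__getitem__)
        clusters[picks[best]].append(i)
    return clusters


# Toy data: two pairs of near-duplicates and one singleton
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12, 13},
    {10, 11, 12, 14},
    {20, 21, 22},
]
picks = maxmin_pick(fps, threshold=0.65)
clusters = assign_to_centres(fps, picks)
print(picks)
print(clusters)
```

The point of the sketch is the memory profile: the assignment loop never holds more than one row of len(picks) similarities, so the ~4M * len(picks) matrix is never materialised.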
On the other hand, as others have also commented, it's not clear why you want to do the clustering in the first place. Searching by fingerprint similarity in a set of 4M molecules should not be that slow.

As for:
"I am now working with fpsim2 and chemfp to get a distance matrix (sparse matrix)"
IMO you do not necessarily need a full N^2 distance matrix to do what you describe, as indeed shown by Greg's post. More importantly, a distance matrix for a molecular set is rarely 'sparse' (where 'sparse' in this context means mostly full of 0's, with only a few non-0 elements), because that only happens when most molecules have identical FPs, which should be very rare unless your set is very redundant.

In this sense, I would suggest an alternative clustering approach (if you are convinced that you do want to do clustering). Make a sparse feature matrix (a binary matrix encoding the 'on' FP bits of your molecules vs their index). From that, make not a distance but a (Tanimoto) similarity matrix, with a threshold below which it is capped to 0 (I can only suggest the R package proxyC for this, https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is a Python implementation, I don't know). That matrix can indeed be very sparse: only pairs of molecules with a similarity higher than your threshold have a value; the rest is all 0 and is not stored. Obviously you should not use too low a threshold, otherwise you are back to a dense (4M)^2 matrix, which I assume you cannot make or store.

From this (which is essentially an adjacency matrix, telling you which molecules are or are not 'linked') you can build graph representations and even do clustering, e.g. using igraph (again, I can only suggest an R package for that: https://igraph.org/r/, https://kateto.net/netscix2016.html).

Hope this helps.
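To make the thresholded-similarity idea above a bit more concrete, here is a small pure-Python sketch. It is a toy under stated assumptions, not a port of proxyC or igraph: fingerprints are sets of 'on' bits, the function names are hypothetical, the 0.5 threshold is arbitrary, and connected components stand in as the simplest possible graph clustering (igraph offers more refined methods). The naive pair counting would not scale to 4M molecules as written; it only shows what is stored and what is not.

```python
from collections import defaultdict


def sparse_similarity(fps, threshold):
    """Return {(i, j): sim} for pairs i < j with Tanimoto similarity >= threshold.
    Pairs sharing no 'on' bit are never visited; pairs below threshold are not stored."""
    # Inverted index: bit -> molecules with that bit on (the sparse feature matrix, by column)
    by_bit = defaultdict(list)
    for idx, fp in enumerate(fps):
        for bit in fp:
            by_bit[bit].append(idx)

    # Count shared bits for every pair of molecules that shares at least one bit
    shared = defaultdict(int)
    for members in by_bit.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                shared[(members[a], members[b])] += 1

    edges = {}
    for (i, j), c in shared.items():
        sim = c / (len(fps[i]) + len(fps[j]) - c)
        if sim >= threshold:
            edges[(i, j)] = sim
    return edges


def connected_components(n, edges):
    """Group the n molecules into connected components of the similarity graph."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy data: two pairs of near-duplicates and one unlinked molecule
fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {10, 11, 12, 13},
    {10, 11, 12, 14},
    {20, 21, 22},
]
edges = sparse_similarity(fps, threshold=0.5)
components = connected_components(len(fps), edges)
print(edges)
print(components)
```

Only the two linked pairs end up in the edge dictionary; everything else is an implicit 0, which is what makes the thresholded similarity matrix sparse where the distance matrix is not.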
Giovanni

From: Tristan Camilleri <tristan.camilleri...@um.edu.mt>
Sent: 02 May 2022 07:03
To: Patrick Walters <wpwalt...@gmail.com>
Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Clustering

Thanks for the feedback. Rather than an explicit need to perform clustering, it is more for me to learn how to do it. Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters <wpwalt...@gmail.com> wrote:

Similarity search on a database of 4 million is pretty quick with ChemFp or fpsim2. Do you need to do the clustering? Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat

On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:

Thank you both for the feedback. My primary aim is to run an LBVS experiment (similarity search) using a set of actives and the dataset of cluster representatives.

On Sun, 1 May 2022, 17:09 Patrick Walters, <wpwalt...@gmail.com> wrote:

For me, a lot of this depends on what you intend to do with the clustering. If you want to pick a "representative" subset from a larger dataset, k-means may do the trick. As Rajarshi mentioned, Practical Cheminformatics has a k-means implementation that runs with FAISS. Depending on your goal, choosing a subset with a diversity picker may fit the bill. One annoying aspect of diversity pickers is that the initial selections tend to consist of strange molecules.

@Tristan can you provide more information on what you want to do with the clustering results?

Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <rajarshi.g...@gmail.com> wrote:

You could consider using FAISS. An example of clustering 2.1M cmpds is described at http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html

On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <tristan.camilleri...@um.edu.mt> wrote:

Hi,

I am attempting to cluster a database of circa 4M small molecules and I have hit several snags. Using BulkTanimoto is not possible due to the resources that are required. I am now working with fpsim2 and chemfp to get a distance matrix (sparse matrix). However, I am finding it very challenging to identify an appropriate clustering algorithm. I have considered both k-medoids and DBSCAN. Each has its own limitation: having to state the number of clusters for k-medoids, and not obtaining centroids for DBSCAN.

I was wondering whether there is an implementation of the stochastic clustering analysis for clustering purposes, described in https://doi.org/10.1021/ci970056l .

Any suggestions on the best method for clustering large datasets, with code suggestions, would be greatly appreciated. I am new to the subject and would appreciate any help.

Regards,
Tristan
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

--
Rajarshi Guha | http://blog.rguha.net | @rguha

This e-mail and its attachment(s) (if any) may contain confidential and/or proprietary information and is intended for its addressee(s) only. Any unauthorized use of the information contained herein (including, but not limited to, alteration, reproduction, communication, distribution or any other form of dissemination) is strictly prohibited. If you are not the intended addressee, please notify the originator promptly and delete this e-mail and its attachment(s) (if any) subsequently. Neither Galapagos nor any of its affiliates shall be liable for direct, special, indirect or consequential damages arising from alteration of the contents of this message (by a third party) or as a result of a virus being passed on.