Hello Tristan,
I imagine you have seen Greg Landrum's post on sphere exclusion clustering: 
https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
The initial MaxMin selection is very fast, especially Roger Sayle's 
implementation, and not memory-hungry, and that gives you maximally diverse 
cluster centres (or 'representatives' as you call them).
[Yes, it's true that the centres can be 'weird', but if they are in your set, 
maybe the issue is in the composition of the set itself, not in what you do 
with it; maybe you should remove singletons (initially picked molecules that 
have Tanimoto distance 1 or very close to 1 from all other molecules) upfront].
Unless I am missing something, it sounds surprising that you cannot use 
BulkTanimotoSimilarity to do the clustering.
The size of the similarity matrix is not (4M)^2, but only ~4M * len(picks); so 
unless you picked almost everything...(?)
In fact to counteract even this potential issue, we made our own simple 
variation of the above code that only processes one molecule at a time, and 
applied it without issues to a 1M set; surely you should be able to make a 
temporary list of floating point numbers of length len(picks).
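
A minimal sketch of the one-molecule-at-a-time assignment described above, not the code referred to in this message. It assumes fingerprints are held as plain Python sets of 'on' bit indices and that the picks are already available as a list of indices; the names `tanimoto` and `assign_to_centres` are made up for illustration. With RDKit bit vectors, the inner list would come from DataStructs.BulkTanimotoSimilarity(fp, centre_fps) instead.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def assign_to_centres(fps: Sequence[Set[int]], picks: Sequence[int]) -> List[int]:
    """For each molecule, return the position in `picks` of its most similar centre."""
    centre_fps = [fps[i] for i in picks]
    assignment = []
    for fp in fps:
        # Temporary list of length len(picks); it is discarded after each
        # molecule, so the N x len(picks) matrix is never held in memory.
        sims = [tanimoto(fp, centre) for centre in centre_fps]
        best = max(range(len(sims)), key=sims.__getitem__)
        assignment.append(best)
    return assignment


# Toy data: five fingerprints, with molecules 0 and 2 as the picked centres.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3}]
picks = [0, 2]
clusters = assign_to_centres(fps, picks)
print(clusters)  # [0, 0, 1, 1, 0]
```

Only the final assignment (one integer per molecule) is kept, which is why the approach stays cheap in memory even for large sets.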

On the other hand, like others also commented, it's not clear why you want to 
do the clustering in the first place.
Searching by fingerprint similarity in a set of 4M molecules should not be 
that slow.

As for:
"I am now working with fpsim2 and chemfp to get a distance matrix (sparse 
matrix)"

IMO you do not necessarily need to have a full N^2 distance matrix to do what 
you describe, as indeed shown by Greg's post.
And more importantly, a distance matrix for a molecular set is rarely 'sparse' 
(where 'sparse' in this context means mostly full of 0's, and with only few 
non-0 elements), because that only happens when most molecules have identical 
FP's, which should be very rare, unless your set is very redundant.

In this sense, I would suggest an alternative clustering approach (if you are 
convinced that you do want to do clustering).
Make a sparse feature matrix (binary matrix encoding the 'on' FP bits of your 
molecules vs their index).
From that, make not a distance, but a (Tanimoto) similarity matrix, with a 
threshold below which it is capped to 0 (I can only suggest R package proxyC 
for this https://cran.r-project.org/web/packages/proxyC/index.html; maybe 
there is a python implementation, I don't know).
That matrix can indeed be very sparse: only molecules that have a similarity 
higher than your threshold have a value, the rest is all 0 and is not stored. 
Obviously you should not use too low a threshold, otherwise you are back to a 
dense (4M)^2 matrix, which I assume you cannot make or store.
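
The thresholded similarity matrix can be illustrated with a small plain-Python sketch. This is not proxyC and not a known Python port of it; the function name `sparse_tanimoto` and the dict-of-pairs storage are assumptions made for the example. It builds the feature matrix as an inverted index (bit -> molecules), counts shared bits only for pairs that share at least one bit, and stores only pairs at or above the threshold. The pair-counting step is quadratic in the number of molecules sharing a bit, so this shows the idea only and would not scale to 4M molecules as written.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def sparse_tanimoto(fps: List[Set[int]], threshold: float) -> Dict[Tuple[int, int], float]:
    """Return {(i, j): similarity} for pairs i < j with Tanimoto >= threshold."""
    # Sparse feature matrix stored by column: bit -> indices of molecules with that bit.
    bit_to_mols: Dict[int, List[int]] = defaultdict(list)
    for idx, fp in enumerate(fps):
        for bit in fp:
            bit_to_mols[bit].append(idx)

    # Count shared 'on' bits, but only for pairs that share at least one bit.
    shared: Dict[Tuple[int, int], int] = defaultdict(int)
    for mols in bit_to_mols.values():
        for a in range(len(mols)):
            for b in range(a + 1, len(mols)):
                shared[(mols[a], mols[b])] += 1

    # Convert counts to Tanimoto and keep only values at or above the threshold;
    # everything else is an implicit 0 and is never stored.
    sims: Dict[Tuple[int, int], float] = {}
    for (i, j), common in shared.items():
        sim = common / (len(fps[i]) + len(fps[j]) - common)
        if sim >= threshold:
            sims[(i, j)] = sim
    return sims


fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {1, 2, 3}]
edges = sparse_tanimoto(fps, threshold=0.5)
print(sorted(edges))  # [(0, 1), (0, 4), (1, 4), (2, 3)]
```

Of the ten possible pairs in the toy set, only four are stored; raising the threshold removes more of them.
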
From this (which is essentially an adjacency matrix - telling you which 
molecules are or are not 'linked') you can do graph representations and even 
clustering, e.g. using igraph (again, I can only suggest an R package for that 
https://igraph.org/r/, https://kateto.net/netscix2016.html).
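
As a hedged illustration of clustering from such an adjacency matrix, the sketch below takes the simplest possible route: connected components found with a union-find structure. This is a swapped-in technique, not what igraph's community-detection methods do; it is equivalent to single-linkage clustering cut at the similarity threshold. The function name `connected_components` and the edge-list input are assumptions for the example.

```python
from typing import Dict, Iterable, List, Tuple


def connected_components(n: int, edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group nodes 0..n-1 into connected components of the graph defined by `edges`."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two components joined by each edge.
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    groups: Dict[int, List[int]] = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Edges are the pairs with similarity above the threshold; node 5 has no
# neighbour above the threshold and ends up as a singleton.
edges = [(0, 1), (0, 4), (1, 4), (2, 3)]
components = connected_components(6, edges)
print(components)  # [[0, 1, 4], [2, 3], [5]]
```

Connected components tend to chain loosely related molecules together at low thresholds, which is one reason to look at the graph clustering methods mentioned above for real data.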

Hope this helps.
Giovanni

From: Tristan Camilleri <tristan.camilleri...@um.edu.mt>
Sent: 02 May 2022 07:03
To: Patrick Walters <wpwalt...@gmail.com>
Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
Subject: Re: [Rdkit-discuss] Clustering

Thanks for the feedback. Rather than an explicit need to perform clustering, it 
is more for me to learn how to do it.

Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters 
<wpwalt...@gmail.com> wrote:
Similarity search on a database of 4 million is pretty quick with ChemFp or 
fpsim2.  Do you need to do the clustering?

Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat



On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri 
<tristan.camilleri...@um.edu.mt> wrote:
Thank you both for the feedback.

My primary aim is to run an LBVS experiment (similarity search) using a set of 
actives and the dataset of cluster representatives.



On Sun, 1 May 2022, 17:09 Patrick Walters, 
<wpwalt...@gmail.com> wrote:
For me, a lot of this depends on what you intend to do with the clustering.  If 
you want to pick a "representative" subset from a larger dataset, k-means may 
do the trick.  As Rajarshi mentioned, Practical Cheminformatics has a k-means 
implementation that runs with FAISS. Depending on your goal, choosing a subset 
with a diversity picker may fit the bill.  One annoying aspect of diversity 
pickers is that the initial selections tend to consist of strange molecules.

@Tristan can you provide more information on what you want to do with the 
clustering results?


Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha 
<rajarshi.g...@gmail.com> wrote:
You could consider using FAISS. An example of clustering 2.1M cmpds is 
described at 
http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html


On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri 
<tristan.camilleri...@um.edu.mt> wrote:
Hi,

I am attempting to cluster a database of circa 4M small molecules and I have 
hit several snags.
Using BulkTanimoto is not possible due to the resources that are required. I am now 
working with fpsim2 and chemfp to get a distance matrix (sparse matrix). 
However, I am finding it very challenging to identify an appropriate clustering 
algorithm. I have considered both k-medoids and DBSCAN. Each of these has its 
own limitations, stating the number of clusters for k-medoids and not obtaining 
centroids for DBSCAN.

I was wondering whether there is an implementation of the stochastic clustering 
analysis for clustering purposes, described in 
https://doi.org/10.1021/ci970056l.

Any suggestions on the best method for clustering large datasets, with code 
suggestions, would be greatly appreciated. I am new to the subject and would 
appreciate any help.

Regards,
Tristan


_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


--
Rajarshi Guha | http://blog.rguha.net | @rguha