Re: [Rdkit-discuss] Clustering

Tristan Camilleri Mon, 02 May 2022 01:28:06 -0700

Dear Giovanni,

Many thanks for the feedback, which is very insightful. I had skipped Greg
Landrum’s post on sphere exclusion clustering; I will certainly give it a
try.


Regards,
Tristan

On Mon, 2 May 2022 at 09:45, Giovanni Tricarico <[email protected]>
wrote:

> Hello Tristan,
>
> I imagine you have seen Greg Landrum’s post on sphere exclusion
> clustering:
> https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
>
> The initial MaxMin selection is very fast, especially Roger Sayle’s
> implementation, and not memory-hungry, and that gives you maximally diverse
> cluster centres (or ‘representatives’ as you call them).
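
For anyone who has not met MaxMin before, here is a minimal pure-Python
sketch of the greedy idea. It is not Roger Sayle's implementation and not
RDKit's API: fingerprints are assumed to be plain sets of 'on' bit indices,
and `tanimoto`, `maxmin_pick` and the toy data are names made up for
illustration. It shows the logic only, not the speed.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fps, n_picks, first=0):
    """Greedy MaxMin: keep picking the item farthest from all picks so far."""
    picks = [first]
    # min_dist[i] is the distance from item i to its nearest pick so far.
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picks) < min(n_picks, len(fps)):
        # Next pick: the item whose nearest pick is farthest away.
        nxt = max(range(len(fps)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # everything left duplicates an existing pick
        picks.append(nxt)
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picks


# Toy data (hypothetical): two pairs of close analogues and one outlier.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}, {20, 21}]
toy_picks = maxmin_pick(toy_fps, 3)
```

Only one running list of minimum distances is kept, which is why the
selection itself does not need an N^2 matrix.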
>
> [Yes, it’s true that the centres can be ‘weird’, but if they *are* in
> your set, maybe the issue is in the composition of the set itself, not in
> what you do with it; maybe you should remove singletons (initially picked
> molecules that have Tanimoto distance 1 or very close to 1 from all other
> molecules) upfront].
>
> Unless I am missing something, it sounds surprising that you cannot use
> BulkTanimotoSimilarity to do the clustering.
>
> The size of the similarity matrix is not (4M)^2, but only ~4M *
> len(picks); so unless you picked almost everything…(?)
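
To put rough numbers on that, a back-of-the-envelope calculation. The
number of picks (10,000) and the 8 bytes per stored value are assumptions
chosen purely for illustration; they are not figures from this thread.

```python
n_mols = 4_000_000
n_picks = 10_000        # hypothetical number of cluster centres
bytes_per_value = 8     # assuming one 64-bit float per similarity

# Full all-against-all matrix: (4M)^2 values.
full_matrix_bytes = n_mols ** 2 * bytes_per_value

# Molecules-against-picks matrix: 4M * len(picks) values.
picks_matrix_bytes = n_mols * n_picks * bytes_per_value

print(f"full matrix : {full_matrix_bytes / 1e12:.0f} TB")
print(f"vs picks    : {picks_matrix_bytes / 1e9:.0f} GB")
```

Under these assumptions the full matrix would be about 128 TB and the
molecules-against-picks matrix about 320 GB, so how comfortable the second
one is depends a lot on how many picks you made.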
>
> In fact to counteract even this potential issue, we made our own simple
> variation of the above code that only processes one molecule at a time, and
> applied it without issues to a 1M set; surely you should be able to make a
> temporary list of floating point numbers of length len(picks).
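
A minimal sketch of what such a one-molecule-at-a-time assignment could
look like, again in pure Python with fingerprints as sets of 'on' bits.
This is not the code mentioned above (which was not posted); the function
name `assign_to_picks`, the similarity threshold and the toy data are all
assumptions for illustration.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def assign_to_picks(fps, picks, threshold):
    """Assign each molecule to its most similar pick, one molecule at a time.

    Molecules whose best similarity is below `threshold` are left unassigned.
    """
    clusters = {p: [] for p in picks}
    unassigned = []
    for i, fp in enumerate(fps):
        # Temporary list of len(picks) floats; nothing larger is ever held.
        sims = [tanimoto(fp, fps[p]) for p in picks]
        best = max(range(len(picks)), key=lambda k: sims[k])
        if sims[best] >= threshold:
            clusters[picks[best]].append(i)
        else:
            unassigned.append(i)
    return clusters, unassigned


# Toy data (hypothetical); the picks are taken as given here.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}, {20, 21}]
loose_clusters, loose_rest = assign_to_picks(toy_fps, [0, 2, 4], 0.4)
tight_clusters, tight_rest = assign_to_picks(toy_fps, [0, 2, 4], 0.6)
```

With RDKit bit vectors the inner list comprehension is where a call to
BulkTanimotoSimilarity against the list of pick fingerprints would go.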
>
>
>
> On the other hand, like others also commented, it’s not clear why you want
> to do the clustering in the first place.
>
> Searching by fingerprint similarity in a set of 4 M molecules should not
> be that slow.
>
>
>
> *As for:*
>
> *“I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix)”*
>
>
>
> IMO you do not necessarily need to have a full N^2 distance matrix to do
> what you describe, as indeed shown by Greg’s post.
>
> And more importantly, a distance matrix for a molecular set is rarely
> ‘sparse’ (where ‘sparse’ in this context means mostly full of 0’s, and with
> only few non-0 elements), because that only happens when most molecules
> have identical FP’s, which should be very rare, unless your set is very
> redundant.
>
>
>
> In this sense, I would suggest an alternative clustering approach (if you
> are convinced that you *do* want to do clustering).
>
> Make a sparse feature matrix (binary matrix encoding the ‘on’ FP bits of
> your molecules vs their index).
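
As an illustration of what such a sparse feature matrix holds, here is a
small pure-Python sketch that stores only the positions of the 'on' bits in
CSR-style arrays. The helper names `to_csr` and `row_bits` and the 16-bit
toy fingerprints are assumptions made up for this example.

```python
def to_csr(fps, n_bits):
    """Encode lists of 'on' bits as CSR arrays (indptr, indices).

    Every stored value is implicitly 1, so no separate data array is kept.
    """
    indptr, indices = [0], []
    for bits in fps:
        row = sorted(set(bits))
        if row and (row[0] < 0 or row[-1] >= n_bits):
            raise ValueError("bit index out of range")
        indices.extend(row)
        indptr.append(len(indices))
    return indptr, indices


def row_bits(indptr, indices, i):
    """Return the 'on' bits of molecule i from the CSR arrays."""
    return indices[indptr[i]:indptr[i + 1]]


# Toy data (hypothetical): three molecules, 16-bit fingerprints.
toy_fps = [[1, 2, 3], [1, 2, 4], [7, 8, 9]]
indptr, indices = to_csr(toy_fps, n_bits=16)

# Fraction of the full molecules-by-bits matrix that is actually stored.
density = len(indices) / (len(toy_fps) * 16)
```

The same two arrays, plus an array of ones, are the layout that
scipy.sparse.csr_matrix accepts, should you want to hand this to SciPy.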
>
> From that, make not a *distance*, but a (Tanimoto) *similarity* matrix,
> with a threshold below which it is capped to 0 (I can only suggest R
> package proxyC for this
> https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is
> a python implementation, I don’t know).
>
> That matrix can indeed be very sparse: only molecules that have a
> similarity higher than your threshold have a value, the rest is all 0 and
> is not stored. Obviously you should not use too low a threshold, otherwise
> you are back to a dense (4M)^2 matrix, which I assume you cannot make or
> store.
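
I do not know how proxyC does this internally, so the following is only a
sketch of the idea in pure Python: count shared bits through an inverted
index (bit -> molecules), then keep a pair only if its Tanimoto similarity
reaches the threshold. `sparse_tanimoto` and the toy data are hypothetical
names, and fingerprints are assumed to be sets of 'on' bits.

```python
from collections import defaultdict


def sparse_tanimoto(fps, threshold):
    """Return {(i, j): similarity} for i < j with similarity >= threshold.

    Pairs that share no bits are never visited, and pairs below the
    threshold are not stored.
    """
    postings = defaultdict(list)  # bit -> molecules that have it 'on'
    for i, fp in enumerate(fps):
        for bit in fp:
            postings[bit].append(i)

    shared = defaultdict(int)  # (i, j) -> number of bits in common
    for mols in postings.values():
        for a in range(len(mols)):
            for b in range(a + 1, len(mols)):
                shared[(mols[a], mols[b])] += 1

    sims = {}
    for (i, j), c in shared.items():
        s = c / (len(fps[i]) + len(fps[j]) - c)
        if s >= threshold:
            sims[(i, j)] = s
    return sims


# Toy data (hypothetical).
toy_fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {7, 8, 9}, {7, 8, 10}]
toy_sims = sparse_tanimoto(toy_fps, 0.7)
```

Note that bits set in a large share of the molecules produce very long
postings lists, so a naive version like this would not scale to 4M
molecules as written; it is meant to show what ends up stored, not how to
compute it fast.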
>
> From this (which is essentially an *adjacency* matrix - telling you which
> molecules are or are not ‘linked’) you can do graph representations and
> even clustering, e.g. using igraph (again, I can only suggest an R package
> for that https://igraph.org/r/, https://kateto.net/netscix2016.html).
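
The simplest possible clustering on such an adjacency matrix is to take
the connected components of the graph. Below is a pure-Python union-find
sketch of that; it is not igraph's API, and igraph offers much richer
methods than this. The function name and the toy edge list are assumptions
for illustration.

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components given (i, j) edges."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root of the merged component.
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy graph (hypothetical): edges are pairs above the similarity threshold.
toy_clusters = connected_components(6, [(0, 2), (1, 2), (3, 4)])
```

Molecules with no neighbour above the threshold come out as singletons
(node 5 in the toy graph). Connected components can chain dissimilar
molecules together through intermediates, so treat this as a starting
point only.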
>
>
>
> Hope this helps.
>
> Giovanni
>
>
>
> *From:* Tristan Camilleri <[email protected]>
> *Sent:* 02 May 2022 07:03
> *To:* Patrick Walters <[email protected]>
> *Cc:* RDKit Discuss <[email protected]>
> *Subject:* Re: [Rdkit-discuss] Clustering
>
>
>
>
> Thanks for the feedback. Rather than an explicit need to perform
> clustering, it is more for me to learn how to do it.
>
>
>
> Any pointers to this effect would be greatly appreciated.
>
>
>
> Tristan
>
>
>
> On Sun, 1 May 2022 at 18:18, Patrick Walters <[email protected]> wrote:
>
> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
>
>
>
> Here are a couple of relevant blog posts.
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
>
>
>
> Pat
>
>
>
>
>
>
>
> On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
> [email protected]> wrote:
>
> Thank you both for the feedback.
>
>
>
> My primary aim is to run an LBVS experiment (similarity search) using a
> set of actives and the dataset of cluster representatives.
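
A minimal sketch of such a similarity screen, assuming fingerprints as
sets of 'on' bits and assuming each library molecule is scored by its
maximum Tanimoto similarity to any active (one common choice, not
something stated in this thread). `screen` and the toy data are
hypothetical names.

```python
import heapq


def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def screen(actives, library, top_k):
    """Return the top_k (score, library_index) pairs, best first.

    The score of a library molecule is its highest similarity to any active.
    """
    scored = (
        (max(tanimoto(fp, a) for a in actives), i)
        for i, fp in enumerate(library)
    )
    return heapq.nlargest(top_k, scored)


# Toy data (hypothetical): one active, three library molecules.
toy_actives = [{1, 2, 3}]
toy_library = [{1, 2, 4}, {7, 8, 9}, {1, 2, 3, 5}]
toy_hits = screen(toy_actives, toy_library, top_k=2)
```

The library here could equally be the full set or only the cluster
representatives; the code does not care which.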
>
>
>
>
>
>
>
> On Sun, 1 May 2022, 17:09 Patrick Walters, <[email protected]> wrote:
>
> For me, a lot of this depends on what you intend to do with the
> clustering.  If you want to pick a "representative" subset from a larger
> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
> Cheminformatics has a k-means implementation that runs with FAISS.
> Depending on your goal, choosing a subset with a diversity picker may fit
> the bill.  One annoying aspect of diversity pickers is that the initial
> selections tend to consist of strange molecules.
>
>
>
> @Tristan can you provide more information on what you want to do with the
> clustering results?
>
>
>
>
>
> Pat
>
>
>
> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha <[email protected]>
> wrote:
>
> You could consider using FAISS. An example of clustering 2.1M cmpds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
>
>
>
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
> [email protected]> wrote:
>
> Hi,
>
>
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
>
> Using BulkTanimoto is not possible due to the resources it requires. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each has its own limitation: k-medoids needs the number of
> clusters stated up front, and DBSCAN does not return centroids.
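
On the missing centroids: one way around it, sketched below under stated
assumptions, is to compute a medoid for each cluster after the fact, i.e.
the member with the highest summed similarity to the other members. The
cluster membership is taken as given (from DBSCAN or anything else),
fingerprints are assumed to be sets of 'on' bits, and `medoid` is a
hypothetical helper name.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def medoid(members, fps):
    """Return the member index with the highest total similarity to the rest."""
    return max(
        members,
        key=lambda i: sum(tanimoto(fps[i], fps[j]) for j in members),
    )


# Toy cluster (hypothetical): molecule 2 overlaps most with the other two.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}]
toy_medoid = medoid([0, 1, 2], toy_fps)
```

This is quadratic in the size of each cluster, so it is only reasonable
when individual clusters are small compared with the whole set.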
>
>
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l
> .
>
>
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
>
>
> Regards,
>
> Tristan
>
>
>
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
>
>
> --
>
> Rajarshi Guha | http://blog.rguha.net | @rguha
>
>
>

