Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Tristan Camilleri
Dear Giovanni,

Many thanks for the feedback, which is very insightful. I had skipped Greg
Landrum’s post on sphere exclusion clustering; I will certainly give it a
try.

Regards,
Tristan

On Mon, 2 May 2022 at 09:45, Giovanni Tricarico wrote:

> Hello Tristan,
>
> I imagine you have seen Greg Landrum’s post on sphere exclusion
> clustering:
> https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
>
> The initial MaxMin selection is very fast, especially Roger Sayle’s
> implementation, and not memory-hungry, and that gives you maximally diverse
> cluster centres (or ‘representatives’ as you call them).
>
> [Yes, it’s true that the centres can be ‘weird’, but if they *are* in
> your set, maybe the issue is in the composition of the set itself, not in
> what you do with it; maybe you should remove singletons (initially picked
> molecules that have Tanimoto distance 1 or very close to 1 from all other
> molecules) upfront].
>
> Unless I am missing something, it sounds surprising that you cannot use
> BulkTanimotoSimilarity to do the clustering.
>
> The size of the similarity matrix is not (4M)^2, but only ~4M *
> len(picks); so unless you picked almost everything…(?)
>
> In fact, to counteract even this potential issue, we made our own simple
> variation of the above code that only processes one molecule at a time, and
> applied it without issues to a 1M set; surely you should be able to make a
> temporary list of floating point numbers of length len(picks).
>
>
>
> On the other hand, like others also commented, it’s not clear why you want
> to do the clustering in the first place.
>
> Searching by fingerprint similarity in a set of 4 M molecules should not
> be that slow.
>
>
>
> *As for:*
>
> *“I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix)”*
>
>
>
> IMO you do not necessarily need to have a full N^2 distance matrix to do
> what you describe, as indeed shown by Greg’s post.
>
> And more importantly, a distance matrix for a molecular set is rarely
> ‘sparse’ (where ‘sparse’ in this context means mostly full of 0’s, and with
> only few non-0 elements), because that only happens when most molecules
> have identical FP’s, which should be very rare, unless your set is very
> redundant.
>
>
>
> In this sense, I would suggest an alternative clustering approach (if you
> are convinced that you *do* want to do clustering).
>
> Make a sparse feature matrix (binary matrix encoding the ‘on’ FP bits of
> your molecules vs their index).
>
> From that, make not a *distance*, but a (Tanimoto) *similarity* matrix,
> with a threshold below which it is capped to 0 (I can only suggest R
> package proxyC for this
> https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is
> a python implementation, I don’t know).
>
> That matrix can indeed be very sparse: only molecules that have a
> similarity higher than your threshold have a value, the rest is all 0 and
> is not stored. Obviously you should not use too low a threshold, otherwise
> you are back to a dense (4M)^2 matrix, which I assume you cannot make or
> store.
>
> From this (which is essentially an *adjacency* matrix - telling you which
> molecules are or are not ‘linked’) you can do graph representations and
> even clustering, e.g. using igraph (again, I can only suggest an R package
> for that https://igraph.org/r/, https://kateto.net/netscix2016.html).
>
>
>
> Hope this helps.
>
> Giovanni
>
>
>
> *From:* Tristan Camilleri 
> *Sent:* 02 May 2022 07:03
> *To:* Patrick Walters 
> *Cc:* RDKit Discuss 
> *Subject:* Re: [Rdkit-discuss] Clustering
>
>
>
>
> Thanks for the feedback. Rather than an explicit need to perform
> clustering, it is more for me to learn how to do it.
>
>
>
> Any pointers to this effect would be greatly appreciated.
>
>
>
> Tristan
>
>
>
> On Sun, 1 May 2022 at 18:18, Patrick Walters  wrote:
>
> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
>
>
>
> Here are a couple of relevant blog posts.
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
> 
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
> 

Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Giovanni Tricarico
Hello Tristan,
I imagine you have seen Greg Landrum's post on sphere exclusion clustering: 
https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
The initial MaxMin selection is very fast, especially Roger Sayle's 
implementation, and not memory-hungry, and that gives you maximally diverse 
cluster centres (or 'representatives' as you call them).
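
To make the MaxMin idea concrete, here is a minimal pure-Python sketch. It is a naive version written for illustration, not Roger Sayle's optimised implementation (in RDKit that one is reached through MaxMinPicker.LazyBitVectorPick). Fingerprints are modelled as plain sets of 'on' bit indices, and the names tanimoto, maxmin_pick and the toy data are assumptions made up for this example. The point it shows is that only one float per molecule (the distance to the nearest pick so far) has to be kept in memory.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def maxmin_pick(fps, n_picks, first=0):
    """Naive MaxMin: repeatedly pick the item farthest from all picks so far."""
    picks = [first]
    # min_dist[i] = Tanimoto distance from item i to its nearest pick so far
    min_dist = [1.0 - tanimoto(fps[first], fp) for fp in fps]
    while len(picks) < min(n_picks, len(fps)):
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # every remaining item duplicates an existing pick
        picks.append(nxt)
        for i, fp in enumerate(fps):
            d = 1.0 - tanimoto(fps[nxt], fp)
            if d < min_dist[i]:
                min_dist[i] = d
    return picks


# Toy data (hypothetical): three obvious groups of fingerprints
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
print(maxmin_pick(fps, 3))
```

On this toy set the picks land in three different groups, which is the 'maximally diverse centres' behaviour described above.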
[Yes, it's true that the centres can be 'weird', but if they are in your set, 
maybe the issue is in the composition of the set itself, not in what you do 
with it; maybe you should remove singletons (initially picked molecules that 
have Tanimoto distance 1 or very close to 1 from all other molecules) upfront].
Unless I am missing something, it sounds surprising that you cannot use 
BulkTanimotoSimilarity to do the clustering.
The size of the similarity matrix is not (4M)^2, but only ~4M * len(picks); so 
unless you picked almost everything...(?)
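
As a rough back-of-the-envelope check of those sizes, the sketch below assumes 8-byte floats, dense storage, and a purely hypothetical number of picks (10,000); none of these numbers come from the original question except the 4M molecules.

```python
N_MOLECULES = 4_000_000
N_PICKS = 10_000        # hypothetical number of cluster centres
BYTES_PER_FLOAT = 8     # assuming 64-bit floats


def dense_bytes(rows, cols, itemsize=BYTES_PER_FLOAT):
    """Bytes needed to store a dense rows x cols matrix of floats."""
    return rows * cols * itemsize


full_matrix = dense_bytes(N_MOLECULES, N_MOLECULES)   # (4M)^2
vs_picks = dense_bytes(N_MOLECULES, N_PICKS)          # 4M * len(picks)
one_molecule = dense_bytes(1, N_PICKS)                # one row at a time

print(f"full (4M)^2 matrix : {full_matrix / 1e12:.0f} TB")
print(f"4M x len(picks)    : {vs_picks / 1e9:.0f} GB")
print(f"one molecule's row : {one_molecule / 1e3:.0f} kB")
```

Under these assumptions the molecules-by-picks matrix is several orders of magnitude smaller than the full one, and a single row is trivially small.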
In fact, to counteract even this potential issue, we made our own simple 
variation of the above code that only processes one molecule at a time, and 
applied it without issues to a 1M set; surely you should be able to make a 
temporary list of floating point numbers of length len(picks).
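
A minimal sketch of such a one-molecule-at-a-time assignment could look like the following. This is not our actual code: fingerprints are modelled as sets of 'on' bits so that the example runs without RDKit, and assign_to_centres, the threshold value and the toy data are all assumptions for illustration. With RDKit fingerprints, the inner list would be what DataStructs.BulkTanimotoSimilarity(fp, pick_fps) returns.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen for this sketch
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def assign_to_centres(fps, picks, threshold):
    """Assign each molecule to its most similar centre, one molecule at a time."""
    clusters = {p: [] for p in picks}
    unassigned = []
    for i, fp in enumerate(fps):
        # Temporary list of length len(picks); discarded after this iteration
        sims = [tanimoto(fp, fps[p]) for p in picks]
        best = max(range(len(picks)), key=sims.__getitem__)
        if sims[best] >= threshold:
            clusters[picks[best]].append(i)
        else:
            unassigned.append(i)  # not similar enough to any centre
    return clusters, unassigned


# Toy data (hypothetical); the picks are taken as given here
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}, {30, 31}]
picks = [0, 2, 4]
clusters, unassigned = assign_to_centres(fps, picks, threshold=0.5)
print(clusters, unassigned)
```

Memory use is governed by len(picks), not by the number of molecules, because nothing but the cluster membership survives each loop iteration.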

On the other hand, like others also commented, it's not clear why you want to 
do the clustering in the first place.
Searching by fingerprint similarity in a set of 4 M molecules should not be 
that slow.

As for:
"I am now working with fpsim2 and chemfp to get a distance matrix (sparse 
matrix)"

IMO you do not necessarily need to have a full N^2 distance matrix to do what 
you describe, as indeed shown by Greg's post.
And more importantly, a distance matrix for a molecular set is rarely 'sparse' 
(where 'sparse' in this context means mostly full of 0's, and with only few 
non-0 elements), because that only happens when most molecules have identical 
FP's, which should be very rare, unless your set is very redundant.

In this sense, I would suggest an alternative clustering approach (if you are 
convinced that you do want to do clustering).
Make a sparse feature matrix (binary matrix encoding the 'on' FP bits of your 
molecules vs their index).
From that, make not a distance, but a (Tanimoto) similarity matrix, with a 
threshold below which it is capped to 0 (I can only suggest R package proxyC 
for this https://cran.r-project.org/web/packages/proxyC/index.html; maybe 
there is a python implementation, I don't know).
That matrix can indeed be very sparse: only molecules that have a similarity 
higher than your threshold have a value, the rest is all 0 and is not stored. 
Obviously you should not use too low a threshold, otherwise you are back to a 
dense (4M)^2 matrix, which I assume you cannot make or store.
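
To illustrate the idea in Python (a hand-rolled sketch, not a port of proxyC and not a reference to any existing Python package), the sparse feature matrix can be represented as an inverted index from bit to molecules, and the thresholded similarity matrix as a dictionary holding only the pairs that pass. The function name sparse_tanimoto and the toy data are assumptions; the threshold must be greater than 0, because pairs sharing no bits are never visited.

```python
from collections import defaultdict


def sparse_tanimoto(fps, threshold):
    """Return {(i, j): similarity} for pairs i < j with Tanimoto >= threshold."""
    # Sparse feature matrix stored column-wise: bit -> molecules with that bit on
    postings = defaultdict(list)
    for idx, fp in enumerate(fps):
        for bit in fp:
            postings[bit].append(idx)

    sims = {}
    for i, fp in enumerate(fps):
        # Count shared bits, visiting only molecules that share at least one bit
        shared = defaultdict(int)
        for bit in fp:
            for j in postings[bit]:
                if j > i:
                    shared[j] += 1
        for j, common in shared.items():
            sim = common / (len(fp) + len(fps[j]) - common)
            if sim >= threshold:
                sims[(i, j)] = sim  # everything else is an implicit 0
    return sims


# Toy data (hypothetical): fingerprints as sets of 'on' bit indices
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
sims = sparse_tanimoto(fps, threshold=0.5)
print(sims)
```

Only two of the ten possible pairs are stored for this toy set; lowering the threshold stores more of them, which is the trade-off mentioned above.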
From this (which is essentially an adjacency matrix - telling you which 
molecules are or are not 'linked') you can do graph representations and even 
clustering, e.g. using igraph (again, I can only suggest an R package for that 
https://igraph.org/r/, https://kateto.net/netscix2016.html).
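
As one very simple instance of clustering on such an adjacency structure, the sketch below groups molecules into connected components with a union-find. This is a swapped-in, minimal technique chosen for illustration; igraph offers this and considerably more refined community detection methods. The function name connected_components and the edge list are assumptions for the example.

```python
def connected_components(n_nodes, edges):
    """Group nodes 0..n_nodes-1 into connected components (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # merge the two components

    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical 'linked' pairs, i.e. non-zero entries of the adjacency matrix
edges = [(0, 1), (2, 3)]
components = connected_components(5, edges)
print(components)
```

Note that connected components behave like single-linkage clustering: one borderline link is enough to merge two otherwise separate groups, so the similarity threshold matters a lot.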

Hope this helps.
Giovanni

From: Tristan Camilleri 
Sent: 02 May 2022 07:03
To: Patrick Walters 
Cc: RDKit Discuss 
Subject: Re: [Rdkit-discuss] Clustering

Thanks for the feedback. Rather than an explicit need to perform clustering, it 
is more for me to learn how to do it.

Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters wrote:
Similarity search on a database of 4 million is pretty quick with ChemFp or 
fpsim2.  Do you need to do the clustering?

Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat



On Sun, May 1,