Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thank you both for the feedback.

My primary aim is to run an LBVS experiment (similarity search) using a set
of actives and the dataset of cluster representatives.



On Sun, 1 May 2022, 17:09 Patrick Walters wrote:

> For me, a lot of this depends on what you intend to do with the
> clustering.  If you want to pick a "representative" subset from a larger
> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
> Cheminformatics has a k-means implementation that runs with FAISS.
> Depending on your goal, choosing a subset with a diversity picker may fit
> the bill.  One annoying aspect of diversity pickers is that the initial
> selections tend to consist of strange molecules.
>
> @Tristan can you provide more information on what you want to do with the
> clustering results?
>
>
> Pat
>
> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha wrote:
>
>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>> described at
>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>
>>
>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>> tristan.camilleri...@um.edu.mt> wrote:
>>
>>> Hi,
>>>
>>> I am attempting to cluster a database of circa 4M small molecules and I
>>> have hit several snags.
>>> Using BulkTanimoto is not possible due to the resources required. I
>>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>>> matrix). However, I am finding it very challenging to identify an
>>> appropriate clustering algorithm. I have considered both k-medoids and
>>> DBSCAN. Each of these has its own limitations, stating the number of
>>> clusters for k-medoids and not obtaining centroids for DBSCAN.
>>>
>>> I was wondering whether there is an implementation of the stochastic
>>> clustering analysis for clustering purposes, described in
>>> https://doi.org/10.1021/ci970056l .
>>>
>>> Any suggestions on the best method for clustering large datasets, with
>>> code suggestions, would be greatly appreciated. I am new to the subject and
>>> would appreciate any help.
>>>
>>> Regards,
>>> Tristan
>>>
>>>
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>>
>> --
>> Rajarshi Guha | http://blog.rguha.net | @rguha
>> <https://twitter.com/rguha>
>>
>>
>


[Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Hi,

I am attempting to cluster a database of circa 4M small molecules and I
have hit several snags.
Using BulkTanimoto is not possible due to the resources required. I am
now working with fpsim2 and chemfp to get a distance matrix (sparse
matrix). However, I am finding it very challenging to identify an
appropriate clustering algorithm. I have considered both k-medoids and
DBSCAN. Each has its own limitation: k-medoids needs the number of
clusters stated up front, and DBSCAN does not yield centroids.

I was wondering whether there is an implementation of the stochastic
clustering analysis for clustering purposes, described in
https://doi.org/10.1021/ci970056l .

Any suggestions on the best method for clustering large datasets, with code
suggestions, would be greatly appreciated. I am new to the subject and
would appreciate any help.

Regards,
Tristan


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thanks for the feedback. I do not have an explicit need to perform
clustering; it is more that I want to learn how to do it.

Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters wrote:

> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
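
One reason such a search can be quick is that most of the database never needs to be compared at all. Two fingerprints with a and b on-bits share at most min(a, b) bits and their union has at least max(a, b), so their Tanimoto similarity cannot exceed min(a, b) / max(a, b). The following is a minimal pure-Python sketch of that pruning idea, not the internals of ChemFp or fpsim2. It assumes fingerprints are stored as sets of on-bit indices, and the names `build_index` and `search` are made up for illustration.

```python
from bisect import bisect_left, bisect_right


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_index(fps):
    """Sort fingerprint ids by popcount so a query can skip impossible sizes."""
    order = sorted(range(len(fps)), key=lambda i: len(fps[i]))
    counts = [len(fps[i]) for i in order]
    return order, counts


def search(query, fps, index, threshold):
    """Return (id, similarity) pairs with similarity >= threshold, best first.

    threshold must be greater than 0.
    """
    order, counts = index
    q = len(query)
    eps = 1e-9  # guard against float rounding at the window edges
    # Tanimoto >= t is only possible when t*q <= popcount <= q/t.
    lo = bisect_left(counts, threshold * q - eps)
    hi = bisect_right(counts, q / threshold + eps)
    hits = []
    for i in order[lo:hi]:
        s = tanimoto(query, fps[i])
        if s >= threshold:
            hits.append((i, s))
    hits.sort(key=lambda h: -h[1])
    return hits


# Toy database (hypothetical data, for illustration only).
fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {1, 2}]
index = build_index(fps)
hits = search({1, 2, 3}, fps, index, threshold=0.6)
```

The popcount window only narrows the candidates; every hit is still checked with the real Tanimoto value, so the pruning cannot change the result.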
>
> Here are a couple of relevant blog posts.
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
>
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
>
> Pat
>
>
>
> On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Thank you both for the feedback.
>>
>> My primary aim is to run an LBVS experiment (similarity search) using a
>> set of actives and the dataset of cluster representatives.
>>
>>
>>
>> On Sun, 1 May 2022, 17:09 Patrick Walters wrote:
>>
>>> For me, a lot of this depends on what you intend to do with the
>>> clustering.  If you want to pick a "representative" subset from a larger
>>> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
>>> Cheminformatics has a k-means implementation that runs with FAISS.
>>> Depending on your goal, choosing a subset with a diversity picker may fit
>>> the bill.  One annoying aspect of diversity pickers is that the initial
>>> selections tend to consist of strange molecules.
>>>
>>> @Tristan can you provide more information on what you want to do with
>>> the clustering results?
>>>
>>>
>>> Pat
>>>
>>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha wrote:
>>>
>>>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>>>> described at
>>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>>
>>>>
>>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>>>> tristan.camilleri...@um.edu.mt> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am attempting to cluster a database of circa 4M small molecules and
>>>>> I have hit several snags.
>>>>> Using BulkTanimoto is not possible due to the resources required.
>>>>> I am now working with fpsim2 and chemfp to get a distance matrix (sparse
>>>>> matrix). However, I am finding it very challenging to identify an
>>>>> appropriate clustering algorithm. I have considered both k-medoids and
>>>>> DBSCAN. Each of these has its own limitations, stating the number of
>>>>> clusters for k-medoids and not obtaining centroids for DBSCAN.
>>>>>
>>>>> I was wondering whether there is an implementation of the stochastic
>>>>> clustering analysis for clustering purposes, described in
>>>>> https://doi.org/10.1021/ci970056l .
>>>>>
>>>>> Any suggestions on the best method for clustering large datasets, with
>>>>> code suggestions, would be greatly appreciated. I am new to the subject
>>>>> and would appreciate any help.
>>>>>
>>>>> Regards,
>>>>> Tristan
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Rajarshi Guha | http://blog.rguha.net | @rguha
>>>> <https://twitter.com/rguha>
>>>>
>>>>
>>>


Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Tristan Camilleri
Dear Giovanni,

Many thanks for the feedback, which is very insightful. I had skipped Greg
Landrum’s post on sphere exclusion clustering; I will certainly give it a
try.

Regards,
Tristan

On Mon, 2 May 2022 at 09:45, Giovanni Tricarico wrote:

> Hello Tristan,
>
> I imagine you have seen Greg Landrum’s post on sphere exclusion
> clustering:
> https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
>
> The initial MaxMin selection is very fast, especially Roger Sayle’s
> implementation, and not memory-hungry, and that gives you maximally diverse
> cluster centres (or ‘representatives’ as you call them).
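
To make the MaxMin idea concrete, here is a minimal pure-Python sketch of the greedy selection: each new pick is the molecule whose distance to its nearest earlier pick is largest. It is only an illustration under stated assumptions, not the RDKit implementation, which is lazy and far faster. Fingerprints are assumed to be sets of on-bit indices, and the function name `maxmin_pick` and the choice of molecule 0 as the first pick are assumptions of this sketch.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if not union:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def maxmin_pick(fps, n_picks, first=0):
    """Greedy MaxMin: each pick maximizes its distance to the nearest earlier pick."""
    picks = [first]
    # min_dist[i] = distance from molecule i to its nearest pick so far.
    min_dist = [tanimoto_distance(fp, fps[first]) for fp in fps]
    while len(picks) < min(n_picks, len(fps)):
        nxt = max(range(len(fps)), key=min_dist.__getitem__)
        if min_dist[nxt] == 0.0:
            break  # everything left duplicates an existing pick
        picks.append(nxt)
        # Only O(N) state is kept: one running minimum per molecule.
        for i, fp in enumerate(fps):
            d = tanimoto_distance(fp, fps[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picks


# Toy data (hypothetical): two pairs of close analogues and one outlier.
fps = [{1, 2, 3}, {1, 2, 4}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
picks = maxmin_pick(fps, n_picks=3)
```

Note that memory use grows with the number of molecules, not with its square, which is why this step is not memory-hungry.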
>
> [Yes, it’s true that the centres can be ‘weird’, but if they *are* in
> your set, maybe the issue is in the composition of the set itself, not in
> what you do with it; maybe you should remove singletons (initially picked
> molecules that have Tanimoto distance 1 or very close to 1 from all other
> molecules) upfront].
>
> Unless I am missing something, it sounds surprising that you cannot use
> BulkTanimotoSimilarity to do the clustering.
>
> The size of the similarity matrix is not (4M)^2, but only ~4M *
> len(picks); so unless you picked almost everything…(?)
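
A quick back-of-the-envelope check of those two sizes, as a small Python calculation. The 4-byte float per value and the figure of 10,000 picks are assumptions chosen purely for illustration; neither number comes from the thread.

```python
def matrix_gib(n_rows, n_cols, bytes_per_value=4):
    """Memory needed for a dense n_rows x n_cols matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_mols = 4_000_000
n_picks = 10_000  # hypothetical number of cluster centres

full_gib = matrix_gib(n_mols, n_mols)     # all-vs-all matrix
picks_gib = matrix_gib(n_mols, n_picks)   # molecules vs picked centres
row_bytes = n_picks * 4                   # one molecule vs all centres
```

Under these assumptions the all-vs-all matrix needs roughly 59,600 GiB, the molecules-vs-picks matrix roughly 149 GiB, and a single row only 40,000 bytes.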
>
> In fact to counteract even this potential issue, we made our own simple
> variation of the above code that only processes one molecule at a time, and
> applied it without issues to a 1M set; surely you should be able to make a
> temporary list of floating point numbers of length len(picks).
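
The one-molecule-at-a-time variation might look roughly like the sketch below. This is a guess at the general shape, not Giovanni's actual code. It assumes fingerprints are sets of on-bit indices, that each molecule simply joins its most similar centre, and that ties go to the earliest pick; `assign_to_centres` is a hypothetical name.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def assign_to_centres(fps, picks):
    """Map each picked centre to the molecules whose most similar centre it is."""
    clusters = {p: [] for p in picks}
    for i, fp in enumerate(fps):
        # Temporary list of len(picks) similarities; nothing larger is held.
        sims = [tanimoto(fp, fps[p]) for p in picks]
        best = max(range(len(picks)), key=sims.__getitem__)
        clusters[picks[best]].append(i)
    return clusters


# Toy data (hypothetical), with centres 0, 2 and 4 already picked.
fps = [{1, 2, 3}, {1, 2, 4}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
clusters = assign_to_centres(fps, picks=[0, 2, 4])
```

With real RDKit fingerprints, the inner list comprehension would presumably be a single BulkTanimotoSimilarity call of one molecule against the picked fingerprints.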
>
>
>
> On the other hand, like others also commented, it’s not clear why you want
> to do the clustering in the first place.
>
> Searching by fingerprint similarity in a set of 4 M molecules should not
> be that slow.
>
>
>
> *As for:*
>
> *“I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix)”*
>
>
>
> IMO you do not necessarily need to have a full N^2 distance matrix to do
> what you describe, as indeed shown by Greg’s post.
>
> And more importantly, a distance matrix for a molecular set is rarely
> ‘sparse’ (where ‘sparse’ in this context means mostly full of 0’s, and with
> only few non-0 elements), because that only happens when most molecules
> have identical FP’s, which should be very rare, unless your set is very
> redundant.
>
>
>
> In this sense, I would suggest an alternative clustering approach (if you
> are convinced that you *do* want to do clustering).
>
> Make a sparse feature matrix (binary matrix encoding the ‘on’ FP bits of
> your molecules vs their index).
>
> From that, make not a *distance*, but a (Tanimoto) *similarity* matrix,
> with a threshold below which it is capped to 0 (I can only suggest R
> package proxyC for this
> https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is
> a python implementation, I don’t know).
>
> That matrix can indeed be very sparse: only molecules that have a
> similarity higher than your threshold have a value, the rest is all 0 and
> is not stored. Obviously you should not use too low a threshold, otherwise
> you are back to a dense (4M)^2 matrix, which I assume you cannot make or
> store.
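
A minimal pure-Python sketch of the thresholded similarity idea, for anyone without R to hand. It is not a port of proxyC. The sparse feature matrix is held in inverted form (bit to molecules), so only pairs that share at least one bit are ever touched, and only pairs at or above the threshold are stored. Fingerprints as sets of on-bit indices and the name `sparse_tanimoto` are assumptions of this sketch; at 4M molecules one would want a proper sparse-matrix library rather than Python dictionaries.

```python
from collections import defaultdict


def sparse_tanimoto(fps, threshold):
    """Return {(i, j): similarity} for i < j with Tanimoto >= threshold."""
    # Sparse feature matrix in inverted form: bit -> molecules that set it.
    postings = defaultdict(list)
    for i, fp in enumerate(fps):
        for bit in fp:
            postings[bit].append(i)

    sims = {}
    for i, fp in enumerate(fps):
        shared = defaultdict(int)  # j -> number of on-bits shared with i
        for bit in fp:
            for j in postings[bit]:
                if j > i:
                    shared[j] += 1
        for j, c in shared.items():
            # Tanimoto = |A & B| / (|A| + |B| - |A & B|)
            t = c / (len(fp) + len(fps[j]) - c)
            if t >= threshold:
                sims[(i, j)] = t
    return sims


# Toy data (hypothetical): only two of the ten pairs pass the threshold.
fps = [{1, 2, 3}, {1, 2, 4}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
sims = sparse_tanimoto(fps, threshold=0.5)
```

Pairs that share no bits, or that fall below the threshold, simply never appear in the result, which is what keeps it sparse.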
>
> From this (which is essentially an *adjacency* matrix - telling you which
> molecules are or are not ‘linked’) you can do graph representations and
> even clustering, e.g. using igraph (again, I can only suggest an R package
> for that https://igraph.org/r/, https://kateto.net/netscix2016.html).
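
As a stand-in for the igraph step, the simplest possible graph clustering is to take connected components of the adjacency structure: two molecules end up in the same cluster if a chain of above-threshold links joins them. This is a plainly swapped-in technique (equivalent to single linkage at the threshold), not what igraph's community detection methods do. The sketch below uses a small union-find, and `connected_components` is a hypothetical name.

```python
def connected_components(n, edges):
    """Group nodes 0..n-1 into clusters linked by the given (i, j) edges."""
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Edges are the molecule pairs whose similarity passed the threshold
# (hypothetical toy data: molecule 4 has no neighbours and stays a singleton).
clusters = connected_components(5, [(0, 1), (2, 3)])
```

Connected components tend to chain loosely related molecules together when the threshold is low, so a stricter threshold or a proper community detection method may give tighter clusters.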
>
>
>
> Hope this helps.
>
> Giovanni
>
>
>
> *From:* Tristan Camilleri 
> *Sent:* 02 May 2022 07:03
> *To:* Patrick Walters 
> *Cc:* RDKit Discuss 
> *Subject:* Re: [Rdkit-discuss] Clustering
>
>
>
>
> Thanks for the feedback. I do not have an explicit need to perform
> clustering; it is more that I want to learn how to do it.
>
>
>
> Any pointers to this effect would be greatly appreciated.
>
>
>
> Tristan
>
>
>
> On Sun, 1 May 2022 at 18:18, Patrick Walters wrote:
>
> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
>
>
>
> Here are a couple of relevant blog posts.
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
> <https://eur05.safelinks.protection.outlook.com/?url=http%3A%2F%2Fpracticalcheminformatics.blogspot.com%2F2020%2F10%2Fwhat-do-molecules-that-look-like-this.html=05%7C01%7C%7Cd3c48e2fe21c4aef98fa08da2bf92532%7C627f3c33bcc