Re: [Rdkit-discuss] Butina clustering with additional output

David Cosgrove Wed, 26 Sep 2018 10:31:51 -0700

Slightly off topic, but a minor issue with the Taylor-Butina algorithm is
that it generates “false singletons”. These are molecules just outside the
clustering cutoff that are stranded when their neighbours are put in a
different, larger cluster. We used to find it convenient to have a sweep of
these, at a slightly looser cutoff, and drop them into the cluster whose
centroid/seed they were nearest to. This could be added to Andrew’s code
quite easily. At the very least, it’s worth keeping track of the initial
number of neighbours within the cluster cutoff that each fingerprint had so
as to distinguish real singletons from these artefactual ones.
Dave
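
A minimal sketch of the sweep described above, in plain Python rather than
Andrew's code. Assumptions that are not from the thread: fingerprints are held
as sets of on-bit indices, cutoffs are Tanimoto similarities (so the looser
cutoff is the lower number), seeds are taken in order of initial neighbour
count, and only singletons that had at least one neighbour initially are swept.
The names butina_with_sweep, cutoff and sweep_cutoff are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints held as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # two empty sets score 0.0 here


def butina_with_sweep(fps, cutoff=0.75, sweep_cutoff=0.6):
    """Return (clusters, initial_counts); each cluster lists its seed first."""
    n = len(fps)
    sim = [[tanimoto(fps[i], fps[j]) for j in range(n)] for i in range(n)]
    neighbours = [
        [j for j in range(n) if j != i and sim[i][j] >= cutoff] for i in range(n)
    ]
    # Kept so real singletons (count 0) can be told from stranded ones.
    initial_counts = [len(nb) for nb in neighbours]

    # Clustering pass: visit fingerprints by decreasing neighbour count and
    # let each unassigned one seed a cluster of its unassigned neighbours.
    order = sorted(range(n), key=lambda i: (-initial_counts[i], i))
    assigned = [False] * n
    clusters = []
    for seed in order:
        if assigned[seed]:
            continue
        members = [seed] + [j for j in neighbours[seed] if not assigned[j]]
        for m in members:
            assigned[m] = True
        clusters.append(members)

    # Sweep: a singleton that had neighbours initially is a "false singleton".
    # Drop it into the multi-member cluster whose seed it is most similar to,
    # provided that similarity reaches the looser cutoff.
    multi = [c for c in clusters if len(c) > 1]
    singletons = []
    for c in clusters:
        if len(c) > 1:
            continue
        idx = c[0]
        if initial_counts[idx] == 0 or not multi:
            singletons.append(c)  # real singleton, leave alone
            continue
        best = max(multi, key=lambda m: sim[idx][m[0]])
        if sim[idx][best[0]] >= sweep_cutoff:
            best.append(idx)
        else:
            singletons.append(c)
    return multi + singletons, initial_counts


# Toy data: fingerprint 4 is a neighbour of 3 but just outside the cutoff of
# seed 0, so it is stranded when 3 joins the cluster seeded by 0.
fps = [
    set(range(1, 11)),       # 0: seed of the large cluster
    set(range(1, 10)),       # 1
    set(range(2, 11)),       # 2
    set(range(1, 9)),        # 3
    set(range(1, 8)) | {20}, # 4: false singleton
    {30, 31, 32},            # 5: real singleton
]
clusters, initial_counts = butina_with_sweep(fps)
```

Setting sweep_cutoff equal to cutoff switches the sweep off, which makes it
easy to compare the two results on the same data.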



On Tue, 25 Sep 2018 at 19:56, Peter S. Shenkin <shen...@gmail.com> wrote:

> Well, I'm not really familiar with the Taylor-Butina clustering method, so
> I'm proposing a methodology based on generalizing something that I found to
> be useful in a somewhat different clustering context.
>
> Presuming that what you are clustering is the fingerprints of structures,
> and that you know which structures are in each cluster, you'd compute the
> average of all the fingerprints. That is, each bit position would be given
> a floating point number that is the average of the 0s and 1s at that
> position computed over the structures in the cluster.  Then you'd compute
> the distance (say, Manhattan or Euclidean) between the fingerprint of each
> structure in the cluster and the average so computed. The "most
> representative structure" would be the cluster member whose fingerprint is
> closest to the cluster's average fingerprint. (Some additional mileage
> could be gained by seeing just how far away from the average the "most
> representative structures" are. It might be more representative (i.e.,
> closer) for some clusters than for others.)
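
The procedure quoted above can be sketched roughly as follows. This is an
illustration only, assuming each fingerprint is a set of on-bit indices below
n_bits; the function name most_representative and its parameters are
hypothetical and do not come from RDKit or from anyone's code in this thread.

```python
import math


def most_representative(fps, n_bits, metric="manhattan"):
    """Return (index, distance) of the member nearest the mean fingerprint.

    fps is a non-empty list of sets of on-bit indices in range(n_bits).
    """
    if not fps:
        raise ValueError("cluster is empty")

    # Average fingerprint: per-bit fraction of members with that bit set.
    mean = [0.0] * n_bits
    for fp in fps:
        for bit in fp:
            mean[bit] += 1.0
    mean = [v / len(fps) for v in mean]

    def distance(fp):
        diffs = [abs((1.0 if b in fp else 0.0) - mean[b]) for b in range(n_bits)]
        if metric == "manhattan":
            return sum(diffs)
        if metric == "euclidean":
            return math.sqrt(sum(d * d for d in diffs))
        raise ValueError("metric must be 'manhattan' or 'euclidean'")

    dists = [distance(fp) for fp in fps]
    best = min(range(len(fps)), key=dists.__getitem__)
    # The distance is returned too, so clusters can be compared on how close
    # their representative actually is to the average.
    return best, dists[best]


# Toy cluster of four members over 5 bits.
cluster = [{0, 1, 2}, {0, 1, 2, 3}, {0, 1}, {0, 1, 2, 4}]
index, dist = most_representative(cluster, n_bits=5)
```

On ties, min picks the first member with the smallest distance; whether that
matters would need checking on real clusters.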
>
> It would make sense to try this (since it's easy enough) and see whether
> the resulting "most representative structures" from the clusters really are
> at least roughly representative, by comparing them with viewable random
> subsets of structures from the clusters.
>
> -P.
>
> On Tue, Sep 25, 2018 at 2:36 PM, Andrew Dalke <da...@dalkescientific.com>
> wrote:
>
>> On Sep 25, 2018, at 17:13, Peter S. Shenkin <shen...@gmail.com> wrote:
>> > FWIW, in work on conformational clustering, I used the “most
>> representative” molecule; that is, the real molecule closest to the
>> mathematical centroid. This would probably be the best way of displaying a
>> single molecule that typifies what is in the cluster.
>>
>> In some sense I'm rephrasing Chris Earnshaw's earlier question - how does
>> one do that with Taylor-Butina clustering? And does it make sense?
>>
>> The algorithm starts by picking a centroid based on the fingerprints with
>> the highest number of neighbors, so none of the other cluster members
>> should have more neighbors within that cutoff.
>>
>> I am far from an expert on this topic, but any alternative I can
>> think of makes me think I should have started with something other than
>> Taylor-Butina.
>>
>>
>>
>>                                 Andrew
>>                                 da...@dalkescientific.com
>>
>>
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk

