I used to have a paper that demonstrated that the Tanimoto coefficient
(strictly, the distance 1 - T for binary fingerprints) does, in fact, obey
the triangle inequality. I fear I lost access to it when I retired, but
maybe a determined Google expert could rediscover it.
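
Short of finding the paper, the claim can at least be sanity-checked by brute force. The sketch below is not a proof and not taken from that paper. It assumes binary fingerprints represented as Python sets of on-bit indices, and the helper names (tanimoto_distance, nonempty_subsets, triangle_violations) are made up for illustration. It tests d(a, c) <= d(a, b) + d(b, c) for every triple of non-empty fingerprints on 4 bits, using exact fractions to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations, product


def tanimoto_distance(a, b):
    """Tanimoto distance 1 - |a & b| / |a | b| for two non-empty bit sets."""
    return 1 - Fraction(len(a & b), len(a | b))


def nonempty_subsets(n_bits):
    """All non-empty fingerprints that can be formed from n_bits bit positions."""
    bits = range(n_bits)
    return [frozenset(c)
            for r in range(1, n_bits + 1)
            for c in combinations(bits, r)]


def triangle_violations(fps):
    """Triples (a, b, c) for which d(a, c) > d(a, b) + d(b, c)."""
    return [(a, b, c)
            for a, b, c in product(fps, repeat=3)
            if tanimoto_distance(a, c)
            > tanimoto_distance(a, b) + tanimoto_distance(b, c)]


fps = nonempty_subsets(4)
violations = triangle_violations(fps)
print(len(fps), "fingerprints,", len(violations), "violations")
```

An exhaustive check on a handful of bits only shows that no small counterexample exists for bit-vector fingerprints; it says nothing about the Tanimoto coefficient applied to continuous-valued vectors.
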
I expect James means what we used to call the cluster seed, i.e. the
molecule the cluster was based on, rather than the mathematical centroid.
Calculating the distance from each cluster member to the seed would be
quite straightforward as a post-processing step, although it would roughly
double the time taken.
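
A minimal sketch of that post-processing step follows. It assumes the clustering result is a tuple of tuples of molecule indices with the seed listed first in each cluster, which is the layout RDKit's Butina.ClusterData is documented to return. To keep the example runnable without RDKit, fingerprints are plain Python sets of on-bit indices, and the functions tanimoto_distance and distances_to_seed are hypothetical helpers, not RDKit API.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def distances_to_seed(clusters, fps):
    """Map each molecule index to (cluster number, distance to cluster seed).

    clusters: tuple of tuples of indices into fps, seed first in each tuple.
    fps: list of fingerprints, one per molecule.
    """
    result = {}
    for cluster_no, members in enumerate(clusters):
        seed_fp = fps[members[0]]  # assumption: the first member is the seed
        for idx in members:
            result[idx] = (cluster_no, tanimoto_distance(seed_fp, fps[idx]))
    return result


# Toy data: five fingerprints already grouped into two clusters.
fps = [{1, 2, 3, 4}, {1, 2, 3}, {1, 2, 3, 4, 5}, {7, 8, 9}, {7, 8}]
clusters = ((0, 1, 2), (3, 4))

for idx, (cluster_no, dist) in sorted(distances_to_seed(clusters, fps).items()):
    print(idx, cluster_no, round(dist, 2))
```

With real RDKit fingerprints the distance would presumably become 1 - DataStructs.TanimotoSimilarity(seed_fp, fp), and the seed of each cluster reports a distance of 0.0, as James expected.
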
Regards,
Dave


On Fri, 21 Sep 2018 at 09:55, Chris Earnshaw <cgearns...@gmail.com> wrote:

> Hi
>
> I'm afraid I can't help with an RDKit solution to your question, but there
> are a couple of issues which should be borne in mind:
> 1) The centroid of a cluster is a vector mean of the fingerprints of all
> the members of the cluster and probably will not be represented *exactly*
> by any member of the cluster; in this case no structures will have a
> distance of 0.0 from the centroid. Do you want to calculate the distances
> from the true centroid or from the structure(s) closest to the centroid?
> 2) The Tanimoto metric doesn't obey the triangle inequality and is
> therefore sub-optimal for this kind of analysis. It's better to use an
> alternative which does obey the triangle inequality - e.g. the Cosine
> metric.
>
> Regards,
> Chris Earnshaw
>
>
> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss <
> rdkit-discuss@lists.sourceforge.net> wrote:
>
>> RDKit Discussion Group,
>>
>>     I note that RDKit can perform Butina clustering.  Given an SDF of
>> small molecules, I would like to cluster the ligands but obtain additional
>> information from the clustering algorithm.  In particular, I would like to
>> obtain the cluster number and Tanimoto distance from the centroid for every
>> ligand in the SDF.  The centroid would obviously have a distance of 0.00.
>>
>>     Has anyone written additional RDKit code to extract this
>> information?  Thank you.
>>
>>     Regards,
>>     Jim Metz
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk