Hi,

you probably read that the Tanimoto is a proper metric for binary data in Leach and Gillet, 'Introduction to Chemoinformatics', chapter 5.3.1 of the revised edition.
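For small binary fingerprints the claim can be checked by brute force. The following is only a sketch, not a proof and not RDKit code: fingerprints are modelled as sets of on-bit indices, the distance is taken as 1 minus the Tanimoto similarity, and two all-zero fingerprints are assumed to have distance 0. The names `tanimoto_distance`, `all_fingerprints` and `count_triangle_violations` are made up for this illustration.

```python
from itertools import product


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of on-bit indices.

    Assumption: two empty fingerprints are treated as identical (distance 0).
    """
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def all_fingerprints(n_bits: int) -> list:
    """Enumerate every binary fingerprint of length n_bits as a frozenset."""
    return [
        frozenset(i for i in range(n_bits) if (mask >> i) & 1)
        for mask in range(2 ** n_bits)
    ]


def count_triangle_violations(n_bits: int, tol: float = 1e-12) -> int:
    """Count triples (a, b, c) with d(a, c) > d(a, b) + d(b, c)."""
    fps = all_fingerprints(n_bits)
    violations = 0
    for a, b, c in product(fps, repeat=3):
        if tanimoto_distance(a, c) > tanimoto_distance(a, b) + tanimoto_distance(b, c) + tol:
            violations += 1
    return violations


# Exhaustive check over all 16 fingerprints of length 4 (4096 triples).
print(count_triangle_violations(4))
```

The exhaustive loop grows as 8^n_bits, so this is only practical for a handful of bits; it illustrates the binary case and says nothing about count-based or real-valued fingerprints.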

Best,
Philipp Thiel

> From: "David Cosgrove" <[email protected]>
> To: "Chris Earnshaw" <[email protected]>
> Cc: "[email protected]" <[email protected]>, "James T. Metz" <[email protected]>
> Sent: Friday, 21 September, 2018 13:45:18
> Subject: Re: [Rdkit-discuss] Butina clustering with additional output
>
> I used to have a paper that demonstrated that the Tanimoto coefficient does, in
> fact, obey the triangle inequality. I fear I lost access to it when I retired,
> but maybe a determined Google expert could rediscover it.
>
> I expect James means what we used to call the cluster seed, i.e. the molecule
> the cluster was based on, rather than the mathematical centroid. Calculating
> distances from each cluster member to that would be quite straightforward as a
> post-processing step, although that would roughly double the time taken.
>
> Regards,
> Dave
>
> On Fri, 21 Sep 2018 at 09:55, Chris Earnshaw <[email protected]> wrote:
>> Hi
>>
>> I'm afraid I can't help with an RDKit solution to your question, but there are
>> a couple of issues which should be borne in mind:
>>
>> 1) The centroid of a cluster is a vector mean of the fingerprints of all the
>> members of the cluster and probably will not be represented exactly by any
>> member of the cluster; in this case no structures will have a distance of 0.0
>> from the centroid. Do you want to calculate the distances from the true
>> centroid or from the structure(s) closest to the centroid?
>>
>> 2) The Tanimoto metric doesn't obey the triangle inequality and is therefore
>> sub-optimal for this kind of analysis. It's better to use an alternative which
>> does obey the triangle inequality - e.g. the Cosine metric.
>>
>> Regards,
>> Chris Earnshaw
>>
>> On Thu, 20 Sep 2018 at 21:55, James T. Metz via Rdkit-discuss
>> <[email protected]> wrote:
>>> RDKit Discussion Group,
>>>
>>> I note that RDKit can perform Butina clustering. Given an SDF of small
>>> molecules I would like to cluster the ligands, but obtain additional
>>> information from the clustering algorithm. In particular, I would like to
>>> obtain the cluster number and Tanimoto distance from the centroid for every
>>> ligand in the SDF. The centroid would obviously have a distance of 0.00.
>>>
>>> Has anyone written additional RDKit code to extract this additional
>>> information?
>>>
>>> Thank you.
>>>
>>> Regards,
>>> Jim Metz
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
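
The post-processing step Dave describes (distance from each cluster member to the cluster seed) can be sketched as below. This is a minimal illustration under stated assumptions, not anyone's posted solution: it assumes the clusters arrive as tuples of molecule indices with the seed first, which is the shape RDKit's `Butina.ClusterData` is documented to return, and it models fingerprints as plain sets of on-bit indices so the sketch runs without RDKit. `tanimoto_distance` and `seed_distances` are hypothetical helper names.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - Tanimoto similarity for two sets of on-bit indices.

    Assumption: two empty fingerprints are treated as identical (distance 0).
    """
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def seed_distances(clusters, fingerprints):
    """Return (molecule_index, cluster_number, distance_to_seed) rows.

    clusters:     tuple of tuples of molecule indices; the first index in
                  each inner tuple is taken to be the cluster seed.
    fingerprints: sequence of frozensets, indexed by molecule index.
    """
    rows = []
    for cluster_number, members in enumerate(clusters):
        seed_fp = fingerprints[members[0]]
        for idx in members:
            rows.append((idx, cluster_number, tanimoto_distance(seed_fp, fingerprints[idx])))
    # Sort by molecule index so the rows line up with the input order.
    return sorted(rows)


# Toy data standing in for real fingerprints and a real clustering result.
fingerprints = [
    frozenset({1, 2, 3}),
    frozenset({1, 2, 3, 4}),
    frozenset({7, 8}),
    frozenset({7, 8, 9}),
]
clusters = ((0, 1), (2, 3))

for idx, cluster_number, dist in seed_distances(clusters, fingerprints):
    print(f"molecule {idx}: cluster {cluster_number}, distance to seed {dist:.2f}")
```

Each seed gets a distance of 0.00 to itself, which matches what James expected for the "centroid". The extra pass costs one distance calculation per molecule on top of the clustering itself.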
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

