Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-27 Thread Chris Earnshaw
Hi, there's a degree of confusion here depending on whether people are considering *similarity* (Tanimoto, cosine, whatever) or *distance* (however that's defined). The two are very different from each other. Cosine *similarity* obeys the triangle inequality; cosine *distance* doesn't - pretty
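The remark about cosine distance can be checked with a small counterexample. This is only a sketch: it assumes "cosine distance" means 1 - cosine similarity (the preview does not give a definition), and the three vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    # Assumed definition: 1 - cosine similarity.
    return 1.0 - cosine_similarity(u, v)


# Hypothetical vectors: b lies halfway (45 degrees) between a and c.
a, b, c = (1, 0), (1, 1), (0, 1)

d_ac = cosine_distance(a, c)  # direct route
d_ab = cosine_distance(a, b)  # first leg via b
d_bc = cosine_distance(b, c)  # second leg via b

# The triangle inequality would require d_ac <= d_ab + d_bc.
violates = d_ac > d_ab + d_bc
print(d_ac, d_ab + d_bc, violates)
```

Under that assumed definition the direct route is longer than the detour through b, so the triangle inequality fails for these three vectors.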

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Francois Berenger
On 21/09/2018 16:53, Chris Earnshaw wrote: Hi, I'm afraid I can't help with an RDKit solution to your question, but there are a couple of issues which should be borne in mind: 1) The centroid of a cluster is a vector mean of the fingerprints of all the members of the cluster and probably will not

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Andrew Dalke
On Sep 26, 2018, at 20:26, Peter S. Shenkin wrote: > Ah, David, but how do you define a "real" singleton? There can be many different definitions of what a '"real" singleton' might be, but we are specifically talking about Butina clustering. The Butina paper defines the term "false singleton",

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Peter S. Shenkin
Ah, David, but how do you define a "real" singleton? -P. On Wed, Sep 26, 2018 at 1:30 PM David Cosgrove wrote: > Slightly off topic, but a minor issue with the Taylor-Butina algorithm is > that it generates “false singletons”. These are molecules just outside the > clustering cutoff that are

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread David Cosgrove
Slightly off topic, but a minor issue with the Taylor-Butina algorithm is that it generates “false singletons”. These are molecules just outside the clustering cutoff that are stranded when their neighbours are put in a different, larger cluster. We used to find it convenient to have a sweep of
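How a "false singleton" arises can be shown on a toy data set. The sketch below is a simplified Taylor-Butina-style pass, not RDKit's implementation: the neighbour counts are computed once and never updated, and the one-dimensional "molecules" (integer positions on a line) and the cutoff are made up for illustration.

```python
def butina(neighbours):
    """Simplified Taylor-Butina pass.

    neighbours[i] is the set of indices within the cutoff of item i.
    Returns a list of clusters; the first index in each is the seed.
    """
    # Visit items in order of decreasing neighbour count (ties keep index order).
    order = sorted(range(len(neighbours)), key=lambda i: -len(neighbours[i]))
    assigned = set()
    clusters = []
    for seed in order:
        if seed in assigned:
            continue
        # The seed claims every neighbour that is still unassigned.
        members = [seed] + sorted(n for n in neighbours[seed] if n not in assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical data: positions on a line, neighbours if within `cutoff`.
positions = [0, 3, 4, 5, 8, 12]
cutoff = 4
neighbours = [
    {j for j, q in enumerate(positions) if j != i and abs(p - q) <= cutoff}
    for i, p in enumerate(positions)
]

clusters = butina(neighbours)

# A false singleton ends up alone even though it had neighbours within the cutoff.
false_singletons = [c[0] for c in clusters if len(c) == 1 and neighbours[c[0]]]
print(clusters, false_singletons)
```

Here the item at position 12 has one neighbour (position 8), but that neighbour is claimed by the larger cluster seeded at position 4, which itself is outside the cutoff of 12. Item 12 is therefore stranded as a singleton.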

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Peter S. Shenkin
Well, I'm not really familiar with the Taylor-Butina clustering method, so I'm proposing a methodology based on generalizing something that I found to be useful in a somewhat different clustering context. Presuming that what you are clustering is the fingerprints of structures, and that you know

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 25, 2018, at 17:13, Peter S. Shenkin wrote: > FWIW, in work on conformational clustering, I used the “most representative” > molecule; that is, the real molecule closest to the mathematical centroid. > This would probably be the best way of displaying a single molecule that > typifies

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Peter S. Shenkin
(I see that I accidentally responded to Andrew, only, earlier; I'm copying to the group this time.) FWIW, in work on conformational clustering, I used the “most representative” molecule; that is, the real molecule closest to the mathematical centroid. This would probably be the best way of
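The "most representative" idea can be sketched in a few lines. This assumes fingerprints are plain 0/1 lists and that closeness to the centroid is measured by squared Euclidean distance; the message does not say which distance was used, and the four fingerprints are invented for illustration.

```python
def centroid(fingerprints):
    """Component-wise mean of equal-length 0/1 fingerprint vectors."""
    n = len(fingerprints)
    return [sum(column) / n for column in zip(*fingerprints)]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def most_representative(fingerprints):
    """Index of the real member closest to the mathematical centroid."""
    centre = centroid(fingerprints)
    return min(
        range(len(fingerprints)),
        key=lambda i: squared_euclidean(fingerprints[i], centre),
    )


# Hypothetical cluster of four 4-bit fingerprints.
cluster = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
]

centre = centroid(cluster)
best = most_representative(cluster)
print(centre, best)
```

The centroid has fractional components, so it matches no member exactly; the function returns the index of the real member nearest to it.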

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 21, 2018, at 14:53, Philipp Thiel wrote: > you probably read about the Tanimoto being a proper metric in case of having > binary data > in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the > revised edition. What we call Tanimoto is more broadly known as the
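For binary fingerprints, 1 - Tanimoto is the Jaccard distance, which is known to be a metric. A brute-force check on a small universe illustrates this (it is not a proof). The sketch assumes fingerprints are represented as sets of on-bits and that two empty fingerprints have similarity 1; both choices are conventions picked for this example.

```python
from itertools import combinations, product


def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def tanimoto_distance(a, b):
    return 1.0 - tanimoto(a, b)


# Every possible fingerprint over a 4-bit universe (16 subsets).
bits = range(4)
fingerprints = [frozenset(s) for r in range(5) for s in combinations(bits, r)]

# Collect any triple that breaks d(a, c) <= d(a, b) + d(b, c).
# The small tolerance guards against floating-point rounding.
violations = [
    (a, b, c)
    for a, b, c in product(fingerprints, repeat=3)
    if tanimoto_distance(a, c)
    > tanimoto_distance(a, b) + tanimoto_distance(b, c) + 1e-12
]
print(len(fingerprints), len(violations))
```

No triple among the 4096 combinations violates the inequality, which is consistent with the claim for binary data.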

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-21 Thread Philipp Thiel
> Cc: "Rdkit-discuss@lists.sourceforge.net", "James T. Metz" > Sent: Friday, 21 September, 2018 13:45:18 > Subject: Re: [Rdkit-discuss] Butina clustering with additional output > I used to have a paper that demonstrated that the tanimoto coefficient does,

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-21 Thread David Cosgrove
I used to have a paper that demonstrated that the tanimoto coefficient does, in fact, obey the triangle inequality. I fear I lost access to it when I retired but maybe a determined google expert could rediscover it. I expect James means what we used to call the cluster seed, i.e. the molecule the

Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-21 Thread Chris Earnshaw
Hi, I'm afraid I can't help with an RDKit solution to your question, but there are a couple of issues which should be borne in mind: 1) The centroid of a cluster is a vector mean of the fingerprints of all the members of the cluster and probably will not be represented *exactly* by any member of

[Rdkit-discuss] Butina clustering with additional output

2018-09-20 Thread James T. Metz via Rdkit-discuss
RDkit Discussion Group, I note that RDkit can perform Butina clustering. Given an SDF of small molecules I would like to cluster the ligands, but obtain additional information from the clustering algorithm. In particular, I would like to obtain the cluster number and Tanimoto distance from
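One possible shape for the requested output is sketched below in plain Python, without RDKit. It assumes the distance wanted is the Tanimoto distance from each molecule to its cluster seed (the question is cut off, so this is a guess), it represents fingerprints as sets of on-bits, and it uses a simplified Butina pass with made-up data. In RDKit itself, `rdkit.ML.Cluster.Butina.ClusterData` does the clustering and lists each cluster's centroid first; that call is not used here.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for two sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def butina_with_distances(fps, cutoff):
    """Return (cluster number, distance to cluster seed) for each fingerprint.

    Simplified Butina pass: neighbour lists are built once, seeds are taken
    in order of decreasing neighbour count, and each seed claims all of its
    still-unassigned neighbours.
    """
    n = len(fps)
    dist = [[tanimoto_distance(fps[i], fps[j]) for j in range(n)] for i in range(n)]
    neighbours = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff] for i in range(n)
    ]
    order = sorted(range(n), key=lambda i: -len(neighbours[i]))

    result = {}
    cluster_id = 0
    for seed in order:
        if seed in result:
            continue
        result[seed] = (cluster_id, 0.0)  # the seed is at distance 0 from itself
        for j in neighbours[seed]:
            if j not in result:
                result[j] = (cluster_id, dist[seed][j])
        cluster_id += 1
    return [result[i] for i in range(n)]


# Hypothetical fingerprints given as sets of on-bit positions.
fps = [
    {1, 2, 3, 4},
    {1, 2, 3},
    {1, 2, 3, 5},
    {7, 8, 9},
    {7, 8},
]

table = butina_with_distances(fps, cutoff=0.5)
for index, (cluster, distance) in enumerate(table):
    print(index, cluster, round(distance, 3))
```

Each row gives the molecule index, its cluster number, and its distance to that cluster's seed; seeds appear with distance 0.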