On Sep 26, 2018, at 20:26, Peter S. Shenkin <[email protected]> wrote:
> Ah, David, but how do you define a "real" singleton?

There can be many different definitions of what a '"real" singleton' might be, 
but we are specifically talking about Butina clustering.

The Butina paper defines the term "false singleton", which Dave quoted. The 
relevant text from DOI: 10.1021/ci9803381 is:

"""The molecules that have not been flagged by the end of the clustering 
process, either as a cluster centroid or as a cluster member, become 
singletons. It is important to emphasize at this stage that one of the 
consequences of this approach is that some molecules defined as singletons may 
have neighbors at the given Tanimoto similarity index, but those neighbors have 
been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors 
in its list. .... the problem with the creation of a number of false singletons 
that do in fact have similar compounds within the set is easily offset by the 
final quality of the clusters that this approach generates."""

As you can see, there are two types of singletons, and one is called a "false 
singleton". No specific name is used for the other type of singleton, but it's 
easy to see how they can be called "real" singletons, without confusion or 
misunderstanding.

(FWIW, my implementation, mentioned in an earlier email, uses the term "true 
singleton" for a singleton which is not a "false singleton", but the 
difference is only in the label.)
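
To make the distinction concrete, here is a minimal pure-Python sketch. It is 
not my implementation or RDKit's, only a simplified Butina-style pass over a 
precomputed similarity matrix. The function names, the toy matrix, and the 0.7 
threshold are all made up for illustration.

```python
def butina(sim, threshold):
    """Simplified Butina-style clustering on a symmetric similarity matrix.

    Returns (clusters, neighbors); each cluster lists its centroid first.
    """
    n = len(sim)
    # Neighbor list of each molecule: every other molecule at or above threshold.
    neighbors = [
        {j for j in range(n) if j != i and sim[i][j] >= threshold}
        for i in range(n)
    ]
    # Visit molecules by decreasing neighbor count (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    flagged = set()
    clusters = []
    for centroid in order:
        if centroid in flagged:
            continue
        # Only neighbors not already claimed by a "stronger" centroid join.
        members = sorted(neighbors[centroid] - flagged)
        clusters.append([centroid] + members)
        flagged.add(centroid)
        flagged.update(members)
    return clusters, neighbors


def classify_singletons(clusters, neighbors):
    """Split singleton clusters into (true, false) singletons."""
    true_singletons, false_singletons = [], []
    for cluster in clusters:
        if len(cluster) != 1:
            continue
        mol = cluster[0]
        # A false singleton has neighbors, but they all went to other clusters.
        if neighbors[mol]:
            false_singletons.append(mol)
        else:
            true_singletons.append(mol)
    return true_singletons, false_singletons


# Hypothetical toy data: 0.8 for the listed pairs, 0.1 for everything else.
pairs = {(0, 1), (0, 2), (0, 3), (3, 4)}
n = 6
sim = [
    [1.0 if i == j else (0.8 if (min(i, j), max(i, j)) in pairs else 0.1)
     for j in range(n)]
    for i in range(n)
]

clusters, neighbors = butina(sim, threshold=0.7)
true_s, false_s = classify_singletons(clusters, neighbors)
print(clusters)  # [[0, 1, 2, 3], [4], [5]]
print(true_s)    # [5]
print(false_s)   # [4]
```

Molecule 4 is similar to molecule 3, but 3 was already taken by centroid 0, so 
4 ends up as a false singleton. Molecule 5 has no neighbors at all, so it is a 
true singleton.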

To confirm that this is what Dave means, I'll quote from his paper:

Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of 
compound libraries for fragment screening. Journal of Computer-Aided Molecular 
Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5

"""The clustering program flush_clus is an implementation of the 
sphere-exclusion algorithm of Taylor [41], which has also been reported 
independently by Butina ... One consequence of the algorithm is the production 
of ‘false singleton clusters.’ The final clusters in the output are invariably 
singleton clusters, where the only member is the seed. Some of these will be 
true singletons, i.e. molecules lacking neighbors within the clustering 
threshold, but others (the false singletons) will be singletons by virtue of 
the fact that their neighbors were placed in other larger clusters in a 
previous iteration of the algorithm. The flush_clus program offers the 
opportunity of performing a final sweep through the clusters using a larger 
similarity threshold and placing the singleton molecules within the cluster for 
which it has the greatest similarity with the seed, so long as this is within 
the threshold."""
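
The final sweep described in that passage might look something like the 
following sketch. This is not the flush_clus code, which I am not reproducing 
here. I am assuming that the sweep threshold is given as a minimum similarity 
to the seed, that the seed is the first entry of each cluster, and that 
singletons are only compared against the seeds of non-singleton clusters. The 
function name and the toy numbers are hypothetical.

```python
def sweep_singletons(clusters, sim, sweep_threshold):
    """Move each singleton into the cluster whose seed it is most similar to,
    provided that similarity is at least sweep_threshold.

    Assumes the seed (centroid) is the first element of each cluster.
    """
    multi = [list(c) for c in clusters if len(c) > 1]
    remaining = []
    for cluster in clusters:
        if len(cluster) != 1:
            continue
        mol = cluster[0]
        # Best candidate is the non-singleton cluster with the most similar seed.
        best = max(multi, key=lambda c: sim[mol][c[0]], default=None)
        if best is not None and sim[mol][best[0]] >= sweep_threshold:
            best.append(mol)
        else:
            remaining.append([mol])
    return multi + remaining


# Hypothetical toy data: molecules 3 and 4 start out as singletons.
sim = [
    [1.0, 0.8, 0.8, 0.6, 0.2],
    [0.8, 1.0, 0.5, 0.3, 0.1],
    [0.8, 0.5, 1.0, 0.3, 0.1],
    [0.6, 0.3, 0.3, 1.0, 0.1],
    [0.2, 0.1, 0.1, 0.1, 1.0],
]
clusters = [[0, 1, 2], [3], [4]]

swept = sweep_singletons(clusters, sim, sweep_threshold=0.5)
print(swept)  # [[0, 1, 2, 3], [4]]
```

Molecule 3 is within the sweep threshold of seed 0 and joins that cluster, 
while molecule 4 is not and stays a singleton.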

Cheers,


                                Andrew
                                [email protected]



_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
