ECFP, like all fingerprints, is notoriously sensitive to "trivial" structural changes, so your example is no surprise (the extra carbonyl group also affects aromaticity, depending on your view of aromaticity). But I don't think there is anything simple and fast that's better. Maybe a clustering approach would work? Something like sphere exclusion clustering, counting the number of clusters at 0.8 - 0.9 similarity? With 30K structures it sounds computationally tractable.
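
A minimal sketch of the sphere exclusion idea, in case it helps. It assumes each fingerprint is available as a set of on-bit indices; the sets below are plain Python stand-ins rather than RDKit objects, and the function names and toy data are made up for illustration. The ratio of clusters to molecules is just one possible way to turn the cluster count into a "diversity index".

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fps, threshold):
    """Return the indices of the cluster centres.

    Molecules are visited in input order. A molecule becomes a new centre
    only if its similarity to every existing centre is below the threshold;
    otherwise it falls inside an existing sphere and is excluded.
    """
    centres = []
    for i, fp in enumerate(fps):
        if all(tanimoto(fp, fps[c]) < threshold for c in centres):
            centres.append(i)
    return centres


# Toy data (hypothetical): two loose families of "molecules".
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]

for threshold in (0.8, 0.5):
    centres = sphere_exclusion(fps, threshold)
    # More clusters per molecule at a given threshold means a more diverse set.
    print(threshold, len(centres), len(centres) / len(fps))
```

The result depends on the input order, so shuffling the set and repeating the count a few times gives a feel for how stable the number is. The cost is at most (number of molecules) x (number of centres) comparisons, which is usually far fewer than the full pairwise matrix.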

Tim

On 25/05/2015 15:10, JP wrote:
RDKitters,

I have a question that is part RDKit, part methodology. I hope this email isn't too much of the "how long is a piece of string" nature.

I have a set of molecules (~30,000) which I would like to get a structural "diversity index" for. So I thought easy - generate some fingerprint I fancy (ECFP-like, rad 2), take a threshold I fancy (0.7), select a similarity metric I fancy (Tanimoto) and apply these to the set in a pairwise fashion (you can only do this for a small-ish number of molecules). The resulting distribution of Tanimoto scores defines the similarity (or dissimilarity) of the set.
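
The pairwise procedure described above can be sketched as follows. This is an illustration under stated assumptions: fingerprints are represented as plain Python sets of on-bit indices rather than RDKit objects, and the function names, summary statistics, and toy data are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pairwise_similarities(fps):
    """All n*(n-1)/2 pairwise Tanimoto scores."""
    return [tanimoto(a, b) for a, b in combinations(fps, 2)]


def summarize(sims, threshold=0.7):
    """Mean score and fraction of pairs at or above the threshold."""
    n = len(sims)
    return {
        "pairs": n,
        "mean": sum(sims) / n,
        "frac_similar": sum(s >= threshold for s in sims) / n,
    }


# Toy data (hypothetical stand-ins for real fingerprints).
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]

summary = summarize(pairwise_similarities(fps), threshold=0.7)
print(summary)
```

For 30,000 molecules the full matrix is roughly 450 million pairs, so in practice one would probably score a random sample of pairs (or of molecules) instead of all of them.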

First of all, is there a better way to do this? Does anyone have a feel for the numbers to use (fingerprint type, radius, number of bits)? Is there some "industry standard"? Which method should I use, GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I want ECFP-like fingerprints)? What determines when to use one over the other?

All my scores are rather low even for relatively similar structures -- so I think one of my parameters must be off. Just adding (or removing) a carbonyl drops my score to 0.43. I made this notebook example: http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7

To the RDKit question: GetMorganFingerprintAsBitVect and GetMorganFingerprint give different Tanimoto scores (with the same radius: 2). This is of course because for the explicit bit vector we can set the length of the vector/fingerprint. Is there an equivalence between the two (say, using n bits gives the same results as GetMorganFingerprint)? How come the GetMorganFingerprint method has no user-defined length for the fingerprint? What are the hashed equivalents of these fingerprints (e.g. GetHashedMorganFingerprint)?
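
One way to see why the two calls can disagree is a sketch like the one below. It is plain Python, not RDKit code. It assumes that the unfolded fingerprint is a sparse vector of feature counts keyed by large integer IDs, that the bit-vector form keeps only presence/absence, and that folding to n bits is a simple modulo. The feature IDs and helper names are invented for illustration.

```python
from collections import Counter


def tanimoto_counts(a, b):
    """Tanimoto on count vectors: sum of min counts over sum of max counts."""
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return shared / (total - shared) if total else 1.0


def tanimoto_bits(a, b):
    """Tanimoto on sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fold(counts, n_bits):
    """Fold feature IDs into n_bits positions, discarding the counts.

    Assumption: folding is a plain modulo, so distinct features can collide.
    """
    return {feature % n_bits for feature in counts}


# Hypothetical feature counts for two molecules (feature ID -> occurrences).
mol_a = Counter({5: 3, 70: 1, 2053: 1})
mol_b = Counter({5: 1, 70: 1, 900: 1})

count_score = tanimoto_counts(mol_a, mol_b)                        # counts kept
unfolded_score = tanimoto_bits(set(mol_a), set(mol_b))             # presence only
folded_score = tanimoto_bits(fold(mol_a, 2048), fold(mol_b, 2048))  # 5 and 2053 collide

print(count_score, unfolded_score, folded_score)
```

In this toy case the three scores are all different: counts change the score because a feature that occurs three times in one molecule and once in the other is only partly shared, and folding changes it again because features 5 and 2053 land on the same bit. Under these assumptions there is no bit-vector length that reproduces the count-based score in general; a very long bit vector only removes the collisions, not the loss of counts.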

Take care,
JP

PS: A small suggestion, if I am allowed. The fingerprint classes could do with an informative toString (or the non-Java equivalent). I know there is ToBitString, but you need to call that explicitly when printing.



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
