ECFP, like all fingerprints, is notoriously sensitive to "trivial"
structural changes, so your example is no surprise (the extra carbonyl
group also affects aromaticity, depending on your view of aromaticity).
But I don't think there is anything simple and fast that's better.
Maybe a clustering approach would work? Something like sphere exclusion
clustering, counting the number of clusters at 0.8 - 0.9 similarity?
With 30K structures it sounds computationally tractable.
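
A minimal sketch of the cluster-counting idea, assuming each fingerprint
is available as a Python set of on-bit indices (for example taken from an
RDKit bit vector's GetOnBits()). The function names and the toy data are
hypothetical, and this is the simple "leader" variant of sphere
exclusion, not a tuned implementation:

```python
def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen here: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def sphere_exclusion_count(fps, threshold):
    """Count cluster centres found by a single leader-style pass.

    A fingerprint becomes a new centre only if it is less similar than
    `threshold` to every centre picked so far; otherwise it falls inside
    an existing sphere.
    """
    centres = []
    for fp in fps:
        if all(tanimoto(fp, c) < threshold for c in centres):
            centres.append(fp)
    return len(centres)


# Toy fingerprints (made-up bit indices, not real molecules).
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 6}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]

clusters_at_09 = sphere_exclusion_count(toy_fps, 0.9)
clusters_at_08 = sphere_exclusion_count(toy_fps, 0.8)
# Clusters per molecule: closer to 1.0 suggests a more diverse set.
diversity_at_08 = clusters_at_08 / len(toy_fps)
```

The result depends on the order in which the molecules are visited, so
the cluster count is best read as a rough index rather than an exact
property of the set.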
Tim
On 25/05/2015 15:10, JP wrote:
RDKitters,
I have a question that is partly about RDKit and partly about
methodology. I hope this email isn't too much of the "how long is a
piece of string" nature.
I have a set of molecules (~30,000) which I would like to get a
structural "diversity index" for. So I thought easy - generate some
fingerprint I fancy (ECFP-like, rad 2), take a threshold I fancy
(0.7), select a similarity metric I fancy (Tanimoto) and apply these
to the set in a pairwise fashion (you can only do this for a small-ish
number of molecules). The resulting distribution of Tanimoto scores
defines the similarity (or dissimilarity) of the set.
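
The pairwise procedure described above could be sketched as follows.
This assumes fingerprints are held as Python sets of on-bit indices; the
helper names, the toy data and the random-sampling shortcut for large
sets are illustrative assumptions, not part of the original notebook:

```python
import itertools
import random


def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen here: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def all_pair_similarities(fps):
    """Exhaustive pairwise scores; only practical for small sets."""
    return [tanimoto(a, b) for a, b in itertools.combinations(fps, 2)]


def sampled_pair_similarities(fps, n_pairs, seed=0):
    """Estimate the score distribution from randomly drawn pairs."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(fps)), 2)  # two distinct molecules
        sims.append(tanimoto(fps[i], fps[j]))
    return sims


def fraction_at_or_above(sims, threshold):
    """Share of pairs scoring at or above the similarity threshold."""
    return sum(s >= threshold for s in sims) / len(sims)


# Toy fingerprints (made-up bit indices, not real molecules).
toy_fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 4, 6}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]

sims = all_pair_similarities(toy_fps)
mean_sim = sum(sims) / len(sims)
frac_similar = fraction_at_or_above(sims, 0.7)
sampled = sampled_pair_similarities(toy_fps, n_pairs=20, seed=42)
```

For ~30,000 molecules the exhaustive version means roughly 450 million
comparisons, which is why a sampled estimate may be the more practical
starting point.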
First of all, is there a better way to do this? Does anyone have a feel
for the numbers to use (fingerprint type, radius, number of bits)? Is
there some 'industry standard'? Which method should I use,
GetMorganFingerprintAsBitVect or GetMorganFingerprint (considering I
want ECFP-like fingerprints)? What determines when to use one over
the other?
All my scores are rather low even for relatively similar structures --
so I think one of my parameters must be off. Just adding (or removing)
a carbonyl drops my score to 0.43.
I made this notebook example:
http://nbviewer.ipython.org/gist/malteseunderdog/6af446c0dbb1ac9840e7
To the RDKit question: GetMorganFingerprintAsBitVect and
GetMorganFingerprint give different Tanimoto scores (with the same
radius: 2). This is of course because for the explicit bit vector we
can set the length of the vector/fingerprint. Is there an equivalence
between the two (say, using n bits gives the same results as
GetMorganFingerprint)? How come the GetMorganFingerprint method has
no user-defined length for the fingerprint? What are the hashed
equivalents of these fingerprints (e.g. GetHashedMorganFingerprint)?
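
One way to see why the two representations can score differently is a
pure-Python sketch. It assumes the unfolded fingerprint is a sparse map
from environment identifier to count, and that the fixed-length bit
vector is obtained by folding each identifier modulo the vector length.
The identifiers, counts and helper names below are made up for
illustration and are not taken from RDKit:

```python
from collections import Counter


def tanimoto_counts(a, b):
    """Tanimoto on count vectors: sum of minima over sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 1.0


def tanimoto_bits(a, b):
    """Tanimoto on sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def fold(counts, n_bits):
    """Fold sparse identifiers into a fixed-length bit set (assumed modulo)."""
    return {feature_id % n_bits for feature_id in counts}


# Hypothetical environment identifiers mapped to occurrence counts.
mol_a = Counter({1001: 3, 2002: 1, 3003: 2, 4004: 1})
mol_b = Counter({1001: 1, 2002: 1, 5005: 2, 6010: 1})

sim_counts = tanimoto_counts(mol_a, mol_b)               # counts matter
sim_unfolded_bits = tanimoto_bits(set(mol_a), set(mol_b))  # presence only
sim_folded = tanimoto_bits(fold(mol_a, 8), fold(mol_b, 8))  # with collisions
sim_wide = tanimoto_bits(fold(mol_a, 2**20), fold(mol_b, 2**20))
```

In this toy case the three scores differ for two separate reasons:
counts versus presence/absence, and bit collisions introduced by a
short vector. With a vector long enough to avoid collisions, the folded
score matches the unfolded presence/absence score but still not the
count-based one.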
Take care,
JP
PS: A small suggestion, if I am allowed. The fingerprint classes could
do with an informative toString (or non-Java equivalent). I know
there is ToBitString, but you need to call that explicitly when printing.
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss