Dear Greg,

I'm sorry for causing the confusion, and thanks for your excellent (as always!) explanation. The reason I ran into trouble with fingerprint resolution (apart from my incompetence ;) ) is that my dataset consists of (obviously problematic) organometallics.

Best,
Michal

On Thu, 11 Oct 2018 at 16:59, Greg Landrum <greg.land...@gmail.com> wrote:
> I've been quiet on this one since I'm traveling this week, but I want to
> briefly weigh in on the fingerprint aspects since I think some terms are
> being used incorrectly and that's maybe making things even more confusing.
>
> I believe that the term "collision" as applied to fingerprints normally
> means two different molecular features setting the same bit in the final
> fingerprint. In the case of the Morgan fingerprint, this means that two
> different atom environments would set the same bit. To understand how
> collisions come about, it's worth spending a bit of time describing how a
> Morgan fingerprint is generated.
>
> After finding a "circular" atom environment, the fingerprinting code uses
> a hash function to convert the environment into a number. Let's call this
> the hash value. You can see the hash values for the atom environments of a
> molecule (along with how often the environments occur) using the
> "GetMorganFingerprint()" function:
>
> In [4]: m = Chem.MolFromSmiles('Cc1ccccc1')
>
> In [5]: fp = rdMolDescriptors.GetMorganFingerprint(m,2)
>
> In [6]: fp.GetNonzeroElements()
> Out[6]:
> {98513984: 3,
>  422715066: 1,
>  908339072: 1,
>  951226070: 2,
>  2246728737: 1,
>  2763854213: 1,
>  3207567135: 1,
>  3217380708: 1,
>  3218693969: 5,
>  3999906991: 2,
>  4244175903: 2}
>
> When you ask for a fingerprint as a bit vector, those hash values are
> truncated so that they fit into the size of the fingerprint you asked for:
>
> In [7]: bv = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,4096)
>
> In [8]: bv.GetNumOnBits()
> Out[8]: 11
>
> In [9]: len(bv)
> Out[9]: 4096
>
> Notice here that we have the same number of bits set in the bit vector
> (11) as we did in the original fingerprint.
>
> A collision happens when two different atom environments hash to the same
> value *or* when the truncation to the bit vector results in two different
> hash values ending up in the same bit.
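
The second kind of collision can be reproduced without RDKit. The sketch below is only an illustration: it assumes that the bit vector is produced by taking each hash value modulo the fingerprint length (an assumption about the folding step, not something stated in the message), and it reuses the hash values listed in Out[6]. The helper name fold_hashes is made up for this example.

```python
from collections import defaultdict

# Hash values copied from Out[6] in the quoted session (toluene, radius 2).
HASH_VALUES = [
    98513984, 422715066, 908339072, 951226070, 2246728737, 2763854213,
    3207567135, 3217380708, 3218693969, 3999906991, 4244175903,
]


def fold_hashes(hash_values, n_bits):
    """Map each folded bit position to the hash values that land on it.

    ASSUMPTION: folding is modeled as `hash_value % n_bits`. This is an
    illustrative model of the truncation step, not a quote of RDKit code.
    """
    bits = defaultdict(list)
    for h in hash_values:
        bits[h % n_bits].append(h)
    return dict(bits)


for n_bits in (4096, 256):
    folded = fold_hashes(HASH_VALUES, n_bits)
    # A bit that received more than one hash value is a folding collision.
    collisions = {bit: hs for bit, hs in folded.items() if len(hs) > 1}
    print(n_bits, "bits:", len(folded), "bits set; collisions:", collisions)
```

Under that assumption, 4096 bits keep all 11 environments apart, while at 256 bits the hash values 3207567135 and 4244175903 both land on bit 31, leaving 10 bits set. That agrees with the on-bit counts RDKit reports for this molecule at those two lengths.
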
>
> The first type of collision doesn't happen all that frequently (and isn't
> 100% trivial to detect),[1] but the second happens pretty regularly,
> particularly when you make fingerprints short. Here's an example of that
> for the simple molecule above:
>
> In [12]: bv2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,256)
>
> In [13]: bv2.GetNumOnBits()
> Out[13]: 10
>
> Notice that now only 10 bits are set and remember that we previously had
> 11.
>
> The two factors influencing the number of collisions of the second type
> are the size of the fingerprint - smaller fingerprints = more likelihood of
> collisions - and the radius of the features being used - higher radii end
> up setting more bits, which purely statistically leads to a greater chance
> of collisions.
>
> Collisions by themselves are not necessarily a terrible thing. They do
> result in some information loss, have a small impact on similarity, and a
> somewhat larger (though still not enormous) impact on machine learning
> performance. See the blog posts I mention below for the experiments I did
> here to figure this out.
>
> Two different molecules producing the same fingerprint is a different
> thing. This can be caused by collisions alone (though I would guess this
> happens fairly infrequently), but I think it's more likely that it's a
> limitation of the nature or resolution of the fingerprint. You can test
> the resolution question by checking to see if increasing the radius you use
> allows the molecules to be distinguished from each other. The collision
> question is probably most easily answered by generating the "full"
> fingerprint by calling GetMorganFingerprint() as I show above and looking
> to see how similar the molecules are at that level.
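
The two checks described above can be sketched as a small helper. This is a rough illustration, not RDKit code: it works on plain dictionaries of the kind GetNonzeroElements() returns (hash value mapped to count), it assumes the bit vector is obtained by taking each hash value modulo the fingerprint length, and the function names and toy dictionaries are invented for the example.

```python
def folded_bits(counts, n_bits):
    """On-bit positions of the folded bit vector.

    ASSUMPTION: folding is modeled as `hash_value % n_bits`.
    """
    return {h % n_bits for h in counts}


def explain_identical_bits(counts_a, counts_b, n_bits):
    """Say why two molecules do (or do not) share a folded bit vector.

    `counts_a` and `counts_b` are {hash value: count} dictionaries, e.g. from
    rdMolDescriptors.GetMorganFingerprint(mol, radius).GetNonzeroElements().
    """
    if folded_bits(counts_a, n_bits) != folded_bits(counts_b, n_bits):
        return "distinguishable"
    if counts_a == counts_b:
        # Identical even before folding: a longer bit vector cannot help.
        return "resolution limit"
    if set(counts_a) == set(counts_b):
        # Same environments, different counts: a bit vector drops counts.
        return "counts lost"
    # Different environments that only coincide after folding.
    return "collision"


# Hypothetical toy fingerprints, not taken from real molecules.
mol_a = {31: 1, 64: 2}
mol_b = {31 + 256: 1, 64: 2}   # lands on the same bits as mol_a at 256 bits
mol_c = {31: 1, 64: 5}         # same environments as mol_a, other counts

print(explain_identical_bits(mol_a, mol_b, 256))        # collision
print(explain_identical_bits(mol_a, mol_b, 4096))       # distinguishable
print(explain_identical_bits(mol_a, mol_c, 256))        # counts lost
print(explain_identical_bits(mol_a, dict(mol_a), 256))  # resolution limit
```

If the answer is "resolution limit", rebuilding the dictionaries with a larger radius and calling the helper again corresponds to the resolution check described above.
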
>
> There's a fair amount of information about the impact of bit vector size
> and fingerprint radius on the number of collisions in these RDKit blog
> posts:
> http://rdkit.blogspot.com/2014/02/colliding-bits.html
> http://rdkit.blogspot.com/2014/03/colliding-bits-ii.html
> http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html
>
> I hope this helps a bit,
> -greg
>
> [1] Here's a clear example where it has happened:
> https://github.com/rdkit/rdkit/issues/814
>
> On Wed, Oct 10, 2018 at 10:28 AM Michal Krompiec
> <michal.kromp...@gmail.com> wrote:
>
>> Dear All,
>> Thank you all very much for your feedback! Actually, the number of
>> collisions didn't decrease when I increased the bit length, though
>> increasing the radius to 3 did help a bit. Overall, it is good to know
>> that great results are not to be expected.
>> Best wishes,
>> Michal
>>
>> On Wed, 10 Oct 2018 at 13:31, Chris Earnshaw <cgearns...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> It sounds to me like you're already getting better results than you
>>> could reasonably expect.
>>>
>>> Prediction of melting point is a phenomenally difficult thing to do;
>>> you're trying to find the temperature at which a (generally undefined)
>>> solid crystalline phase is in equilibrium with a (probably even less
>>> defined) liquid phase. You also need to consider that the crystalline
>>> form of your solid phase is not necessarily truly constant - what
>>> polymorph is involved? Melting points of alternative polymorphs can be
>>> radically different and this is one of the real bugbears of
>>> pharmaceutical and agrochemical development. If you haven't found the
>>> most stable form early in the development process there can be very
>>> nasty surprises downstream.
>>>
>>> Expecting to handle all these challenges with a descriptor as simple as
>>> a molecular fingerprint - regardless of bit-length, collisions etc. - is
>>> probably over-optimistic...
>>>
>>> Regards,
>>> Chris Earnshaw
>>>
>>> On Wed, 10 Oct 2018 at 13:16, Michal Krompiec
>>> <michal.kromp...@gmail.com> wrote:
>>>
>>>> Hi Thomas,
>>>> Radius 2, 2048 bits, 5200 data points.
>>>>
>>>> On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis <teva...@gmail.com>
>>>> wrote:
>>>>
>>>>> What's your bitvector length and radius? How many training samples do
>>>>> you have?
>>>>>
>>>>> On Wed, 10 Oct 2018 at 13:51, Michal Krompiec
>>>>> <michal.kromp...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I have a slightly off-topic question. I'm trying to train a neural
>>>>>> network on a dataset of small molecules and their melting points. I
>>>>>> did get a not-so-bad accuracy with Morgan fingerprints, but I've
>>>>>> realised that regardless of FP radius and bitvector length, several
>>>>>> dozen molecules have the same fingerprints but wildly different
>>>>>> melting points. I am pretty sure this is a "solved problem" so I
>>>>>> don't want to reinvent the wheel. What is the recommended/usual way
>>>>>> of dealing with this?
>>>>>> Thanks,
>>>>>> Michal
>>>>>>
>>>>>> _______________________________________________
>>>>>> Rdkit-discuss mailing list
>>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>
>>>>> --
>>>>> ======================================================================
>>>>> Dr Thomas Evangelidis
>>>>> Research Scientist
>>>>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>>>>> Academy of Sciences
>>>>> <https://www.uochb.cz/web/structure/31.html?lang=en>
>>>>> Prague, Czech Republic
>>>>> &
>>>>> CEITEC - Central European Institute of Technology
>>>>> <https://www.ceitec.eu/>
>>>>> Brno, Czech Republic
>>>>>
>>>>> email: teva...@gmail.com
>>>>> website: https://sites.google.com/site/thomasevangelidishomepage/