Dear Greg,
I’m sorry for causing the confusion, and thanks for your excellent (as
always!) explanation. The reason I got into trouble with fingerprint
resolution (apart from my incompetence ;) ) is that my dataset consists
of (obviously problematic) organometallics.
Best,
Michal

On Thu, 11 Oct 2018 at 16:59, Greg Landrum <greg.land...@gmail.com> wrote:

> I've been quiet on this one since I'm traveling this week, but I want to
> briefly weigh in on the fingerprint aspects since I think some terms are
> being used incorrectly and that's maybe making things even more confusing.
>
> I believe that the term "collision" as applied to fingerprints normally
> means two different molecular features setting the same bit in the final
> fingerprint. In the case of the Morgan fingerprint, this means that two
> different atom environments would set the same bit. To understand how
> collisions come about, it's worth spending a bit of time describing how a
> Morgan fingerprint is generated.
> After finding a "circular" atom environment, the fingerprinting code uses
> a hash function to convert the environment into a number. Let's call this
> the hash value. You can see the hash values for the atom environments of a
> molecule (along with how often the environments occur) using the
> "GetMorganFingerprint()" function:
>
> In [4]: m = Chem.MolFromSmiles('Cc1ccccc1')
>
> In [5]: fp = rdMolDescriptors.GetMorganFingerprint(m,2)
>
> In [6]: fp.GetNonzeroElements()
> Out[6]:
> {98513984: 3,
>  422715066: 1,
>  908339072: 1,
>  951226070: 2,
>  2246728737: 1,
>  2763854213: 1,
>  3207567135: 1,
>  3217380708: 1,
>  3218693969: 5,
>  3999906991: 2,
>  4244175903: 2}
>
> When you ask for a fingerprint as a bit vector, those hash values are
> truncated so that they fit into the size of the fingerprint you asked for:
>
> In [7]: bv = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,4096)
>
> In [8]: bv.GetNumOnBits()
> Out[8]: 11
>
> In [9]: len(bv)
> Out[9]: 4096
>
> Notice here that we have the same number of bits set in the bit vector
> (11) as we had distinct hash values in the original fingerprint.
>
> A collision happens when two different atom environments hash to the same
> value *or* when the truncation to the bit vector results in two different
> hash values ending up in the same bit.
>
> The first type of collision doesn't happen all that frequently (and isn't
> 100% trivial to detect),[1] but the second happens pretty regularly,
> particularly when you make fingerprints short. Here's an example of that
> for the simple molecule above:
>
> In [12]: bv2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,256)
>
> In [13]: bv2.GetNumOnBits()
> Out[13]: 10
>
> Notice that now only 10 bits are set and remember that we previously had
> 11.
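
To see which environments collide at 256 bits, the hash values listed earlier can be folded by hand. This is only a sketch: it assumes the truncation is a simple modulo (bit = hash % nBits), which the text above does not state explicitly, and the helper names `fold` and `collisions` are made up for illustration. It needs no RDKit, just the numbers from the Out[6] listing.

```python
# Hash values copied from the GetNonzeroElements() output quoted above.
hash_values = [
    98513984, 422715066, 908339072, 951226070, 2246728737, 2763854213,
    3207567135, 3217380708, 3218693969, 3999906991, 4244175903,
]


def fold(hashes, n_bits):
    """Group hash values by target bit, ASSUMING bit = hash % n_bits."""
    by_bit = {}
    for h in hashes:
        by_bit.setdefault(h % n_bits, []).append(h)
    return by_bit


def collisions(hashes, n_bits):
    """Keep only the bits that receive more than one distinct hash value."""
    return {bit: hs for bit, hs in fold(hashes, n_bits).items() if len(hs) > 1}


print(len(fold(hash_values, 4096)))  # 11 distinct bits
print(len(fold(hash_values, 256)))   # 10 distinct bits
print(collisions(hash_values, 256))  # {31: [3207567135, 4244175903]}
```

Under the modulo assumption the counts agree with the GetNumOnBits() results shown above (11 at 4096 bits, 10 at 256 bits), and the one lost bit comes from two hash values both landing on bit 31.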
>
> The two factors influencing the number of collisions of the second type
> are the size of the fingerprint - smaller fingerprints = more likelihood of
> collisions - and the radius of the features being used - higher radii end
> up setting more bits, which purely statistically leads to a greater chance
> of collisions.
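
A rough feel for how these two factors interact can be had from a balls-into-bins estimate. The sketch below assumes every feature hashes to a uniformly random, independent bit, which real fingerprints only approximate; the function names are hypothetical and the feature counts in the loop are arbitrary illustrative values, not measurements.

```python
def expected_on_bits(n_features, n_bits):
    """Expected number of distinct bits set when n_features are hashed
    uniformly at random into n_bits (idealised assumption)."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)


def expected_lost_features(n_features, n_bits):
    """Expected number of features that share a bit with an earlier one."""
    return n_features - expected_on_bits(n_features, n_bits)


# More features (e.g. from a larger radius) or fewer bits both raise the estimate.
for n_bits in (256, 1024, 4096):
    for n_features in (11, 50, 100):
        lost = expected_lost_features(n_features, n_bits)
        print(f"nBits={n_bits:5d} features={n_features:4d} expected lost={lost:.2f}")
```

The estimate grows roughly with the square of the number of features divided by the fingerprint length, so doubling the number of environments costs more than halving the length saves.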
>
> Collisions by themselves are not necessarily a terrible thing. They do
> result in some information loss, have a small impact on similarity, and a
> somewhat larger (though still not enormous) impact on machine learning
> performance. See the blog posts I mention below for the experiments I did
> here to figure this out.
>
> Two different molecules producing the same fingerprint is a different
> thing. This can be caused by collisions alone (though I would guess this
> happens fairly rarely), but I think it's more likely that it's a
> limitation of the nature of or resolution of the fingerprint. You can test
> the resolution question by checking to see if increasing the radius you use
> allows the molecules to be distinguished from each other. The first
> question is probably most easily answered by generating the "full"
> fingerprint by calling GetMorganFingerprint() as I show above and looking
> to see how similar the molecules are at that level.
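
One way to organise that check is sketched below. It works on plain {hash value: count} dictionaries of the kind GetNonzeroElements() returns, so it runs without RDKit. The function names, the category labels, and the toy dictionaries are all hypothetical, the Tanimoto is a hand-written min/max version rather than RDKit's own similarity function, and the folding is again assumed to be hash % nBits.

```python
def sparse_tanimoto(counts_a, counts_b):
    """Count-based Tanimoto (sum of minima / sum of maxima) for two
    {hash_value: count} dicts."""
    keys = set(counts_a) | set(counts_b)
    if not keys:
        return 1.0  # convention chosen here: two empty fingerprints match
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return shared / total


def diagnose(counts_a, counts_b, n_bits):
    """Classify why two molecules might share a folded bit vector."""
    if set(counts_a) == set(counts_b):
        # Same environments: bit vectors match at any length, so this is a
        # limitation of the fingerprint itself (try a larger radius or counts).
        return "same_features"
    folded_a = {h % n_bits for h in counts_a}  # ASSUMES modulo folding
    folded_b = {h % n_bits for h in counts_b}
    if folded_a == folded_b:
        return "collision"  # differ before folding, identical after
    return "distinct"


# Toy, made-up fingerprints purely to exercise the three cases.
mol_a = {31: 1, 64: 2}
mol_b = {31 + 256: 1, 64: 2}   # differs from mol_a only by a multiple of 256
mol_c = {31: 3, 64: 1}         # same environments as mol_a, different counts

print(diagnose(mol_a, mol_b, 256), sparse_tanimoto(mol_a, mol_b))
print(diagnose(mol_a, mol_b, 4096))
print(diagnose(mol_a, mol_c, 256), sparse_tanimoto(mol_a, mol_c))
```

A pair that comes back as "same_features" with a count-based similarity below 1 differs only in how often each environment occurs, which a bit vector cannot represent at any length.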
>
> There's a fair amount of information about the impact of bit vector length
> and fingerprint radius on the number of collisions in these RDKit blog posts:
> http://rdkit.blogspot.com/2014/02/colliding-bits.html
> http://rdkit.blogspot.com/2014/03/colliding-bits-ii.html
> http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html
>
> I hope this helps a bit,
> -greg
> [1] Here's a clear example where it has happened:
> https://github.com/rdkit/rdkit/issues/814
>
>
>
> On Wed, Oct 10, 2018 at 10:28 AM Michal Krompiec <
> michal.kromp...@gmail.com> wrote:
>
>> Dear All,
>> Thank you all very much for your feedback! Actually, the number of
>> collisions didn't decrease when I increased the bit length, though
>> increasing radius to 3 did help a bit. Overall, it is good to know that
>> great results are not to be expected.
>> Best wishes,
>> Michal
>>
>> On Wed, 10 Oct 2018 at 13:31, Chris Earnshaw <cgearns...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> It sounds to me like you're already getting better results than you
>>> could reasonably expect.
>>>
>>> Prediction of melting point is a phenomenally difficult thing to do;
>>> you're trying to find the temperature at which a (generally undefined)
>>> solid crystalline phase is in equilibrium with a (probably even less
>>> defined) liquid phase. You also need to consider that the crystalline form
>>> of your solid phase is not necessarily truly constant - what polymorph is
>>> involved? Melting points of alternative polymorphs can be radically
>>> different and this is one of the real bugbears of pharmaceutical and
>>> agrochemical development. If you haven't found the most stable form early
>>> in the development process there can be very nasty surprises downstream.
>>>
>>> Expecting to handle all these challenges with a descriptor as simple as
>>> a molecular fingerprint - regardless of bit-length, collisions etc. - is
>>> probably over-optimistic...
>>>
>>> Regards,
>>> Chris Earnshaw
>>>
>>> On Wed, 10 Oct 2018 at 13:16, Michal Krompiec <michal.kromp...@gmail.com>
>>> wrote:
>>>
>>>> Hi Thomas,
>>>> Radius 2, 2048 bits, 5200 data points.
>>>>
>>>> On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis <teva...@gmail.com>
>>>> wrote:
>>>>
>>>>> What's your bitvector length and radius? How many training samples do
>>>>> you have?
>>>>>
>>>>> On Wed, 10 Oct 2018 at 13:51, Michal Krompiec <
>>>>> michal.kromp...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I have a slightly off-topic question. I'm trying to train a neural
>>>>>> network on a dataset of small molecules and their melting points. I did
>>>>>> get a not-so-bad accuracy with Morgan fingerprints, but I've realised that
>>>>>> regardless of FP radius and bitvector length, several dozen molecules
>>>>>> have the same fingerprints but wildly different melting points. I am
>>>>>> pretty sure this is a "solved problem" so I don't want to reinvent the
>>>>>> wheel. What is the recommended/usual way of dealing with this?
>>>>>> Thanks,
>>>>>> Michal
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Rdkit-discuss mailing list
>>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> ======================================================================
>>>>>
>>>>> Dr Thomas Evangelidis
>>>>>
>>>>> Research Scientist
>>>>>
>>>>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>>>>> Academy of Sciences
>>>>> <https://www.uochb.cz/web/structure/31.html?lang=en>
>>>>> Prague, Czech Republic
>>>>>   &
>>>>> CEITEC - Central European Institute of Technology
>>>>> <https://www.ceitec.eu/>
>>>>> Brno, Czech Republic
>>>>>
>>>>> email: teva...@gmail.com
>>>>>
>>>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>>>
>>>>>
