I've been quiet on this one since I'm traveling this week, but I want to
briefly weigh in on the fingerprint aspects since I think some terms are
being used incorrectly and that's maybe making things even more confusing.

I believe that the term "collision" as applied to fingerprints normally
means two different molecular features setting the same bit in the final
fingerprint. In the case of the Morgan fingerprint, this means that two
different atom environments would set the same bit. To understand how
collisions come about, it's worth spending a bit of time describing how a
Morgan fingerprint is generated.

After finding a "circular" atom environment, the fingerprinting code uses a
hash function to convert the environment into a number. Let's call this the
hash value. You can see the hash values for the atom environments of a
molecule (along with how often the environments occur) using the
"GetMorganFingerprint()" function:

In [2]: from rdkit import Chem

In [3]: from rdkit.Chem import rdMolDescriptors

In [4]: m = Chem.MolFromSmiles('Cc1ccccc1')

In [5]: fp = rdMolDescriptors.GetMorganFingerprint(m,2)

In [6]: fp.GetNonzeroElements()
Out[6]:
{98513984: 3,
 422715066: 1,
 908339072: 1,
 951226070: 2,
 2246728737: 1,
 2763854213: 1,
 3207567135: 1,
 3217380708: 1,
 3218693969: 5,
 3999906991: 2,
 4244175903: 2}

When you ask for a fingerprint as a bit vector, those hash values are
truncated so that they fit into the size of the fingerprint you asked for:

In [7]: bv = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,4096)

In [8]: bv.GetNumOnBits()
Out[8]: 11

In [9]: len(bv)
Out[9]: 4096

Notice here that we have the same number of bits set in the bit vector (11)
as we did in the original fingerprint, so nothing has collided at this
length.
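
To make the truncation step concrete, here is a minimal sketch in plain
Python that maps the hash values listed above onto bit positions. It assumes
the truncation is a simple modulo by the fingerprint length; that is my
assumption for illustration (it is consistent with the bit counts shown in
this message), and the helper name fold() is made up here, not an RDKit
function.

```python
# Hash values copied from the GetNonzeroElements() output above
# (toluene, radius 2).
HASHES = [98513984, 422715066, 908339072, 951226070, 2246728737,
          2763854213, 3207567135, 3217380708, 3218693969, 3999906991,
          4244175903]


def fold(hash_value, n_bits):
    """Map a hash value to a bit index.

    Assumption: the bit index is hash_value modulo the fingerprint length.
    """
    return hash_value % n_bits


# Collect the distinct bit positions for a 4096-bit fingerprint.
bits_4096 = sorted({fold(h, 4096) for h in HASHES})
print(len(HASHES), len(bits_4096))  # prints: 11 11
```

Under that assumption each of the 11 hash values gets a bit of its own at
4096 bits, which matches the GetNumOnBits() result above.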

A collision happens when two different atom environments hash to the same
value *or* when the truncation to the bit vector results in two different
hash values ending up in the same bit.

The first type of collision doesn't happen all that frequently (and isn't
100% trivial to detect),[1] but the second happens pretty regularly,
particularly when you make fingerprints short. Here's an example of that
for the simple molecule above:

In [12]: bv2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(m,2,256)

In [13]: bv2.GetNumOnBits()
Out[13]: 10

Notice that now only 10 bits are set and remember that we previously had 11.
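
If you want to see which hash values share a bit, a small sketch like the
following works on the hash values listed earlier in this message. As
before, it assumes the bit index is the hash value modulo the fingerprint
length, and colliding_bits() is a hypothetical helper written for this
example, not part of RDKit.

```python
from collections import defaultdict

# Hash values copied from the GetNonzeroElements() output for toluene
# (radius 2) shown earlier in this message.
HASHES = [98513984, 422715066, 908339072, 951226070, 2246728737,
          2763854213, 3207567135, 3217380708, 3218693969, 3999906991,
          4244175903]


def colliding_bits(hashes, n_bits):
    """Return {bit_index: [hash values]} for bits hit by more than one hash.

    Assumption: the bit index is hash_value % n_bits.
    """
    by_bit = defaultdict(list)
    for h in hashes:
        by_bit[h % n_bits].append(h)
    return {bit: hs for bit, hs in by_bit.items() if len(hs) > 1}


print(colliding_bits(HASHES, 256))   # prints: {31: [3207567135, 4244175903]}
print(colliding_bits(HASHES, 4096))  # prints: {}
```

Two of the eleven hash values land on bit 31 of the 256-bit fingerprint,
which accounts for the drop from 11 to 10 set bits.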

The two factors influencing the number of collisions of the second type are
the size of the fingerprint - smaller fingerprints = more likelihood of
collisions - and the radius of the features being used - higher radii end
up setting more bits, which purely statistically leads to a greater chance
of collisions.
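
For a rough feel for how these two factors interact, here is a
back-of-the-envelope sketch. It assumes every feature lands on a uniformly
random bit, which is an idealization and not something measured on real
fingerprints; the function name and the feature counts are illustrative
only.

```python
def expected_lost_bits(n_features, n_bits):
    """Expected number of bits lost to collisions.

    This is n_features minus the expected number of distinct bits set,
    assuming each feature is assigned a uniformly random bit index.
    """
    expected_on = n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)
    return n_features - expected_on


# More features (e.g. from a larger radius) or fewer bits both raise the
# expected loss.
for n_bits in (256, 1024, 4096):
    for n_features in (11, 50, 200):
        lost = expected_lost_bits(n_features, n_bits)
        print(n_bits, n_features, round(lost, 2))
```

The blog posts linked at the end of this message look at the same question
with real molecules, which is the more trustworthy answer.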

Collisions by themselves are not necessarily a terrible thing. They do
result in some information loss, have a small impact on similarity, and a
somewhat larger (though still not enormous) impact on machine learning
performance. See the blog posts I mention below for the experiments I did
here to figure this out.

Two different molecules producing the same fingerprint is a different
thing. This can be caused by collisions alone (though I would guess this
happens fairly rarely), but I think it's more likely that it's a
limitation of the nature or the resolution of the fingerprint. You can test
the resolution question by checking to see if increasing the radius you use
allows the molecules to be distinguished from each other. The collision
question is probably most easily answered by generating the "full"
fingerprint by calling GetMorganFingerprint() as I show above and looking
to see how similar the molecules are at that level.
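
One way to do that comparison is sketched below. It works on plain
{hash value: count} dictionaries like the ones GetNonzeroElements() returns
above; the two helper functions are hypothetical names for this example and
the input values are toy numbers, not real hash values.

```python
def compare_sparse(fp_a, fp_b):
    """Compare two {hash_value: count} dicts.

    Returns the hash values found in only one fingerprint and the shared
    hash values whose counts differ.
    """
    keys_a, keys_b = set(fp_a), set(fp_b)
    shared = keys_a & keys_b
    return {
        "only_in_a": sorted(keys_a - keys_b),
        "only_in_b": sorted(keys_b - keys_a),
        "count_differs": sorted(k for k in shared if fp_a[k] != fp_b[k]),
    }


def same_environments(fp_a, fp_b):
    """True if both fingerprints contain exactly the same hash values."""
    diff = compare_sparse(fp_a, fp_b)
    return not diff["only_in_a"] and not diff["only_in_b"]


# Toy values for illustration only; in practice pass
# fp.GetNonzeroElements() for each of the two molecules.
a = {101: 2, 202: 1, 303: 4}
b = {101: 2, 202: 1, 303: 5}
print(compare_sparse(a, b))
print(same_environments(a, b))  # prints: True
```

If same_environments() is True for two different molecules, their bit
vectors will be identical at any length, since a bit vector does not record
counts; that points to the resolution of the fingerprint rather than to
collisions. If the hash values differ but the bit vectors are the same, the
truncation is the cause.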

There's a fair amount of information about the impact of bit vector length
and fingerprint radius on the number of collisions in these RDKit blog posts:
http://rdkit.blogspot.com/2014/02/colliding-bits.html
http://rdkit.blogspot.com/2014/03/colliding-bits-ii.html
http://rdkit.blogspot.com/2016/02/colliding-bits-iii.html

I hope this helps a bit,
-greg

[1] Here's a clear example where it has happened:
https://github.com/rdkit/rdkit/issues/814



On Wed, Oct 10, 2018 at 10:28 AM Michal Krompiec <michal.kromp...@gmail.com>
wrote:

> Dear All,
> Thank you all very much for your feedback! Actually, the number of
> collisions didn't decrease when I increased the bit length, though
> increasing radius to 3 did help a bit. Overall, it is good to know that
> great results are not to be expected.
> Best wishes,
> Michal
>
> On Wed, 10 Oct 2018 at 13:31, Chris Earnshaw <cgearns...@gmail.com> wrote:
>
>> Hi
>>
>> It sounds to me like you're already getting better results than you could
>> reasonably expect.
>>
>> Prediction of melting point is a phenomenally difficult thing to do;
>> you're trying to find the temperature at which a (generally undefined)
>> solid crystalline phase is in equilibrium with a (probably even less
>> defined) liquid phase. You also need to consider that the crystalline form
>> of your solid phase is not necessarily truly constant - what polymorph is
>> involved? Melting points of alternative polymorphs can be radically
>> different and this is one of the real bugbears of pharmaceutical and
>> agrochemical development. If you haven't found the most stable form early
>> in the development process there can be very nasty surprises downstream.
>>
>> Expecting to handle all these challenges with a descriptor as simple as a
>> molecular fingerprint - regardless of bit-length, collisions etc. - is
>> probably over-optimistic...
>>
>> Regards,
>> Chris Earnshaw
>>
>> On Wed, 10 Oct 2018 at 13:16, Michal Krompiec <michal.kromp...@gmail.com>
>> wrote:
>>
>>> Hi Thomas,
>>> Radius 2, 2048 bits, 5200 data points.
>>>
>>> On Wed, 10 Oct 2018 at 13:13, Thomas Evangelidis <teva...@gmail.com>
>>> wrote:
>>>
>>>> What's your bitvector length and radius? How many training samples do
>>>> you have?
>>>>
>>>> On Wed, 10 Oct 2018 at 13:51, Michal Krompiec <
>>>> michal.kromp...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>> I have a slightly off-topic question. I'm trying to train a neural
>>>>> network on a dataset of small molecules and their melting points. I did
>>>>> get a not-so-bad accuracy with Morgan fingerprints, but I've realised
>>>>> that regardless of FP radius and bitvector length, several dozen
>>>>> molecules have the same fingerprints but wildly different melting
>>>>> points. I am pretty sure this is a "solved problem" so I don't want to
>>>>> reinvent the wheel. What is the recommended/usual way of dealing with
>>>>> this?
>>>>> Thanks,
>>>>> Michal
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ======================================================================
>>>>
>>>> Dr Thomas Evangelidis
>>>>
>>>> Research Scientist
>>>>
>>>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>>>> Academy of Sciences
>>>> <https://www.uochb.cz/web/structure/31.html?lang=en>
>>>> Prague, Czech Republic
>>>>   &
>>>> CEITEC - Central European Institute of Technology
>>>> <https://www.ceitec.eu/>
>>>> Brno, Czech Republic
>>>>
>>>> email: teva...@gmail.com
>>>>
>>>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>>>
>>>>