Hi Chris,

  The FP2 fingerprint works along these lines:

1) Choose a fingerprint size 'n', which is a power of 2.
2) Allocate a vector of w = n/32 words to store the bitstring
3) For each linear subpath up to length 7 (these correspond
     to n-grams for words):
  a) use a hash based on the atom and bond properties to compute a value H
  b) "fold" this to the fingerprint size, h = H % n
  c) set bit h in the vector to 1,
       that is set word ⌊h/32⌋ bit (h%32) to 1
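
Here is a minimal sketch of steps 3b and 3c in Python. The hash is a
stand-in (SHA-256 of a path string) and the path strings are made up for
illustration. It is not the hash Open Babel uses, so the bits will not match
real FP2 output. The names fold_and_set and toy_fingerprint are hypothetical.

```python
import hashlib


def fold_and_set(words, H, n):
    """Fold hash value H into an n-bit fingerprint and set that bit."""
    h = H % n  # n is a power of 2, so this is the same as H & (n - 1)
    words[h // 32] |= 1 << (h % 32)  # word h//32, bit h%32
    return h


def toy_fingerprint(paths, n=1024):
    """Build a fingerprint from path strings using a stand-in hash."""
    words = [0] * (n // 32)  # w = n/32 words of 32 bits each
    for path in paths:
        # Assumption: SHA-256 stands in for the atom/bond property hash.
        digest = hashlib.sha256(path.encode("ascii")).digest()
        H = int.from_bytes(digest[:8], "little")
        fold_and_set(words, H, n)
    return words


words = toy_fingerprint(["C", "C-C", "C-C-O"])
print(len(words))
```

Two different paths can fold to the same bit, which is one reason
fingerprints are not unique.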

Fingerprints need not be unique. Do not think of them like cryptographic 
fingerprints, where fp(A) == fp(B) means that, with high probability, A == B.

Instead, chemical fingerprints are developed for two primary purposes:

A) substructure fingerprints have the property that fp(A) not contained in 
fp(B) means that A is not a substructure of B

B) similarity fingerprints have the property that similarity(fp(A), fp(B)) is 
correlated with some concept of chemical similarity between A and B.
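
Both properties can be sketched on the word-vector representation. This
assumes each fingerprint is a list of 32-bit integers of equal length. The
function names are mine, not part of any Open Babel API.

```python
def may_be_substructure(fp_a, fp_b):
    """Screen for purpose A.

    False means A is definitely not a substructure of B.
    True only means "maybe"; a real substructure match is still needed.
    """
    return all((a & b) == a for a, b in zip(fp_a, fp_b))


def tanimoto(fp_a, fp_b):
    """Similarity for purpose B: bits in common / bits set in either."""
    both = sum(bin(a & b).count("1") for a, b in zip(fp_a, fp_b))
    either = sum(bin(a | b).count("1") for a, b in zip(fp_a, fp_b))
    return both / either if either else 0.0


fp_a = [0b0110, 0]
fp_b = [0b1110, 1]
print(may_be_substructure(fp_a, fp_b))  # True
print(may_be_substructure(fp_b, fp_a))  # False
print(tanimoto(fp_a, fp_b))             # 0.5
```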

Most people use fingerprints for purpose B, and that is the usual way FP2 
fingerprints are used.



> On Dec 6, 2021, at 14:59, Wolcott, Chris (NIH/NCI) [C] via OpenBabel-discuss 
> <openbabel-discuss@lists.sourceforge.net> wrote:
> I did find an error that the C#/MongoDB interface did not understand Unsigned 
> Integers and stored everything as signed.  Doesn't look like it was a big 
> problem because when the data was retrieved it was converted back to unsigned 
> before being passed to Tanimoto.

That explains why I did not understand the source of your list of fingerprint 
values.

Those values come from the vector of 32-bit integers used to store the full 
fingerprint.
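
The signed/unsigned storage issue does not change the bit pattern, only how
the 32 bits are read as a number. A small sketch of the round trip, using one
of your values that is larger than 2**31 (the helper names are hypothetical):

```python
def to_signed32(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def to_unsigned32(value):
    """Reinterpret a (possibly negative) 32-bit integer as unsigned."""
    return value & 0xFFFFFFFF


stored = to_signed32(2415919137)
print(stored)                 # -1879048159
print(to_unsigned32(stored))  # 2415919137
```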

Fingerprints are more usually represented as a list of the indices of the 
bits that are set.

In this case the set bits are 5, 9, 11, 13, ..., 1010, as the following 
bit of Python shows:

>>> values = [8399392, 537051136, 393233, 134218496, 2415919137, 8388608, 1073741824, 805323777,
...    168820760, 931135619, 941393456, 1073741856, 513, 31465472, 33554432, 270532616,
...    1016076, 2151158792, 25698305, 2516617274, 1073983488, 2097156, 16843232, 2097152,
...    536875016, 0, 2097168, 1835200, 2214659584, 1065216, 16808960, 491586]
>>> [bitno for bitno in range(1024) if values[bitno // 32] & (1 << (bitno % 32))]
[5, 9, 11, 13, 23, 46, 47, 49, 61, 64, 68, 81, 82, 104, 105, 123, 128, 133, 
156, 159, 183, 222, 224, 234, 238, 252, 253, 259, 260, 276, 281, 283, 288, 289, 
295, 311, 312, 313, 314, 316, 317, 324, 325, 329, 330, 335, 338, 339, 340, 347, 
348, 349, 357, 382, 384, 393, 429, 437, 438, 439, 440, 473, 483, 501, 508, 514, 
515, 520, 527, 528, 529, 530, 531, 547, 554, 556, 563, 564, 565, 575, 576, 589, 
595, 599, 600, 609, 611, 612, 613, 619, 623, 633, 634, 636, 639, 652, 653, 655, 
656, 657, 670, 674, 693, 709, 710, 711, 712, 720, 728, 757, 771, 780, 797, 836, 
853, 870, 871, 882, 883, 884, 905, 906, 912, 922, 927, 936, 942, 948, 970, 971, 
972, 973, 974, 984, 993, 998, 1007, 1008, 1009, 1010]

I'll verify that using Open Babel's "pybel" interface for Python:

>>> from openbabel import pybel
>>> m = pybel.readstring("smi",
...    "O=C1N[C@H]2C[C@H](N(C2)Cc2ccncc2)C(=O)N2CCO[C@@H](C2)CN(C[C@H]2O[C@@H](C1)[C@H](O)[C@@H]2O)C(=O)C1CC1")
>>> m.calcfp("FP2").bits
[6, 10, 12, 14, 24, 47, 48, 50, 62, 65, 69, 82, 83, 105, 106, 124, 129, 134, 
157, 160, 184, 223, 225, 235, 239, 253, 254, 260, 261, 277, 282, 284, 289, 290, 
296, 312, 313, 314, 315, 317, 318, 325, 326, 330, 331, 336, 339, 340, 341, 348, 
349, 350, 358, 383, 385, 394, 430, 438, 439, 440, 441, 474, 484, 502, 509, 515, 
516, 521, 528, 529, 530, 531, 532, 548, 555, 557, 564, 565, 566, 576, 577, 590, 
596, 600, 601, 610, 612, 613, 614, 620, 624, 634, 635, 637, 640, 653, 654, 656, 
657, 658, 671, 675, 694, 710, 711, 712, 713, 721, 729, 758, 772, 781, 798, 837, 
854, 871, 872, 883, 884, 885, 906, 907, 913, 923, 928, 937, 943, 949, 971, 972, 
973, 974, 975, 985, 994, 999, 1008, 1009, 1010, 1011]

(Pybel fingerprint indices start from 1, which is why there's a difference of 
one in the output.)
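
Going the other way, here is a sketch that packs Pybel's 1-based bit list back
into 32-bit words. The function name bits_to_words is hypothetical, and the
default of 1024 bits is an assumption that matches the 32 words in your list.

```python
def bits_to_words(one_based_bits, n=1024):
    """Pack 1-based bit positions into a list of n/32 32-bit words."""
    words = [0] * (n // 32)
    for bit in one_based_bits:
        h = bit - 1  # convert Pybel's 1-based index to a 0-based one
        words[h // 32] |= 1 << (h % 32)
    return words


# The first five Pybel bits all fall in word 0.
print(bits_to_words([6, 10, 12, 14, 24])[0])  # 8399392, your first value
```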

With this knowledge you may be able to do the Tanimoto calculation all in 
MongoDB, for example, by converting the "1" bit positions to an array field and 
using the aggregation framework.

(These are magic words found by reading 
https://stackoverflow.com/questions/27805634/can-i-calculate-the-similarity-of-document-fields-using-mapreduce ; 
note that "Tanimoto similarity" is the domain-specific term for what the IR 
field generally refers to as "Jaccard similarity").
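
As a rough, untested sketch of what that might look like, here is an
aggregation pipeline built as plain Python data. It assumes each document
stores its "1" bit positions in an array field, which I have called "bits";
that field name and the function name are hypothetical. The operators
($setIntersection, $setUnion, $size, $divide, $cond) are standard aggregation
operators, but I have not run this against your data.

```python
def tanimoto_pipeline(query_bits, field="bits"):
    """Build a MongoDB aggregation pipeline ranking documents by Tanimoto."""
    ref = "$" + field
    return [
        {"$project": {
            # Count of bit positions shared with the query
            "both": {"$size": {"$setIntersection": [ref, query_bits]}},
            # Count of bit positions set in either fingerprint
            "either": {"$size": {"$setUnion": [ref, query_bits]}},
        }},
        {"$project": {
            # Guard against dividing by zero when both bit lists are empty
            "tanimoto": {"$cond": [
                {"$eq": ["$either", 0]},
                0,
                {"$divide": ["$both", "$either"]},
            ]},
        }},
        {"$sort": {"tanimoto": -1}},
    ]


pipeline = tanimoto_pipeline([5, 9, 11, 13, 23])
# With a driver such as PyMongo this would be passed to
# collection.aggregate(pipeline); "collection" is not defined here.
print(len(pipeline))
```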

There are many different fingerprint types. The ECFP fingerprints in more 
recent releases of Open Babel may be more relevant than the Daylight-like FP2 
fingerprints. Deciding which to use will require talking to your user base.

Best regards,

                                Andrew
                                da...@dalkescientific.com
