I'm trying to understand how to use the ECFP fingerprints added in Open Babel
2.4.
They generate fingerprints where the length is a function of the number of
heavy atoms. I want a fixed length binary fingerprint so I can compare two
fingerprints using the normal binary Tanimoto. (This is for chemfp.) It doesn't
seem possible. It seems like what's missing is something like a hash method to
turn these per-atom values into something that fits in with how the other
fingerprints work.
Here's an example which shows how the length varies:
>>> import openbabel as ob
>>> import pybel
>>>
>>> mol1 = pybel.readstring("smi", "C").OBMol
>>> mol3 = pybel.readstring("smi", "CCC").OBMol
>>> mol9 = pybel.readstring("smi", "C"*9).OBMol
>>> mol_info = [("mol1", mol1), ("mol3", mol3), ("mol9", mol9)]
>>>
>>> fptype = ob.OBFingerprint.FindFingerprint("ECFP0")
>>>
>>> # Show that the size is a function of length
... for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp)
...     print("%s: %r" % (name, list(fp)))
...
True
mol1: [1526808443]
True
mol3: [3405580958, 3405580958, 3756301279]
True
mol9: [3405580958, 3405580958, 3756301279, 3756301279, 3756301279, 3756301279,
3756301279, 3756301279, 3756301279]
I understand what it's showing. Each heavy atom has its own value in the list,
and the list is sorted to give a canonical ordering.
More specifically, the [CH4] generates the characteristic value 1526808443,
each of the two "[CH3]-" atoms generates the characteristic value 3405580958,
and each "-[CH2]-" generates the characteristic value 3756301279.
However, the non-ECFP fingerprints all generate fixed-length fingerprints, and
parts of Open Babel will break with a variable-length fingerprint.
For example, the fast search indexing in fingerprint.cpp:322
FastSearchIndex::Add() assumes the fingerprint vector returned from
GetFingerprint() will have a constant length, where headwords = vectors.size().
If you try to generate a .fs file using "-ofs -xfECFP0" then it will work, but
the similarity search will fail with "Difficulty reading from index".
Is there an Open Babel function to compare two of these variable-length
fingerprints? It looks like a count-based Tanimoto is needed, so mol3 and mol9
have a similarity of (2+1)/(2+7) = 3/9 = 1/3.
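A minimal sketch of that count-based Tanimoto, in plain Python. This is not an
Open Babel function; the name count_tanimoto is made up for illustration, and
the input lists are the ECFP0 values printed above. It sums the smaller count
of each value for the numerator and the larger count for the denominator:

```python
from collections import Counter
from fractions import Fraction

def count_tanimoto(values1, values2):
    """Count-based (multiset) Tanimoto: sum of min counts / sum of max counts."""
    c1, c2 = Counter(values1), Counter(values2)
    keys = set(c1) | set(c2)
    # A Counter returns 0 for a missing key, so absent values count as zero.
    shared = sum(min(c1[k], c2[k]) for k in keys)
    total = sum(max(c1[k], c2[k]) for k in keys)
    if total == 0:
        return Fraction(0)  # convention chosen here for two empty inputs
    return Fraction(shared, total)

# Per-atom values copied from the mol3 and mol9 output above.
mol3_fp = [3405580958, 3405580958, 3756301279]
mol9_fp = [3405580958, 3405580958] + [3756301279] * 7

print(count_tanimoto(mol3_fp, mol9_fp))  # 1/3
```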
Is there any way to turn this into a useful fixed-length fingerprint? I tried
to generate FPS output using "-ofps -xfECFP0" but the fingerprint content was
empty.
I could zero-pad small fingerprints, but it's not really possible to compare,
say, "C", "O", and "CO" as the corresponding values of [X, 0], [Y, 0], and
either [X, Y] or [Y, X] won't give the right comparison scores.
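To make the padding problem concrete, here is a small sketch using made-up
stand-ins for X and Y (they are not real ECFP values) and a hypothetical
binary_tanimoto helper that compares word lists position by position, the way
a normal binary Tanimoto would:

```python
def binary_tanimoto(words1, words2):
    """Bitwise Tanimoto over two equal-length lists of unsigned words."""
    both = sum(bin(a & b).count("1") for a, b in zip(words1, words2))
    either = sum(bin(a | b).count("1") for a, b in zip(words1, words2))
    return both / either if either else 0.0

# Hypothetical per-atom values with no bits in common.
X, Y = 0b00001111, 0b11110000

c_fp = [X, 0]   # "C", zero-padded to two words
o_fp = [Y, 0]   # "O", zero-padded to two words

# "CO" could come out as either ordering, depending on how the values sort.
print(binary_tanimoto(c_fp, [X, Y]))  # 0.5
print(binary_tanimoto(c_fp, [Y, X]))  # 0.0
print(binary_tanimoto(o_fp, [X, Y]))  # 0.0
```

The same feature lands in a different word depending on what else is in the
molecule, so the score reflects the sort order rather than the shared atoms.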
The current folding method also isn't really useful for larger fingerprints.
There's an nBits parameter of GetFingerprint():
>>> for name, mol in mol_info:
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s (fold 128): %r" % (name, list(fp)))
...
True
mol1 (fold 128): [1526808443]
True
mol3 (fold 128): [3405580958, 3405580958, 3756301279]
True
mol9 (fold 128): [3757939679, 3757939679, 3756301279, 3756301279]
The underlying code in fingerecfp.cpp implements this by calling Fold().
However, if you want to be able to compare the post-folded fingerprints, then
this only works if the initial positions are globally invariant for the given
characteristic. But in the ECFP case the initial position depends on the other
features in the molecule, because the fingerprints are sorted.
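(Also, there's a bug where nBits has no effect until the unfolded fingerprint
has at least twice as many bits as nBits. With nBits=128, which is four 32-bit
words, nothing is folded until the molecule has 8 heavy atoms: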
>>> for i in range(4, 10):
...     mol = pybel.readstring("smi", "C"*i).OBMol
...     fp = ob.vectorUnsignedInt()
...     fptype.GetFingerprint(mol, fp, 128)
...     print("%s: %r" % (i, list(fp)))
...
True
4: [4293393407, 3757939679, 4026359518, 4286507510]
True
5: [4293393407, 4294433791, 4026359518, 3942468319, 4293835775]
True
6: [4293393407, 4294433791, 4294433791, 3942472702, 4026359518, 4293835775]
True
7: [4293393407, 4294433791, 4294433791, 4294433791, 4026359518, 3942468319,
4293835775]
True
8: [4294441983, 4294959103, 4294967295, 4294966271]
True
9: [4294441983, 4294433791, 4294959103, 4294958079]
)
To make a long email short, it feels like there should be an entirely different
function than folding to turn these lists of per-atom ECFP values into the type
of fingerprint that the rest of Open Babel (and of chemfp) can use.
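As a rough sketch of the kind of function I mean (this is not anything Open
Babel provides; hash_to_bits, binary_tanimoto, and the default of 1024 bits
are all assumptions for illustration), each per-atom value could pick a bit
position by itself, independent of the other atoms in the molecule:

```python
def hash_to_bits(values, nbits=1024):
    """Set bit (value % nbits) for each per-atom value.

    Returns a Python int used as a fixed-width bitset. The position of a
    feature depends only on its own value, not on the other atoms.
    """
    bits = 0
    for v in values:
        bits |= 1 << (v % nbits)
    return bits

def binary_tanimoto(bits1, bits2):
    """Ordinary binary Tanimoto on two integer bitsets."""
    either = bin(bits1 | bits2).count("1")
    return bin(bits1 & bits2).count("1") / either if either else 0.0

# Per-atom ECFP0 values copied from the output earlier in this email.
mol1_bits = hash_to_bits([1526808443])
mol3_bits = hash_to_bits([3405580958, 3405580958, 3756301279])
mol9_bits = hash_to_bits([3405580958, 3405580958] + [3756301279] * 7)

print(binary_tanimoto(mol1_bits, mol3_bits))  # 0.0
print(binary_tanimoto(mol3_bits, mol9_bits))  # 1.0
```

Note the trade-off: a binary fingerprint like this drops the counts, so mol3
and mol9 become identical, whereas the count-based Tanimoto gives 1/3.
Distinct values can also collide on the same bit when nbits is small.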