Hi Paolo,

Many thanks for the detailed explanation! Going by your statement "If
the invariants are provided by the user, they will be used instead", I
attempted to reproduce the default ECFP fingerprint for a small and a large
molecule. Here is the code:

import numpy as np
from rdkit import DataStructs
from rdkit.Chem import PeriodicTable, GetPeriodicTable, AllChem
from rdkit import Chem

def getNumpyArray(fp):
    arr = np.zeros((1,), np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def generateECFPAtomInvariant(mol, discrete_charges=False):
    num_atoms = mol.GetNumAtoms()
    invariants = [0]*num_atoms
    ring_info = mol.GetRingInfo()
    for i,a in enumerate(mol.GetAtoms()):
        descriptors=[]
        descriptors.append(a.GetAtomicNum())
        descriptors.append(a.GetTotalDegree())
        descriptors.append(a.GetTotalNumHs())
        descriptors.append(a.GetFormalCharge())
        descriptors.append(a.GetMass() -
            PeriodicTable.GetAtomicWeight(GetPeriodicTable(), a.GetSymbol()))
        descriptors.append(ring_info.NumAtomRings(i))
        invariants[i]=hash(tuple(descriptors))& 0xffffffff
    return invariants

for SMILES in ['Cc1ncccc1',
               'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2CCCCC2)c1']:
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    invariants = generateECFPAtomInvariant(mol)
    info, infoi = {}, {}
    fp = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=3, nBits=8192, invariants=[], bitInfo=info))
    fpi = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=3, nBits=8192, invariants=invariants, bitInfo=infoi))

    print("Do the substructures extracted by default invariants and "
          "user-defined invariants match?",
          set(info.values()) == set(infoi.values()))
    print("Number of mis-matching bits between fp and fpi=",
          fp.shape[0] - np.count_nonzero(np.equal(fp, fpi)))


To assess whether the fingerprints match, I compared the values of the
'bitInfo' dictionaries. The keys are hash codes, which of course do not
match, but the values are pairs of (atomID, radius), which should be the
same. As you will see, the (atomID, radius) pairs match for the small
molecule *but not for the large one*. I also compared the two bitstrings
directly, but I suppose that, due to the use of different (?) hash
functions, the bits match for neither the small nor the large molecule.
Moreover, when I apply the 'generateECFPAtomInvariant()' function on a
large scale, namely to generate fingerprints for training an ML model, the
number of invariant bits is 2360 with the default ECFP atom invariants but
much lower (795) with the user-defined invariants, and the performance of
the ML model is significantly different. Could someone point out what I am
doing wrong?
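
The bitInfo comparison described above can also be made independent of how
the (atomID, radius) pairs are grouped under each bit. The following is only
a minimal sketch, assuming that bitInfo has the usual
{bit: ((atomID, radius), ...)} shape. The helper names flatten_bit_info and
compare_environments are hypothetical and not part of RDKit, and the two
dictionaries are made-up toy data rather than output from the molecules
above.

```python
def flatten_bit_info(bit_info):
    """Collect every (atomID, radius) pair from a bitInfo-style dict."""
    return {pair for pairs in bit_info.values() for pair in pairs}


def compare_environments(info_a, info_b):
    """Compare two bitInfo-style dicts at the level of single environments."""
    env_a = flatten_bit_info(info_a)
    env_b = flatten_bit_info(info_b)
    return {
        "shared": env_a & env_b,   # environments found with both invariant sets
        "only_a": env_a - env_b,   # environments found only in info_a
        "only_b": env_b - env_a,   # environments found only in info_b
    }


# Toy data (an assumption for illustration): the same four environments,
# but in info_default two of them share one bit.
info_default = {101: ((0, 0),), 202: ((1, 0), (2, 0)), 303: ((1, 1),)}
info_custom = {7: ((0, 0),), 8: ((1, 0),), 9: ((2, 0),), 10: ((1, 1),)}

# Comparing the value tuples as a whole reports a mismatch ...
grouped_match = set(info_default.values()) == set(info_custom.values())

# ... while the flattened comparison shows that no environment is missing.
result = compare_environments(info_default, info_custom)
print("grouped values match:", grouped_match)
print("only in default:", result["only_a"])
print("only in custom:", result["only_b"])
```

With real bitInfo dictionaries, non-empty "only_a" or "only_b" sets would
list the environments that one set of invariants produces and the other
does not, whereas a mismatch in the grouped comparison alone may only mean
that the pairs are distributed over the bits differently.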

~Thomas

-- 

======================================================================

Dr. Thomas Evangelidis

Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>, Prague,
Czech Republic
  &
CEITEC - Central European Institute of Technology
<https://www.ceitec.eu/>, Brno,
Czech Republic

email: teva...@gmail.com, Twitter: tevangelidis
<https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
<https://www.linkedin.com/in/thomas-evangelidis-495b45125/>

website: https://sites.google.com/site/thomasevangelidishomepage/
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
