Hi Thomas,
I didn't know how invariants are computed off the top of my head, so I
just looked into the code.
If no invariants are provided by the user, the invariants are computed
by the getConnectivityInvariants() function (FingerprintUtil.cpp):
void getConnectivityInvariants(const ROMol &mol, std::vector<uint32_t>
&invars,
bool includeRingMembership) {
unsigned int nAtoms = mol.getNumAtoms();
PRECONDITION(invars.size() >= nAtoms, "vector too small");
gboost::hash<std::vector<uint32_t>> vectHasher;
for (unsigned int i = 0; i < nAtoms; ++i) {
Atom const *atom = mol.getAtomWithIdx(i);
std::vector<uint32_t> components;
components.push_back(atom->getAtomicNum());
components.push_back(atom->getTotalDegree());
components.push_back(atom->getTotalNumHs());
components.push_back(atom->getFormalCharge());
int deltaMass = static_cast<int>(
atom->getMass() -
PeriodicTable::getTable()->getAtomicWeight(atom->getAtomicNum()));
components.push_back(deltaMass);
if (includeRingMembership &&
atom->getOwningMol().getRingInfo()->numAtomRings(atom->getIdx())) {
components.push_back(1);
}
invars[i] = vectHasher(components);
}
}
For each atom, tha invariant has 5 components:
* atom number
* total degree
* total number of hydrogens
* formal charge
* mass difference between the mass of your isotope and the average
atomic weight of the element (this will be 0 if you are not using a
specific isotope)
On top of these 5 components, if includeRingMembership is true (the
default) and the atom is part of one or more rings, there will be a 6th
component equal to the number of rings that the atom belongs to.
This 5 (or 6)-component vector is then hashed.
If the invariants are provided by the user, they will be used instead.
Cheers,
p.
On 27/11/2019 15:30, Thomas Evangelidis wrote:
This is my 3rd attempt to get an explanation about how these
invariants work in the ECFP fingerprint cause I can't find it anywhere
in the documentation.
I tried the generateAtomInvariant() [see below] and the resulting ECFP
bit-vectors had for the same molecules drastically reduced
variance, 2360 variant bits without invariants versus 795 with the
invariants. Surprisingly, the performance of the ECFP with invariants
was better in this dataset in terms of affinity ranking. Can someone
please explain what happens when I pass invariants to the
AllChem.GetMorganFingerprint() function??? I hope that I will get an
answer this time.
def generateAtomInvariant(mol):
""" >>> generateAtomInvariant(Chem.MolFromSmiles("Cc1ncccc1"))
[341294046, 3184205312, 522345510, 1545984525, 1545984525,
1545984525, 1545984525] """ num_atoms = mol.GetNumAtoms()
invariants = [0]*num_atoms
for i,ain enumerate(mol.GetAtoms()):
descriptors=[]
descriptors.append(a.GetAtomicNum())
descriptors.append(a.GetTotalDegree())
descriptors.append(a.GetTotalNumHs())
descriptors.append(a.IsInRing())
descriptors.append(a.GetIsAromatic())
invariants[i]=hash(tuple(descriptors))&0xffffffff return
invariants
And then generate the fingerprint like this:
fp = AllChem.GetMorganFingerprint(mol, radius=3,
invariants=generateAtomInvariant(mol))
--
======================================================================
Dr. Thomas Evangelidis
Research Scientist
IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
Academy of Sciences
<https://www.uochb.cz/web/structure/31.html?lang=en>, Prague, Czech
Republic
&
CEITEC - Central European Institute of Technology
<https://www.ceitec.eu/>, Brno, Czech Republic
email: teva...@gmail.com <mailto:teva...@gmail.com>, Twitter:
tevangelidis <https://twitter.com/tevangelidis>, LinkedIn: Thomas
Evangelidis <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
website: https://sites.google.com/site/thomasevangelidishomepage/
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss