Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-12-02 Thread Thomas Evangelidis
On Mon, Dec 2, 2019 at 4:45 PM Greg Landrum wrote:

> [Adding the mailing list back on]
>

Oops, sorry about that.


> But if you add partial charges (a floating point number) then essentially
> every atom is going to end up with its own invariant. That's unlikely to
> end well.
>
>

I discretize them first. I do the same for every atomic property expressed
as a continuous variable.
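
A minimal sketch of what such a discretization could look like, using only the standard library. The bin edges and the helper name `charge_bin` are illustrative assumptions, not values from this thread; the integer bin index is what would be appended to the per-atom descriptor tuple before hashing.

```python
from bisect import bisect_right

# Hypothetical bin edges for partial charges; choose edges that suit your data.
CHARGE_EDGES = [-0.3, -0.1, 0.1, 0.3]

def charge_bin(charge, edges=CHARGE_EDGES):
    """Return the index of the bin that `charge` falls into (0 .. len(edges))."""
    return bisect_right(edges, charge)

charges = [-0.41, -0.05, 0.02, 0.27, 0.35]  # made-up example values
bins = [charge_bin(q) for q in charges]
print(bins)  # [0, 2, 2, 3, 4]
```

With explicit edges, atoms whose charges differ only slightly land in the same bin and can share an invariant; the number of bins controls how much the charges fragment the atom types.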


>
> I'm going to guess, and without info on which molecules are generating
> different numbers of bits that's all I can do, that this is a result of the
> different hashing schemes for the atom invariants. If you really want to
> track down what's going on, you'll have to figure out which molecules are
> different and share those.
>
>
I will get back to this thread in due time with more information about
the molecules that cause the discrepancies.

~Thomas
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-12-02 Thread Greg Landrum
[Adding the mailing list back on]

On Mon, Dec 2, 2019 at 3:23 PM Thomas Evangelidis  wrote:

>
> Thank you for the corrections and the explanation! As stated in my
> original email, I want to add extra atomic properties, like the partial
> charges, as atom invariants and assess whether they are beneficial in terms
> of performance.
>

But if you add partial charges (a floating point number) then essentially
every atom is going to end up with its own invariant. That's unlikely to
end well.
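
To illustrate the concern, here is a small sketch with made-up numbers (the atom tuples and charge values are assumptions, not data from this thread). When the raw floating point charge is part of the per-atom key, two molecules with the same atom types but slightly different charges share no atom keys at all:

```python
def atom_keys(atoms):
    """Build one hashable key per atom from (atomic number, degree, partial charge)."""
    return {(z, deg, q) for z, deg, q in atoms}

# Two made-up molecules: same atom types, charges differing in the third decimal.
mol_a = [(6, 4, -0.065), (8, 2, -0.396)]
mol_b = [(6, 4, -0.061), (8, 2, -0.392)]

shared_with_charge = atom_keys(mol_a) & atom_keys(mol_b)
shared_without = {(z, d) for z, d, _ in mol_a} & {(z, d) for z, d, _ in mol_b}
print(len(shared_with_charge), len(shared_without))  # 0 2
```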


> Therefore, I wanted first to re-implement the standard invariants in
> Python in order to have a reference point for comparison later.
>

Ok, that makes sense.


> Your corrections improved the overall agreement, but as you pointed out,
> it is not complete. Although with the two example molecules the
> substructures and hence the number of 'on' bits are the same, on a large
> scale, namely when generating fingerprints to train an ML model, the number
> of invariant bits is 2166 using the original ECFP atom invariants, while
> with user-defined invariants it is 2171. In terms of performance, the
> original is marginally better.
>

I'm going to guess, and without info on which molecules are generating
different numbers of bits that's all I can do, that this is a result of the
different hashing schemes for the atom invariants. If you really want to
track down what's going on, you'll have to figure out which molecules are
different and share those.
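
One way to narrow this down, sketched with made-up numbers: record the number of set bits per molecule for both invariant schemes and list the molecules where the counts disagree. The helper name `differing_molecules` and the counts are hypothetical; in practice each count could be taken as the length of the dictionary returned by GetNonzeroElements() on the fingerprint.

```python
def differing_molecules(counts_default, counts_custom):
    """Return the SMILES whose number of set bits differs between the two runs."""
    return sorted(smi for smi in counts_default
                  if counts_default[smi] != counts_custom.get(smi))

# Made-up bit counts keyed by SMILES, one dictionary per invariant scheme.
counts_default = {"CCO": 6, "c1ccccc1": 3, "CC(=O)O": 8}
counts_custom = {"CCO": 6, "c1ccccc1": 3, "CC(=O)O": 9}

suspects = differing_molecules(counts_default, counts_custom)
print(suspects)  # ['CC(=O)O']
```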


> Btw, why did you use
>
>     if(ring_info.NumAtomRings(i)):
>         descriptors.append(1)
>
> and not
>
>     descriptors.append(a.IsInRing())
>
> ? Your 2 lines do not add any value to the 'descriptors' list if the atom
> does not belong to a ring. Is this how it is in the original implementation?
>

Yeah, I just copied what's in the original implementation. The results
should be exactly the same.
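
A small sketch of why the two encodings are interchangeable, using made-up atom tuples rather than RDKit objects (the `partition` helper and the example atoms are assumptions). The hashed values differ between the two encodings, but both split the atoms into the same groups, which is what determines the fingerprint's structure, barring hash collisions:

```python
from collections import defaultdict

def partition(keys):
    """Group atom indices by key; the grouping matters, not the key values."""
    groups = defaultdict(list)
    for idx, key in enumerate(keys):
        groups[key].append(idx)
    return sorted(groups.values())

# (atomic number, total degree, in ring?) for five made-up atoms
atoms = [(6, 4, False), (6, 3, True), (6, 3, True), (7, 2, True), (6, 4, False)]

# Encoding 1: append 1 only for ring atoms. Encoding 2: always append the flag.
conditional = [(z, d) + ((1,) if ring else ()) for z, d, ring in atoms]
always = [(z, d, ring) for z, d, ring in atoms]

print(partition(conditional) == partition(always))  # True
```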

-greg



Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-12-02 Thread Greg Landrum
Hi Thomas,

I think you are making your life more complicated than it needs to be while
testing things. You're certainly making it harder for us to follow what
you're doing. Please try and keep questions as simple as possible. In the
case below, the calls to getNumpyArray() aren't relevant to your question
and just make things confusing for those of us who want to help.

What exactly are you trying to figure out here? Do you for some reason want
to re-implement the standard connectivity invariants yourself in Python? I
doubt it, but in case you did want to do this, you've made a mistake in
your implementation. The invariants use whether or not an atom is in a
ring, not the number of rings it's in. You're also not using the periodic
table object efficiently. Here's an updated version:

from rdkit import Chem

def generateECFPAtomInvariant(mol, discrete_charges=False):
    pt = Chem.GetPeriodicTable()  # fetch the periodic table once, not once per atom
    num_atoms = mol.GetNumAtoms()
    invariants = [0]*num_atoms
    ring_info = mol.GetRingInfo()
    for i,a in enumerate(mol.GetAtoms()):
        descriptors=[]
        descriptors.append(a.GetAtomicNum())
        descriptors.append(a.GetTotalDegree())
        descriptors.append(a.GetTotalNumHs())
        descriptors.append(a.GetFormalCharge())
        descriptors.append(a.GetMass() - pt.GetAtomicWeight(a.GetSymbol()))
        # ring membership only: in a ring or not, regardless of how many rings
        if(ring_info.NumAtomRings(i)):
            descriptors.append(1)
        # the archive shows a bare "0x" here; a 32-bit mask is an assumed restoration
        invariants[i]=hash(tuple(descriptors))& 0xffffffff
    return invariants

To compare things I would keep it as simple as possible and do something
like this:

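from rdkit.Chem import rdMolDescriptors

# NOTE: the mail archive dropped characters from both SMILES. The first is
# restored as 2-methylpyridine (an assumption); the second is left as received
# and will not parse until its last ring ('N2C2') is repaired.
for SMILES in ['Cc1ncccc1',
               'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2C2)c1']:
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    invariants = generateECFPAtomInvariant(mol)
    bi1={}
    bi2={}
    fp1 = rdMolDescriptors.GetMorganFingerprint(mol,radius=3,bitInfo=bi1)
    fp2 = rdMolDescriptors.GetMorganFingerprint(mol,radius=3,invariants=invariants,bitInfo=bi2)
    print("")
    print(SMILES)
    nz1 = fp1.GetNonzeroElements()
    nz2 = fp2.GetNonzeroElements()
    print(len(nz1),len(nz2))  # number of set bits in each fingerprint
    print(nz1==nz2)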


The first print function outputs the same result for both fingerprints, as
you'd hope.[1]
This, as you point out, will not generate the same bits since the
invariants are hashed to different values, so the last print function
outputs "False". There's also no good way to compare the two bitinfo
structures to each other: since the bit IDs are different, you have no way
of knowing which entries to compare to which. So, though each of the two
dictionaries should contain the same values (not keys), they will be in a
different order. If you just want to check that the two sets of values are
the same (which isn't a lot more informative than comparing the number of
set bits), you could do:

    print(sorted(bi1.values())==sorted(bi2.values()))
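
As a toy illustration of that comparison, here are two hand-written dictionaries in the same shape as bitInfo (bit ID mapped to a tuple of (atom index, radius) pairs). The bit IDs and entries are made up; the point is only that the keys disagree while the sorted values agree:

```python
# Toy bitInfo-style dictionaries: bit ID -> tuple of (atom index, radius) pairs.
# The bit IDs are invented; only the values are comparable between the two runs.
bi1 = {101: ((0, 0),), 202: ((1, 0), (2, 0)), 303: ((1, 1),)}
bi2 = {909: ((1, 1),), 808: ((0, 0),), 707: ((1, 0), (2, 0))}

same_keys = bi1.keys() == bi2.keys()
same_values = sorted(bi1.values()) == sorted(bi2.values())
print(same_keys, same_values)  # False True
```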

Again, I don't think you actually want to do any of this. What exactly are
you trying to accomplish?

-greg

[1] This isn't strictly guaranteed. Due to true hash collisions while
generating the fingerprint, there is a chance that you'll get different
numbers of bits being set.
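
A deliberately exaggerated sketch of the effect described in [1]. The environment identifiers and the tiny 16-slot ID space are assumptions chosen to make a collision visible; with real 32-bit identifiers collisions are rare, but the mechanism is the same: two distinct environments that map to one ID count as a single set bit.

```python
def count_bits(environments, id_space):
    """Number of distinct IDs after mapping each environment into `id_space` slots."""
    return len({env % id_space for env in environments})

environments = [3, 19, 42, 77]  # four distinct made-up environment identifiers

large = count_bits(environments, 2**32)  # no collisions in a large ID space
small = count_bits(environments, 16)     # 3 and 19 collide, since 19 % 16 == 3
print(large, small)  # 4 3
```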


Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-12-01 Thread Thomas Evangelidis
Hi Paolo,

Many thanks for the detailed explanation! Standing by your statement "If
the invariants are provided by the user, they will be used instead", I
attempted to reproduce the default ECFP fingerprint for a small and a large
molecule. Here is the code:

import numpy as np
from rdkit import DataStructs
from rdkit.Chem import PeriodicTable, GetPeriodicTable, AllChem
from rdkit import Chem

def getNumpyArray(fp):
    arr = np.zeros((1,), np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def generateECFPAtomInvariant(mol, discrete_charges=False):
    num_atoms = mol.GetNumAtoms()
    invariants = [0]*num_atoms
    ring_info = mol.GetRingInfo()
    for i,a in enumerate(mol.GetAtoms()):
        descriptors=[]
        descriptors.append(a.GetAtomicNum())
        descriptors.append(a.GetTotalDegree())
        descriptors.append(a.GetTotalNumHs())
        descriptors.append(a.GetFormalCharge())
        descriptors.append(a.GetMass() -
            PeriodicTable.GetAtomicWeight(GetPeriodicTable(), a.GetSymbol()))
        descriptors.append(ring_info.NumAtomRings(i))
        # the archive shows a bare "0x" here; a 32-bit mask is an assumed restoration
        invariants[i]=hash(tuple(descriptors))& 0xffffffff
    return invariants

# NOTE: the mail archive dropped characters from both SMILES. The first is
# restored as 2-methylpyridine (an assumption); the second is left as received.
for SMILES in ['Cc1ncccc1',
               'CS(=O)(=O)N1CCc2c(C1)c(nn2CCCN1CCOCC1)c1ccc(Cl)c(C#Cc2ccc3C[C@H](NCc3c2)C(=O)N2C2)c1']:
    mol = Chem.MolFromSmiles(SMILES)
    mol = Chem.AddHs(mol)
    invariants = generateECFPAtomInvariant(mol)
    info, infoi = {}, {}
    fp = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol,
        radius=3, nBits=8192, invariants=[], bitInfo=info))
    fpi = getNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol,
        radius=3, nBits=8192, invariants=invariants, bitInfo=infoi))

    print("Do the substructures extracted by default invariants and "
          "user-defined invariants match?",
          set(info.values()) == set(infoi.values()))
    print("Number of mis-matching bits between fp and fpi=",
          fp.shape[0] - np.count_nonzero(np.equal(fp, fpi)))


To assess whether the fingerprints match, I compared the values of the
'bitInfo' dictionaries. The keys are hash codes, which of course do not
match, but the values are pairs of (atomID, radius), which should be the
same. As you will see, the (atomID, radius) pairs match for the small
molecule *but not for the large one*. I also compared the two bitstrings
themselves, but I suppose that, due to the use of different (?) hash
functions, the bits match neither for the small nor for the large molecule.
Moreover, when I apply the 'generateECFPAtomInvariant()' function on a large
scale, namely when generating fingerprints to train an ML model, the number
of invariant bits is 2360 using the default ECFP atom invariants, while with
user-defined invariants there are far fewer (795), and the performance of
the ML model is significantly different. Could someone point out what I am
doing wrong?

~Thomas

-- 

Dr. Thomas Evangelidis
Research Scientist

IOCB - Institute of Organic Chemistry and Biochemistry of the Czech Academy
of Sciences, Prague, Czech Republic
  &
CEITEC - Central European Institute of Technology, Brno, Czech Republic

email: teva...@gmail.com, Twitter: tevangelidis, LinkedIn: Thomas Evangelidis

website: https://sites.google.com/site/thomasevangelidishomepage/


Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-11-27 Thread Shojiro Shibayama
Dear Thomas,

You can get the SMILES of the substructures extracted by the
`GetMorganFingerprint` function as follows. You can then append any labels
to the SMILES string, but not real numbers.

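```python
from rdkit import Chem
from rdkit.Chem import AllChem

# SMILES restored as 2-methylpyridine (an assumption); the archive copy
# ('Cc1n1') had lost its ring atoms.
mol = Chem.MolFromSmiles('Cc1ncccc1')
info = {}
AllChem.GetMorganFingerprint(mol, radius=2, bitInfo=info)
# each bitInfo value is a tuple of (atom_id, radius) pairs;
# take the first pair of the first entry and reverse it
radius, atom_id = list(info.values())[0][0][::-1]
env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_id)
sub_struct = Chem.PathToSubmol(mol, env)
type(sub_struct) #=> rdkit.Chem.rdchem.Mol
Chem.MolToSmiles(sub_struct) #=>  'ccc'
```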

Best,

On Fri, 22 Nov 2019 at 23:40, Thomas Evangelidis  wrote:

> Greetings,
>
> Could someone please clarify how I can pass atomic partial charges to the
> ECFP fingerprint generator along with the default atomic properties that it
> considers? Can I pass the real charge values, or do I have to group them
> into bins and pass the bin identifier? I found a function in the utilsFP.py
> file which generates invariants as follows:
>
> # The SMILES and the hex mask were truncated in the archive copy;
> # both are assumed restorations.
> def generateAtomInvariant(mol):
>     """
>     >>> generateAtomInvariant(Chem.MolFromSmiles("Cc1ncccc1"))
>     [341294046, 3184205312, 522345510, 1545984525, 1545984525, 1545984525, 1545984525]
>     """
>     num_atoms = mol.GetNumAtoms()
>     invariants = [0]*num_atoms
>     for i,a in enumerate(mol.GetAtoms()):
>         descriptors=[]
>         descriptors.append(a.GetAtomicNum())
>         descriptors.append(a.GetTotalDegree())
>         descriptors.append(a.GetTotalNumHs())
>         descriptors.append(a.IsInRing())
>         descriptors.append(a.GetIsAromatic())
>         invariants[i]=hash(tuple(descriptors))& 0xffffffff
>     return invariants
>
>
> And then generate the fingerprint like this:
>
>     fp = AllChem.GetMorganFingerprint(mol, radius=3, invariants=generateAtomInvariant(mol))
>
> Would it suffice to just add this extra line to the generateAtomInvariant() function?
>
>     descriptors.append(a.GetFormalCharge())
>
>
>
> I thank you in advance.
> Thomas


-- 

The University of Tokyo
2nd year Ph.D. candidate
  Shojiro Shibayama



Re: [Rdkit-discuss] How to pass atomic charge as atom invariant to ECFP?

2019-11-27 Thread Thomas Evangelidis
This is my 3rd attempt to get an explanation of how these invariants
work in the ECFP fingerprint, because I can't find it anywhere in the
documentation.
I tried generateAtomInvariant() [see below] and, for the same molecules, the
resulting ECFP bit-vectors had drastically reduced variance: 2360 variant
bits without invariants versus 795 with the invariants.
Surprisingly, the performance of the ECFP with invariants was better on
this dataset in terms of affinity ranking. Can someone please explain what
happens when I pass invariants to the AllChem.GetMorganFingerprint()
function? I hope that I will get an answer this time.


>> # The SMILES and the hex mask were truncated in the archive copy;
>> # both are assumed restorations.
>> def generateAtomInvariant(mol):
>>     """
>>     >>> generateAtomInvariant(Chem.MolFromSmiles("Cc1ncccc1"))
>>     [341294046, 3184205312, 522345510, 1545984525, 1545984525, 1545984525, 1545984525]
>>     """
>>     num_atoms = mol.GetNumAtoms()
>>     invariants = [0]*num_atoms
>>     for i,a in enumerate(mol.GetAtoms()):
>>         descriptors=[]
>>         descriptors.append(a.GetAtomicNum())
>>         descriptors.append(a.GetTotalDegree())
>>         descriptors.append(a.GetTotalNumHs())
>>         descriptors.append(a.IsInRing())
>>         descriptors.append(a.GetIsAromatic())
>>         invariants[i]=hash(tuple(descriptors))& 0xffffffff
>>     return invariants
>>
>> And then generate the fingerprint like this:
>>
>>     fp = AllChem.GetMorganFingerprint(mol, radius=3, invariants=generateAtomInvariant(mol))
