Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered aromatics

2010-07-20 Thread James Davidson
Dear All,
 
It's been a couple of weeks since Greg first helped me with this, and
after some further help I agreed that I would do my best to summarise
things for the benefit of the Group.
 
The attached file 'sanifix3.py' was provided to me by Greg, and
essentially does exactly what I (thought I) wanted - ie if required,
'cleans' up an input molecule by modifying aromatic nitrogen-containing
ring systems until a 'sanitizable' form is generated.
 
However, having tested this a bit further, I found that N-containing
heteroaromatics (which I originally posted the question about) are only
one of many possible issues when dealing with automated atom- and
bond-typing from PDB files!  So taking this approach would require a
significantly larger set of 'rules' to cover all possible problems (I'm
sure many people more experienced than me will have been aware of this
for a long time!).  As Greg said:
 
 Figuring out the correct chemistry for a pdb ligand is one of 
 those challenges at I wouldn't dream of attempting. Between 
 the various sources of ligand structures out there you can 
 probably find omsething at least halfway acceptable. For in 
 house stuff, I would assume that you can use the registry 
 number to get a smiles or mol block, right?
 You could use that with the rdkit substructure matching code 
 to test the pymol-assigned structures.

And indeed, this is the way that I ended-up going for in-house
structures - a script that extracts our corporate ID from the PDB file
and searches our database to return the SMILES.  Then (again, thanks to
Greg for more help here, and steering me away from some clumsy usage of
ConstrainedEmbed!) a substructure match is conducted between an RDKit
mol from the SMILES (refered to as 'db_mol' in the function below), and
the original ligand.

The main point here is to convert the original ligand structure to a set
of non-aromatic atoms joined by 'unspecified' bond-types.  Below is the
excerpt from what I am using with PyMOL: 'molfile3D' is a temporary
molfile that has been created using the PyMOL 'save' command, that gets
converted to the required 'connectivity substructure' that carries the
3D coordinates we will need later:


def make3DTemplate(molfile3D):

mol = Chem.MolFromMolFile(molfile3D, False)
for atom in mol.GetAtoms():
atom.SetIsAromatic(False)
for bond in mol.GetBonds():
bond.SetBondType(rdkit.Chem.rdchem.BondType.UNSPECIFIED)

return mol


Then once we have this '3D template', the substructure match can be
conducted for the molecule built from the database SMILES string
(db_mol).  If the match is successful, the original 3D coordinates for
the atoms in the 'template' are then applied back to a conformer of our
new molecule.  Finally, this new molecule + conformation is returned as
the molblock, which I then read back in PyMOL to give a 'sanitized'
version of the bound ligand for any in-house crystal structure:


def outputMolBlock(db_mol, template_mol):

matches = db_mol.GetSubstructMatches(template_mol)
if not matches:
raise ValueError,no substruct match
if len(matches)1:
print warning! more than one isomorphism found!

db_conf = db_mol.GetConformer()
template_conf = template_mol.GetConformer()

match = matches[0]

# This sets the 3D coordinates for 
for i,mIdx in enumerate(match):
db_conf.SetAtomPosition(mIdx,
template_conf.GetAtomPosition(i))

db_conf.Set3D(True)

return Chem.MolToMolBlock(db_mol)


It wouldn't now be too much of a leap(?) to extend the same methodology
to public PDB structures - using the LigandExpo SDF.  See this post from
Noel on Blue Obelisk for background:

http://blueobelisk.shapado.com/questions/how-to-get-an-experimental-liga
nd-structure-from-the-pdb


Also, just for interest - I am using cx_Oracle to connect to our
corporate database from Python, which is now allowing me to add a few
extra bits - like flagging up to people if the in-house structure they
have just opened has been previously crystallised in any other targets,
etc, etc.  If anybody is trying to do similar, but has not used
cx_Oracle, then give me a shout and I will see if I can help (although
SQL is definitely also on the list of things I know only barely enough
about!).

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for 

Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered aromatics

2010-07-20 Thread nikolaus . stiefl
Thanks James,

that is really nice ... I was doing something similar but never got to the 
point that it was working in rdkit.

Maybe a word of warning for the pdb structures - I looked at the 
components.cif file which contains OEChem and Cactus structures and some 
(not a small nnumber) are actually incorrect. I am not sure about the 
ligandexpo things but I guess it will be similar. If you have some look 
through it it would be interesting to get some insight there as well ;-)

Cheers
Nik




James Davidson j.david...@vernalis.com 
07/20/2010 08:57 PM

To
rdkit-discuss@lists.sourceforge.net
cc

Subject
Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered 
aromatics






Dear All,
 
It's been a couple of weeks since Greg first helped me with this, and
after some further help I agreed that I would do my best to summarise
things for the benefit of the Group.
 
The attached file 'sanifix3.py' was provided to me by Greg, and
essentially does exactly what I (thought I) wanted - ie if required,
'cleans' up an input molecule by modifying aromatic nitrogen-containing
ring systems until a 'sanitizable' form is generated.
 
However, having tested this a bit further, I found that N-containing
heteroaromatics (which I originally posted the question about) are only
one of many possible issues when dealing with automated atom- and
bond-typing from PDB files!  So taking this approach would require a
significantly larger set of 'rules' to cover all possible problems (I'm
sure many people more experienced than me will have been aware of this
for a long time!).  As Greg said:
 
 Figuring out the correct chemistry for a pdb ligand is one of 
 those challenges at I wouldn't dream of attempting. Between 
 the various sources of ligand structures out there you can 
 probably find omsething at least halfway acceptable. For in 
 house stuff, I would assume that you can use the registry 
 number to get a smiles or mol block, right?
 You could use that with the rdkit substructure matching code 
 to test the pymol-assigned structures.

And indeed, this is the way that I ended-up going for in-house
structures - a script that extracts our corporate ID from the PDB file
and searches our database to return the SMILES.  Then (again, thanks to
Greg for more help here, and steering me away from some clumsy usage of
ConstrainedEmbed!) a substructure match is conducted between an RDKit
mol from the SMILES (refered to as 'db_mol' in the function below), and
the original ligand.

The main point here is to convert the original ligand structure to a set
of non-aromatic atoms joined by 'unspecified' bond-types.  Below is the
excerpt from what I am using with PyMOL: 'molfile3D' is a temporary
molfile that has been created using the PyMOL 'save' command, that gets
converted to the required 'connectivity substructure' that carries the
3D coordinates we will need later:


def make3DTemplate(molfile3D):

 mol = Chem.MolFromMolFile(molfile3D, False)
 for atom in mol.GetAtoms():
 atom.SetIsAromatic(False)
 for bond in mol.GetBonds():
 bond.SetBondType(rdkit.Chem.rdchem.BondType.UNSPECIFIED)

 return mol


Then once we have this '3D template', the substructure match can be
conducted for the molecule built from the database SMILES string
(db_mol).  If the match is successful, the original 3D coordinates for
the atoms in the 'template' are then applied back to a conformer of our
new molecule.  Finally, this new molecule + conformation is returned as
the molblock, which I then read back in PyMOL to give a 'sanitized'
version of the bound ligand for any in-house crystal structure:


def outputMolBlock(db_mol, template_mol):

 matches = db_mol.GetSubstructMatches(template_mol)
 if not matches:
 raise ValueError,no substruct match
 if len(matches)1:
 print warning! more than one isomorphism 
found!

 db_conf = db_mol.GetConformer()
 template_conf = template_mol.GetConformer()

 match = matches[0]

 # This sets the 3D coordinates for 
 for i,mIdx in enumerate(match):
 db_conf.SetAtomPosition(mIdx,
template_conf.GetAtomPosition(i))
 
 db_conf.Set3D(True)

 return Chem.MolToMolBlock(db_mol)


It wouldn't now be too much of a leap(?) to extend the same methodology
to public PDB structures - using the LigandExpo SDF.  See this post from
Noel on Blue Obelisk for background:

http://blueobelisk.shapado.com/questions/how-to-get-an-experimental-liga
nd-structure-from-the-pdb


Also, just for interest - I am using cx_Oracle to connect to our
corporate database from Python, which is now allowing me to add a few
extra bits - like flagging up to people if the in-house structure they
have just opened has been 

Re: [Rdkit-discuss] ring in RDKIT many question

2010-07-20 Thread Greg Landrum
Hi Cedric,

Nik already made a couple of good points; I just have one thing to
add: it's relatively easy with the RDKit to produce an unhappy (in
the sense that it cannot be sanitized) molecule by the process of
removing non-ring atoms. This makes what you're doing more difficult.
This is a particular problem with aromatic nitrogens.

As an example:

[11] m = Chem.MolFromSmiles('c1cccn1C')
[12] nm = Chem.DeleteSubstructs(m,Chem.MolFromSmarts('[!r]'))
[13] smi = Chem.MolToSmiles(m)
[14] smi = Chem.MolToSmiles(nm)
[15] smi
Out[15] 'c1ccnc1'
[16] m2 = Chem.MolFromSmiles(smi)
[06:51:45] Can't kekulize mol
[17] mb = Chem.MolToMolBlock(nm)
[06:51:55] Can't kekulize mol

---
ValueErrorTraceback (most recent call last)

/home/glandrum/RDKit/ipython console in module()

ValueError: Sanitization error: Can't kekulize mol

#

this may account for some of the problems you are seeing.

Note that the recent post from James Davidson had an example
attachment that tries to fix this type of problem; you might want
either re-think how you are removing the non-ring atoms or work with
the script in James' email to cleanup the problems you are
encountering.

Best Regards,
-greg

--
This SF.net email is sponsored by Sprint
What will you do first with EVO, the first 4G phone?
Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss