Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered aromatics
Dear All, It's been a couple of weeks since Greg first helped me with this, and after some further help I agreed that I would do my best to summarise things for the benefit of the Group. The attached file 'sanifix3.py' was provided to me by Greg, and essentially does exactly what I (thought I) wanted - ie if required, 'cleans' up an input molecule by modifying aromatic nitrogen-containing ring systems until a 'sanitizable' form is generated. However, having tested this a bit further, I found that N-containing heteroaromatics (which I originally posted the question about) are only one of many possible issues when dealing with automated atom- and bond-typing from PDB files! So taking this approach would require a significantly larger set of 'rules' to cover all possible problems (I'm sure many people more experienced than me will have been aware of this for a long time!). As Greg said: Figuring out the correct chemistry for a pdb ligand is one of those challenges at I wouldn't dream of attempting. Between the various sources of ligand structures out there you can probably find omsething at least halfway acceptable. For in house stuff, I would assume that you can use the registry number to get a smiles or mol block, right? You could use that with the rdkit substructure matching code to test the pymol-assigned structures. And indeed, this is the way that I ended-up going for in-house structures - a script that extracts our corporate ID from the PDB file and searches our database to return the SMILES. Then (again, thanks to Greg for more help here, and steering me away from some clumsy usage of ConstrainedEmbed!) a substructure match is conducted between an RDKit mol from the SMILES (refered to as 'db_mol' in the function below), and the original ligand. The main point here is to convert the original ligand structure to a set of non-aromatic atoms joined by 'unspecified' bond-types. Below is the excerpt from what I am using with PyMOL: 'molfile3D' is a temporary molfile that has been created using the PyMOL 'save' command, that gets converted to the required 'connectivity substructure' that carries the 3D coordinates we will need later: def make3DTemplate(molfile3D): mol = Chem.MolFromMolFile(molfile3D, False) for atom in mol.GetAtoms(): atom.SetIsAromatic(False) for bond in mol.GetBonds(): bond.SetBondType(rdkit.Chem.rdchem.BondType.UNSPECIFIED) return mol Then once we have this '3D template', the substructure match can be conducted for the molecule built from the database SMILES string (db_mol). If the match is successful, the original 3D coordinates for the atoms in the 'template' are then applied back to a conformer of our new molecule. Finally, this new molecule + conformation is returned as the molblock, which I then read back in PyMOL to give a 'sanitized' version of the bound ligand for any in-house crystal structure: def outputMolBlock(db_mol, template_mol): matches = db_mol.GetSubstructMatches(template_mol) if not matches: raise ValueError,no substruct match if len(matches)1: print warning! more than one isomorphism found! db_conf = db_mol.GetConformer() template_conf = template_mol.GetConformer() match = matches[0] # This sets the 3D coordinates for for i,mIdx in enumerate(match): db_conf.SetAtomPosition(mIdx, template_conf.GetAtomPosition(i)) db_conf.Set3D(True) return Chem.MolToMolBlock(db_mol) It wouldn't now be too much of a leap(?) to extend the same methodology to public PDB structures - using the LigandExpo SDF. See this post from Noel on Blue Obelisk for background: http://blueobelisk.shapado.com/questions/how-to-get-an-experimental-liga nd-structure-from-the-pdb Also, just for interest - I am using cx_Oracle to connect to our corporate database from Python, which is now allowing me to add a few extra bits - like flagging up to people if the in-house structure they have just opened has been previously crystallised in any other targets, etc, etc. If anybody is trying to do similar, but has not used cx_Oracle, then give me a shout and I will see if I can help (although SQL is definitely also on the list of things I know only barely enough about!). Kind regards James __ PLEASE READ: This email is confidential and may be privileged. It is intended for the named addressee(s) only and access to it by anyone else is unauthorised. If you are not an addressee, any disclosure or copying of the contents of this email or any action taken (or not taken) in reliance on it is unauthorised and may be unlawful. If you have received this email in error, please notify the sender or postmas...@vernalis.com. Email is not a secure method of communication and the Company cannot accept responsibility for
Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered aromatics
Thanks James, that is really nice ... I was doing something similar but never got to the point that it was working in rdkit. Maybe a word of warning for the pdb structures - I looked at the components.cif file which contains OEChem and Cactus structures and some (not a small nnumber) are actually incorrect. I am not sure about the ligandexpo things but I guess it will be similar. If you have some look through it it would be interesting to get some insight there as well ;-) Cheers Nik James Davidson j.david...@vernalis.com 07/20/2010 08:57 PM To rdkit-discuss@lists.sourceforge.net cc Subject Re: [Rdkit-discuss] Reading Molfiles with \'ambiguous\' 5-membered aromatics Dear All, It's been a couple of weeks since Greg first helped me with this, and after some further help I agreed that I would do my best to summarise things for the benefit of the Group. The attached file 'sanifix3.py' was provided to me by Greg, and essentially does exactly what I (thought I) wanted - ie if required, 'cleans' up an input molecule by modifying aromatic nitrogen-containing ring systems until a 'sanitizable' form is generated. However, having tested this a bit further, I found that N-containing heteroaromatics (which I originally posted the question about) are only one of many possible issues when dealing with automated atom- and bond-typing from PDB files! So taking this approach would require a significantly larger set of 'rules' to cover all possible problems (I'm sure many people more experienced than me will have been aware of this for a long time!). As Greg said: Figuring out the correct chemistry for a pdb ligand is one of those challenges at I wouldn't dream of attempting. Between the various sources of ligand structures out there you can probably find omsething at least halfway acceptable. For in house stuff, I would assume that you can use the registry number to get a smiles or mol block, right? You could use that with the rdkit substructure matching code to test the pymol-assigned structures. And indeed, this is the way that I ended-up going for in-house structures - a script that extracts our corporate ID from the PDB file and searches our database to return the SMILES. Then (again, thanks to Greg for more help here, and steering me away from some clumsy usage of ConstrainedEmbed!) a substructure match is conducted between an RDKit mol from the SMILES (refered to as 'db_mol' in the function below), and the original ligand. The main point here is to convert the original ligand structure to a set of non-aromatic atoms joined by 'unspecified' bond-types. Below is the excerpt from what I am using with PyMOL: 'molfile3D' is a temporary molfile that has been created using the PyMOL 'save' command, that gets converted to the required 'connectivity substructure' that carries the 3D coordinates we will need later: def make3DTemplate(molfile3D): mol = Chem.MolFromMolFile(molfile3D, False) for atom in mol.GetAtoms(): atom.SetIsAromatic(False) for bond in mol.GetBonds(): bond.SetBondType(rdkit.Chem.rdchem.BondType.UNSPECIFIED) return mol Then once we have this '3D template', the substructure match can be conducted for the molecule built from the database SMILES string (db_mol). If the match is successful, the original 3D coordinates for the atoms in the 'template' are then applied back to a conformer of our new molecule. Finally, this new molecule + conformation is returned as the molblock, which I then read back in PyMOL to give a 'sanitized' version of the bound ligand for any in-house crystal structure: def outputMolBlock(db_mol, template_mol): matches = db_mol.GetSubstructMatches(template_mol) if not matches: raise ValueError,no substruct match if len(matches)1: print warning! more than one isomorphism found! db_conf = db_mol.GetConformer() template_conf = template_mol.GetConformer() match = matches[0] # This sets the 3D coordinates for for i,mIdx in enumerate(match): db_conf.SetAtomPosition(mIdx, template_conf.GetAtomPosition(i)) db_conf.Set3D(True) return Chem.MolToMolBlock(db_mol) It wouldn't now be too much of a leap(?) to extend the same methodology to public PDB structures - using the LigandExpo SDF. See this post from Noel on Blue Obelisk for background: http://blueobelisk.shapado.com/questions/how-to-get-an-experimental-liga nd-structure-from-the-pdb Also, just for interest - I am using cx_Oracle to connect to our corporate database from Python, which is now allowing me to add a few extra bits - like flagging up to people if the in-house structure they have just opened has been
Re: [Rdkit-discuss] Reading Molfiles with 'ambiguous' 5-membered aromatics
Dear James, On Mon, Jul 5, 2010 at 12:31 PM, James Davidson j.david...@vernalis.com wrote: I have been exploring some interactivity between PyMOL and RDKit recently, and at the moment am ferrying molecules between the two in MOL format. However, I have come up against a bit of a problem that I wondered if anyone could help with? PyMOL does a pretty good job of setting bond valences automatically for ligands read-in as part of PDB files, and most of the time these ligands exported in MOL format are recognised fine by RDKit (a little bit of parsing is necessary to change things like 'Cl' being capitalised in some PDB files, etc!). However, it seems that when there is ambiguity about how to tautomerise 5-membered heteroaromatics, RDKit fails to create a mol object from the molfile. I have included an example Molfile below (a pyrazolopyrimidine). Ideally (from my point of view at least!) it would be great if in these situations RDKit yielded an arbitrary explicit H to 'mend' the problem. However, I am definitely open to workaround suggestions (including go post on the PyMOL lists :-) ). Maybe this is something that is relatively trivial to tackle using PyMOL's ChemPy module? (which I know very little about!). As long as it's really ok that the tautomer you get is arbitrary, the following code snippet might help : # def AdjustAromaticNs(m): matches = [x[0] for x in m.GetSubstructMatches(Chem.MolFromSmarts('n'))] foundOne=False for idx in matches: nm = Chem.Mol(m.ToBinary()) nm.GetAtomWithIdx(idx).SetNoImplicit(True) nm.GetAtomWithIdx(idx).SetNumExplicitHs(1) try: Chem.SanitizeMol(nm) except: continue else: foundOne=True break if foundOne: return nm else: return None # Use it like this: #--- m = Chem.MolFromMolBlock(mb,False) try: Chem.SanitizeMol(m) except ValueError: nm=AdjustAromaticNs(m) if nm is not None: print Chem.MolToSmiles(nm) #- One could imagine making this more efficient or adding heuristics to try and find the right answer more efficiently, but this ought to at least get you started. -greg -- This SF.net email is sponsored by Sprint What will you do first with EVO, the first 4G phone? Visit sprint.com/first -- http://p.sf.net/sfu/sprint-com-first ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss