Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand (Dimitri Maziuk)
I had to do something similar on 40,000 PDB files ending up using PDB_tools https://pypi.org/project/pdb-tools/ Cheers, Chris ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
Hi IIllimar, The RDKit PDB reader only recognize standard amino acids and, after the PR I did on Saturday https://github.com/rdkit/rdkit/pull/2850 will be merged, nucleic acid bases. Anything else will not have the correct hybridization/bond orders perceived, as those are not encoded in the PDB format and the PDB reader does not have any functionality to do that. The 1ARJ case is peculiar, as it has an ARG residue which would be recognized if it were in the ATOM records, but not in the HETATM section, for which no attempt to perceive the correct hybridization/bond is made. My suggestion, if you are using standard PDB files, is to download the SDF file: https://www.rcsb.org/pdb/download/downloadLigandFiles.do?ligandIdList=A2F&structIdList=3GOT&instanceType=all&excludeUnobserved=false&includeHydrogens=false and construct your RDKit molecule from that. You should be able to automate that without too much effort either constructing URLs using the template above or using the PDB REST API. Cheers, p. On 16/12/2019 18:24, Illimar Hugo Rekand wrote: Thanks, Paolo, for a good and clear example. I adapted your code into my workflow to calculate some Lipinski-properties of RNA pdb-structures, and ran into some issues. I'm not sure if I should make a new thread or throw this onto this one I already created? I used the following code under from rdkit import Chem from rdkit.Chem import rdmolops, Lipinski from urllib.request import urlopen import gzip import pprint pp = pprint.PrettyPrinter(indent=4) Lipinski_dic = {'FractionCSP3':Lipinski.FractionCSP3, 'HeavyAtomCount':Lipinski.HeavyAtomCount, 'NHOHCount': Lipinski.NHOHCount, "NOCount":Lipinski.NOCount, "NumAliphaticCarbocycles": Lipinski.NumAliphaticCarbocycles, "NumAliphaticHeterocycles" : Lipinski.NumAliphaticHeterocycles, 'NumAliphaticRings' : Lipinski.NumAliphaticRings, 'NumAromaticCarbocycles' : Lipinski.NumAromaticCarbocycles, 'NumAromaticHeterocycles' : Lipinski.NumAromaticHeterocycles, 'NumAromaticRings' : Lipinski.NumAromaticRings, 'NumHAcceptors' : Lipinski.NumHAcceptors, 'NumHDonors' : Lipinski.NumHDonors, 'NumHeteroatoms' : Lipinski.NumHeteroatoms, 'NumRotatableBonds' : Lipinski.NumRotatableBonds, 'NumSaturatedCarbocycles' : Lipinski.NumSaturatedCarbocycles, 'NumSaturatedHeterocycles' : Lipinski.NumSaturatedHeterocycles, 'NumSaturatedRings' : Lipinski.NumSaturatedRings, 'RingCount' : Lipinski.RingCount } url = "https://files.rcsb.org/download/1arj.pdb.gz"; pdb_data = gzip.decompress(urlopen(url).read()) mol = Chem.RWMol(Chem.MolFromPDBBlock(pdb_data)) bonds_to_cleave = {(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds() if b.GetBeginAtom().GetPDBResidueInfo().GetIsHeteroAtom() ^ b.GetEndAtom().GetPDBResidueInfo().GetIsHeteroAtom()} [mol.RemoveBond(*b) for b in bonds_to_cleave] hetatm_frags = [f for f in rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True) if f.GetNumAtoms() and f.GetAtomWithIdx(0).GetPDBResidueInfo().GetIsHeteroAtom()] for hetatm in hetatm_frags: res_name = hetatm.GetAtomWithIdx(0).GetPDBResidueInfo().GetResidueName() calculated_props = {} for prop in Lipinski_dic: function = Lipinski_dic[prop] x = function(hetatm) calculated_props[prop] = x pp.pprint(calculated_props) and as you can see the properties of the ligand doesn't match up with what is expected (The number of SP3-atoms doesn't match up). When parsing through the structure 3got, it fails to recognize the aromatic rings of the ligand A2F. I'm assuming this is caused by RDKit not assigning bond orders correctly when reading in RNA and DNA pdb files (something which I have reported in earlier on this mailing list)? Running hetatm.UpdatePropertyCache(strict=True) does not remedy this problem. Is there a clever way I can fix this quickly without waiting for this to be fixed in a future version? Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen From: Illimar Hugo Rekand Sent: Monday, December 16, 2019 5:55:56 PM To: Paolo Tosco Subject: Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand Hey, Paolo, thanks for a good and clear example! all the best, Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen From: Paolo Tosco Sent: Monday, December 16, 2019 5:52:18 PM To: Illimar Hugo Rekand; rdkit-discuss@lists.sourceforge.net Subject: Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand Hi Illimar, this gist: https://gist.github.
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
On 12/16/19 10:35 AM, Illimar Hugo Rekand wrote: > Fair point. > > But when working in the 100s and 1000s range of PDB-files it would be nice to > have some fewer steps when designing a pipeline. But what's the selection criteria? NMR structures are usually deposited with 20 models, do you want the ligand from every one? Only from the representative one? There's at least one PDB ID (forget which) with 3 stable conformers, i.e. model 1 is not the representative structure. Structures annotated by PDB will have HETATM instead of ATOM for non-standards and ligands, but if your files haven't been processed by them, all bets are off. And so on -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu signature.asc Description: OpenPGP digital signature ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
Thanks, Paolo, for a good and clear example. I adapted your code into my workflow to calculate some Lipinski-properties of RNA pdb-structures, and ran into some issues. I'm not sure if I should make a new thread or throw this onto this one I already created? I used the following code under from rdkit import Chem from rdkit.Chem import rdmolops, Lipinski from urllib.request import urlopen import gzip import pprint pp = pprint.PrettyPrinter(indent=4) Lipinski_dic = {'FractionCSP3':Lipinski.FractionCSP3, 'HeavyAtomCount':Lipinski.HeavyAtomCount, 'NHOHCount': Lipinski.NHOHCount, "NOCount":Lipinski.NOCount, "NumAliphaticCarbocycles": Lipinski.NumAliphaticCarbocycles, "NumAliphaticHeterocycles" : Lipinski.NumAliphaticHeterocycles, 'NumAliphaticRings' : Lipinski.NumAliphaticRings, 'NumAromaticCarbocycles' : Lipinski.NumAromaticCarbocycles, 'NumAromaticHeterocycles' : Lipinski.NumAromaticHeterocycles, 'NumAromaticRings' : Lipinski.NumAromaticRings, 'NumHAcceptors' : Lipinski.NumHAcceptors, 'NumHDonors' : Lipinski.NumHDonors, 'NumHeteroatoms' : Lipinski.NumHeteroatoms, 'NumRotatableBonds' : Lipinski.NumRotatableBonds, 'NumSaturatedCarbocycles' : Lipinski.NumSaturatedCarbocycles, 'NumSaturatedHeterocycles' : Lipinski.NumSaturatedHeterocycles, 'NumSaturatedRings' : Lipinski.NumSaturatedRings, 'RingCount' : Lipinski.RingCount } url = "https://files.rcsb.org/download/1arj.pdb.gz"; pdb_data = gzip.decompress(urlopen(url).read()) mol = Chem.RWMol(Chem.MolFromPDBBlock(pdb_data)) bonds_to_cleave = {(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds() if b.GetBeginAtom().GetPDBResidueInfo().GetIsHeteroAtom() ^ b.GetEndAtom().GetPDBResidueInfo().GetIsHeteroAtom()} [mol.RemoveBond(*b) for b in bonds_to_cleave] hetatm_frags = [f for f in rdmolops.GetMolFrags(mol, asMols=True, sanitizeFrags=True) if f.GetNumAtoms() and f.GetAtomWithIdx(0).GetPDBResidueInfo().GetIsHeteroAtom()] for hetatm in hetatm_frags: res_name = hetatm.GetAtomWithIdx(0).GetPDBResidueInfo().GetResidueName() calculated_props = {} for prop in Lipinski_dic: function = Lipinski_dic[prop] x = function(hetatm) calculated_props[prop] = x pp.pprint(calculated_props) and as you can see the properties of the ligand doesn't match up with what is expected (The number of SP3-atoms doesn't match up). When parsing through the structure 3got, it fails to recognize the aromatic rings of the ligand A2F. I'm assuming this is caused by RDKit not assigning bond orders correctly when reading in RNA and DNA pdb files (something which I have reported in earlier on this mailing list)? Running hetatm.UpdatePropertyCache(strict=True) does not remedy this problem. Is there a clever way I can fix this quickly without waiting for this to be fixed in a future version? Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen From: Illimar Hugo Rekand Sent: Monday, December 16, 2019 5:55:56 PM To: Paolo Tosco Subject: Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand Hey, Paolo, thanks for a good and clear example! all the best, Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen From: Paolo Tosco Sent: Monday, December 16, 2019 5:52:18 PM To: Illimar Hugo Rekand; rdkit-discuss@lists.sourceforge.net Subject: Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand Hi Illimar, this gist: https://gist.github.com/ptosco/2ee199b219b27e01052a7a1433b3bd22 shows a way to achieve this. Cheers, p. On 16/12/2019 16:07, Illimar Hugo Rekand wrote: > Hello, everyone > > > Is there a simple way to create a mol object from just the HETATM/ligand > lines from a pdb-file? > > Would it be viable to create a function where you could create a mol object > from specific lines within a pdb-file? > > > Illimar Rekand > Ph.D. candidate, > Brenk-lab, Haug-lab > Department of Biomedicine > Department of Chemistry > University of Bergen > > > > ___ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
Hi Illimar, this gist: https://gist.github.com/ptosco/2ee199b219b27e01052a7a1433b3bd22 shows a way to achieve this. Cheers, p. On 16/12/2019 16:07, Illimar Hugo Rekand wrote: Hello, everyone Is there a simple way to create a mol object from just the HETATM/ligand lines from a pdb-file? Would it be viable to create a function where you could create a mol object from specific lines within a pdb-file? Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
Fair point. But when working in the 100s and 1000s range of PDB-files it would be nice to have some fewer steps when designing a pipeline. Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen From: Dimitri Maziuk via Rdkit-discuss Sent: Monday, December 16, 2019 5:24:49 PM To: rdkit-discuss@lists.sourceforge.net Subject: Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand On 12/16/2019 10:07 AM, Illimar Hugo Rekand wrote: > Would it be viable to create a function where you could create a mol object > from specific lines within a pdb-file? PDB file is simple text. There's any number of utilities to extract the lines you want, incl. a plain text editor, why spend time on reinventing the wheel? Dima ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Constructing a mol object from a PDB ligand
On 12/16/2019 10:07 AM, Illimar Hugo Rekand wrote: Would it be viable to create a function where you could create a mol object from specific lines within a pdb-file? PDB file is simple text. There's any number of utilities to extract the lines you want, incl. a plain text editor, why spend time on reinventing the wheel? Dima ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines
Hi Thomas, First it is important to compare equivalent major versions to each other. Particularly in this case. On my linux box generating the pattern fingerprints takes 24.2 seconds with v2019.03.x and 15.9 seconds with v2019.09.x (that's due to the improvements in the substructure matcher that the blog post you link to discusses). Comparing the same versions to each other: Performance on windows vs linux Windows performance with the RDKit has always lagged behind linux performance. There's something in the code (or in the way we use the compiler) that leads to big differences on some benchmarks. The most straightforward way I can demonstrate this is with results from my windows 10 laptop. Here's the output when running the fingerprint_screenout.py benchmark using the windows build: | 2019.09.1 | 13.6 | 0.3 | 38.1 | 0.8 | 25.5 | 25.9 | 84.1 | and here's the output from a linux build running on the Windows Linux Subsystem: | 2019.09.2 | 10.7 | 0.2 | 19.3 | 0.4 | 19.4 | 19.2 | 53.2 | You can see the differences are not small. I haven't invested massive time into it, but I haven't been able to figure out what causes this. Performance on (linux) VMs I can't think of any particular reason why there should be huge differences and it's really difficult to compare apples to apples here. Since I have the numbers, here's one comparison Here's a run on my linux workstation: | 2019.09.2 | 7.6 | 0.3 | 15.9 | 0.4 | 21.4 | 20.4 | 55.7 | and here's the same thing on an AWS t3.xlarge instance: | 2019.09.2 | 9.6 | 0.2 | 20.3 | 0.4 | 38.4 | 38.2 | 94.7 | The VM is significantly slower, but t3.xlarge an instance type that's intended to be used for compute intensive jobs (I don't have on of those active and configured at the moment). Does that help at all? -greg On Mon, Dec 16, 2019 at 8:27 AM Thomas Strunz wrote: > Hi All, > > I was looking at a blog post from greg: > > > https://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html > > about fingerprint screenout. The part that got me confused was the timings > in his blog post because run times in my case where a lot slower. > > Gregs numbers: > > [07:21:19] INFO: mols from smiles > [07:21:27] INFO: Results1: 7.77 seconds, 5 mols > [07:21:27] INFO: queries from smiles > [07:21:27] INFO: Results2: 0.16 seconds*[07:21:27] INFO: generating pattern > fingerprints for mols > [07:21:43] INFO: Results3: 16.11 seconds* > [07:21:43] INFO: generating pattern fingerprints for queries > [07:21:43] INFO: Results4: 0.34 seconds > [07:21:43] INFO: testing frags queries > [07:22:03] INFO: Results5: 19.90 seconds. 6753 tested (0.0003 of total), > 3989 found, 0.59 accuracy. 0 errors. > [07:22:03] INFO: testing leads queries > [07:22:23] INFO: Results6: 19.77 seconds. 1586 tested (0.0001 of total), > 1067 found, 0.67 accuracy. 0 errors. > [07:22:23] INFO: testing pieces queries > [07:23:19] INFO: Results7: 55.37 seconds. 202 tested (0.0810 of total), > 1925628 found, 0.58 accuracy. 0 errors. > > | 2019.09.1dev1 | 7.8 | 0.2 | 16.1 | 0.3 | 19.9 | 19.8 | 55.4 | > > > > > *Machine 1:* > Virtual machine, Windows Server 2012 R2 with an intel xeon (4 virtual > cores) > > Since the test is single-threaded it makes a bit of sense that it isn't > fast here but it's not just a bit slower, but a lot slower, depending on > test almost 3xtimes slower > > [09:03:19] INFO: mols from smiles > [09:03:38] INFO: Results1: 19.44 seconds, 5 mols > [09:03:38] INFO: queries from smiles > [09:03:38] INFO: Results2: 0.36 seconds > > *[09:03:38] INFO: generating pattern fingerprints for mols * > *[09:04:54] INFO: Results3: 75.99 seconds* > [09:04:54] INFO: generating pattern fingerprints for queries > [09:04:56] INFO: Results4: 1.55 seconds > [09:04:56] INFO: testing frags queries > [09:05:34] INFO: Results5: 37.59 seconds. 6753 tested (0.0003 of total), > 3989 f > ound, 0.59 accuracy. 0 errors. > [09:05:34] INFO: testing leads queries > [09:06:11] INFO: Results6: 37.34 seconds. 1586 tested (0.0001 of total), > 1067 f > ound, 0.67 accuracy. 0 errors. > [09:06:11] INFO: testing pieces queries > [09:08:39] INFO: Results7: 147.79 seconds. 202 tested (0.0810 of > total), 19 > 25628 found, 0.58 accuracy. 0 errors. > | 2019.03.3 | 19.4 | 0.4 | 76.0 | 1.5 | 37.6 | 37.3 | 147.8 | > > I thought maybe another issue with windows being slow so I tested on a > linux VM on my laptop > > *Machine 2:* > Virtual machine, Lubuntu 16.04 on a laptop i7-8850H 6-core > > [09:23:31] INFO: mols from smiles > [09:23:54] INFO: Results1: 23.71 seconds, 5 mols > [09:23:54] INFO: queries from smiles > [09:23:55] INFO: Results2: 0.48 seconds > > *[09:23:55] INFO: generating pattern fingerprints for mols * > *[09:24:53] INFO: Results3: 58.31 seconds* > [09:24:53] INFO: generating pattern fingerprints for queries > [09:24:54] INFO: Results4: 1.19 seconds > [09:24:54] INFO: testing frags queries > [09:25:41] INFO: Results5: 46.22 seconds. 6753 tested (0.0003 of total), > 3989
[Rdkit-discuss] Constructing a mol object from a PDB ligand
Hello, everyone Is there a simple way to create a mol object from just the HETATM/ligand lines from a pdb-file? Would it be viable to create a function where you could create a mol object from specific lines within a pdb-file? Illimar Rekand Ph.D. candidate, Brenk-lab, Haug-lab Department of Biomedicine Department of Chemistry University of Bergen ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss