Re: [Rdkit-discuss] access to properties block information from sd file
Hi, AFAIK, these properties are not parsed. The following list showing supported properties is extracted from function ParseMolBlockProperties in $RDBASE/Code/GraphMol/FileParsers/MolFileParser.cpp M ALS M ISO M RGP M RBC M SUB M UNS M CHG M RAD If you are comfortable about coding in C++, you should be able to quickly add support of other properties by duplicating one of the parsing functions (such as ParseIsotopeLine) and adapting from there. Regards, Eddie On Jun 8, 2011, at 1:40 PM, Donald Keidel wrote: > Eddie, > > Thank you for your response and help. > > I also have an SDF that looks like the following: > > -ISIS- 01241116242D > > 14 15 0 0 0 0 0 0 0 0999 V2000 > 7.4170 -5.11770. N 0 0 3 0 0 0 0 0 0 0 0 0 > 6.7037 -5.52920. C 0 0 0 0 0 0 0 0 0 0 0 0 > 7.4199 -4.29580. C 0 0 0 0 0 0 0 0 0 0 0 0 > 6.7037 -3.87920. C 0 0 0 0 0 0 0 0 0 0 0 0 > 5.9916 -5.12080. C 0 0 0 0 0 0 0 0 0 0 0 0 > 5.9876 -4.29550. N 0 0 3 0 0 0 0 0 0 0 0 0 > 5.2069 -4.03900. N 0 0 0 0 0 0 0 0 0 0 0 0 > 4.7234 -4.70180. C 0 0 0 0 0 0 0 0 0 0 0 0 > 5.2068 -5.37290. C 0 0 0 0 0 0 0 0 0 0 0 0 > 6.7037 -6.35420. O 0 0 0 0 0 0 0 0 0 0 0 0 > 4.9519 -6.16170. C 0 0 0 0 0 0 0 0 0 0 0 0 > 4.1491 -6.33740. O 0 0 0 0 0 0 0 0 0 0 0 0 > 8.1405 -5.53120. * 0 0 0 0 0 0 0 0 0 0 0 0 > 3.8984 -4.69770. * 0 0 0 0 0 0 0 0 0 0 0 0 > 2 10 2 0 0 0 0 > 5 6 1 0 0 0 0 > 9 11 1 0 0 0 0 > 11 12 1 0 0 0 0 > 6 4 1 0 0 0 0 > 5 2 1 0 0 0 0 > 2 1 1 0 0 0 0 > 1 3 1 0 0 0 0 > 6 7 1 0 0 0 0 > 7 8 2 0 0 0 0 > 1 13 1 0 0 0 0 > 8 9 1 0 0 0 0 > 8 14 1 0 0 0 0 > 3 4 2 0 0 0 0 > G 13 1 > R2 > G 14 8 > R1 > M STY 2 1 SUP 2 SUP > M SLB 2 1 1 2 2 > M SAL 1 1 13 > M SBL 1 1 11 > M SMT 1 R2 > M SBV 1 11 -0.72000.4100 > M SAL 2 1 14 > M SBL 2 1 14 > M SMT 2 R1 > M SBV 2 140.8300 -0.0100 > M END > > (1) > 1 > > > > CM0215 > > > > > > As you can see this one has S Group properties. Do you know if RDKit parses > this also? I have tried to look into this by doing the following per your > example: > > >>> from rdkit import Chem > >>> m = Chem.MolFromMolFile('ChemDiv_templates_1_first_one_only.sdf') > [13:33:27] deprecated group abbreviation ignored > [13:33:27] deprecated group abbreviation ignored > >>> a = m.GetAtoms()[12] > >>> prop = a.GetPropNames() > >>> prop[0] > '_CIPRank' > >>> prop[1] > '__computedProps' > > As you know you cannot GetProp yet due to potential bug, but I do not see S > Group Property at all. Is this correct? > > Thank you for your help. > > Don > > > > On 06/07/2011 05:29 PM, Eddie Cao wrote: >> >> prop = a.GetPropNames() -- EditLive Enterprise is the world's most technically advanced content authoring tool. Experience the power of Track Changes, Inline Image Editing and ensure content is compliant with Accessibility Checking. http://p.sf.net/sfu/ephox-dev2dev___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] access to properties block information from sd file
Eddie, Thank you for your response and help. I also have an SDF that looks like the following: -ISIS- 01241116242D 14 15 0 0 0 0 0 0 0 0999 V2000 7.4170 -5.11770. N 0 0 3 0 0 0 0 0 0 0 0 0 6.7037 -5.52920. C 0 0 0 0 0 0 0 0 0 0 0 0 7.4199 -4.29580. C 0 0 0 0 0 0 0 0 0 0 0 0 6.7037 -3.87920. C 0 0 0 0 0 0 0 0 0 0 0 0 5.9916 -5.12080. C 0 0 0 0 0 0 0 0 0 0 0 0 5.9876 -4.29550. N 0 0 3 0 0 0 0 0 0 0 0 0 5.2069 -4.03900. N 0 0 0 0 0 0 0 0 0 0 0 0 4.7234 -4.70180. C 0 0 0 0 0 0 0 0 0 0 0 0 5.2068 -5.37290. C 0 0 0 0 0 0 0 0 0 0 0 0 6.7037 -6.35420. O 0 0 0 0 0 0 0 0 0 0 0 0 4.9519 -6.16170. C 0 0 0 0 0 0 0 0 0 0 0 0 4.1491 -6.33740. O 0 0 0 0 0 0 0 0 0 0 0 0 8.1405 -5.53120. * 0 0 0 0 0 0 0 0 0 0 0 0 3.8984 -4.69770. * 0 0 0 0 0 0 0 0 0 0 0 0 2 10 2 0 0 0 0 5 6 1 0 0 0 0 9 11 1 0 0 0 0 11 12 1 0 0 0 0 6 4 1 0 0 0 0 5 2 1 0 0 0 0 2 1 1 0 0 0 0 1 3 1 0 0 0 0 6 7 1 0 0 0 0 7 8 2 0 0 0 0 1 13 1 0 0 0 0 8 9 1 0 0 0 0 8 14 1 0 0 0 0 3 4 2 0 0 0 0 G 13 1 R2 G 14 8 R1 M STY 2 1 SUP 2 SUP M SLB 2 1 1 2 2 M SAL 1 1 13 M SBL 1 1 11 M SMT 1 R2 M SBV 1 11 -0.72000.4100 M SAL 2 1 14 M SBL 2 1 14 M SMT 2 R1 M SBV 2 140.8300 -0.0100 M END > (1) 1 > CM0215 As you can see this one has S Group properties. Do you know if RDKit parses this also? I have tried to look into this by doing the following per your example: >>> from rdkit import Chem >>> m = Chem.MolFromMolFile('ChemDiv_templates_1_first_one_only.sdf') [13:33:27] deprecated group abbreviation ignored [13:33:27] deprecated group abbreviation ignored >>> a = m.GetAtoms()[12] >>> prop = a.GetPropNames() >>> prop[0] '_CIPRank' >>> prop[1] '__computedProps' As you know you cannot GetProp yet due to potential bug, but I do not see S Group Property at all. Is this correct? Thank you for your help. Don On 06/07/2011 05:29 PM, Eddie Cao wrote: prop = a.GetPropNames() -- EditLive Enterprise is the world's most technically advanced content authoring tool. Experience the power of Track Changes, Inline Image Editing and ensure content is compliant with Accessibility Checking. http://p.sf.net/sfu/ephox-dev2dev___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Sample RD Files?
Hi Greg, Thanks for the python-full reply! > # let's test the reaction to make sure it works. > # due to a (already reported) bug in the way atom properties > are handled, nrxn cannot be directly used, # so we use a hack > and reparse it: > >>> nrxn = AllChem.ReactionFromSmarts(AllChem.ReactionToSmarts(nrxn)) > >>> nrxn.Validate() > # now we can run a molecule through to make sure it works: > >>> nmol = Chem.MolFromSmiles('c1c1C') nps = > >>> nrxn.RunReactants((nmol,)) print Chem.MolToSmiles(nps[0][0]) > # output is: BrCc1c1 > > Is that what you're looking for? It certainly allows me to do what I want - which is get a mapped RXN out. And this can even be done with coordinates - which I have added below as a reminder to anyone (which included me until about 10 mins ago!) who had forgotten: AllChem.Compute2DCoordsForReaction(nrxn) rxnBlock = AllChem.ReactionToRxnBlock(nrxn) So Thanks very much! : ) - and thanks for the reminder about sanitizing products from reactions... > The molecules that come back from reactions have not been > sanitized, so all you need to do is add a call to > Chem.SanitizeMol first: > > >>> Chem.SanitizeMol(prod) Kind regards James __ PLEASE READ: This email is confidential and may be privileged. It is intended for the named addressee(s) only and access to it by anyone else is unauthorised. If you are not an addressee, any disclosure or copying of the contents of this email or any action taken (or not taken) in reliance on it is unauthorised and may be unlawful. If you have received this email in error, please notify the sender or postmas...@vernalis.com. Email is not a secure method of communication and the Company cannot accept responsibility for the accuracy or completeness of this message or any attachment(s). Please check this email for virus infection for which the Company accepts no responsibility. If verification of this email is sought then please request a hard copy. Unless otherwise stated, any views or opinions presented are solely those of the author and do not represent those of the Company. The Vernalis Group of Companies Oakdene Court 613 Reading Road Winnersh, Berkshire RG41 5UA. Tel: +44 118 977 3133 To access trading company registration and address details, please go to the Vernalis website at www.vernalis.com and click on the "Company address and registration details" link at the bottom of the page.. __ -- EditLive Enterprise is the world's most technically advanced content authoring tool. Experience the power of Track Changes, Inline Image Editing and ensure content is compliant with Accessibility Checking. http://p.sf.net/sfu/ephox-dev2dev ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] ML, rdkit, 3class model
On Wed, Jun 8, 2011 at 1:35 PM, wrote: >> >> Aside from what looks like a mis-labeling of the data points (it looks >> like you've labeled the high solubility points low and vice versa), I >> don't see anything obviously wrong. I don't think that I would expect >> a fingerprint-based model to be able to do a particularly good job of >> being able to model solubility, so I'm not particularly surprised that >> you're getting bad stats. > > Regarding the classification, this should be correct. The solubility data > is given in the molar range. Therefore, very low logS numbers (< -3) > correspond to very insoluble compounds. > Or am I wrong? There's just a disconnect in the text on the wiki page, where you say that there are 417 compounds in class A (low): (A) low 417 compounds (B) medium 402 compounds (C) high206 compounds and the confusion matrix at the bottom that shows 417 compounds in class C: calc (A) (B) (C) e (A) 4 16 186 x (B) 0 100 302 p (C) 0 23 394 -greg -- EditLive Enterprise is the world's most technically advanced content authoring tool. Experience the power of Track Changes, Inline Image Editing and ensure content is compliant with Accessibility Checking. http://p.sf.net/sfu/ephox-dev2dev ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] ML, rdkit, 3class model
Dear all, > > Dear Paul, > > On Tue, Jun 7, 2011 at 4:54 PM, wrote: > > > > Dear folks, > > > > finally, I updated the Wiki entry for the 3class model: > > http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel > > > > > > Do you have any explanation for the bad statistics? [see at the end "The > > output"] > > > > Of course, this is not a simple question. But maybe I did stupid mistake.. > > > > Aside from what looks like a mis-labeling of the data points (it looks > like you've labeled the high solubility points low and vice versa), I > don't see anything obviously wrong. I don't think that I would expect > a fingerprint-based model to be able to do a particularly good job of > being able to model solubility, so I'm not particularly surprised that > you're getting bad stats. Regarding the classification, this should be correct. The solubility data is given in the molar range. Therefore, very low logS numbers (< -3) correspond to very insoluble compounds. Or am I wrong? > > I tried your dataset using a descriptor-based model based on the > example in this wiki page: > http://code.google.com/p/rdkit/wiki/BuildingModelsUsingDescriptors1 > The descriptor calculation stays the same, but I had to tell the > model-building code to use three classes, which involved changing the > line: > nPossible = [0]+[2]*ndescrs+[2] > to: > nPossible = [0]+[2]*ndescrs+[3] > and the call to ShowVoteResults from: > res = ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 2, 0, > errorEstimate=True) > to: > res = ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 3, 0, > errorEstimate=True) > > Doing that I get the following results: > >*** Vote Results *** > misclassified: 232/1025 (%22.63) 232/1025 (%22.63) > > average correct confidence:0.9485 > average incorrect confidence: 0.8434 > >Results Table: > > 132 40 0 | 63.77 > 73 313 69 | 77.67 >1 49 348 | 83.25 > --- --- --- >64.08 77.86 83.45 > The statistics looks indeed better. Thanks so far, Paul > Which shows that something something reasonable is happening. It's not > (IMO) a bad start for a solubility model, certainly from here one > could start doing parameter tuning to try and improve the model. > > -greg This message and any attachment are confidential and may be privileged or otherwise protected from disclosure. If you are not the intended recipient, you must not copy this message or attachment or disclose the contents to any other person. If you have received this transmission in error, please notify the sender immediately and delete the message and any attachment from your system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not accept liability for any omissions or errors in this message which may arise as a result of E-Mail-transmission or for damages resulting from any unauthorized changes of the content of this message and any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its subsidiaries do not guarantee that this message is free of viruses and does not accept liability for any damages caused by any virus transmitted therewith. Click http://disclaimer.merck.de to access the German, French, Spanish and Portuguese versions of this disclaimer. -- EditLive Enterprise is the world's most technically advanced content authoring tool. Experience the power of Track Changes, Inline Image Editing and ensure content is compliant with Accessibility Checking. http://p.sf.net/sfu/ephox-dev2dev ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Sample RD Files?
[Replying to the list since this may be of general interest] On Wed, Jun 8, 2011 at 10:00 AM, James Davidson wrote: > > Apologies for the confusion - I will try to explain by example: > from rdkit import Chem from rdkit.Chem import AllChem rsmarts="[C;X4;!H0:1][a:2]>>[*:1](Br)[*:2]" rxn=AllChem.ReactionFromSmarts(rsmarts) print AllChem.ReactionToRxnBlock(rxn) > $RXN > > RDKit > > 1 1 > $MOL > > RDKit > > 2 1 0 0 0 0 0 0 0 0999 V2000 > 0. 0. 0. * 0 0 0 0 0 0 0 0 0 1 0 0 > 0. 0. 0. * 0 0 0 0 0 0 0 0 0 2 0 0 > 1 2 1 0 > V 1 [C&X4&!H0:1] > V 2 [a:2] > M END > $MOL > > RDKit > > 3 2 0 0 0 0 0 0 0 0999 V2000 > 0. 0. 0. * 0 0 0 0 0 0 0 0 0 1 0 0 > 0. 0. 0. Br 0 0 0 0 0 0 0 0 0 0 0 0 > 0. 0. 0. * 0 0 0 0 0 0 0 0 0 2 0 0 > 2 1 1 0 > 1 3 1 0 > V 1 [*:1] > V 3 [*:2] > M END > > > > So, we have a rxn that can be outputted as an atom-mapped reaction block > (carrying query features). What I want is the corresponding atom-mapping, > but applied to the reactant and product of cases where the reactant is a > match: > smiles="c1c1C" mol=Chem.MolFromSmiles(smiles) prods = rxn.RunReactants((mol,)) prod = prods[0][0] > > So in this case I would like to get at the following (pseudo reaction for > illustrative purposes): > > c1c{1}1C{2} >> c1c{1}1C{2}Br > > (ie the 'specific' reaction, but carrying the atom numbers from the template > - {n} denoting atom-map) > > As I said before, I am a bit rusty on this - so I may very well be missing > something obvious! This isn't super straightforward, but it is do-able. I'll show it for this specific example, if you have problems generalizing, we can iterate here. # First get everything setup as you did above, including getting a product: >>> from rdkit import Chem >>> from rdkit.Chem import AllChem >>> rsmarts="[C;X4;!H0:1][a:2]>>[*:1](Br)[*:2]" >>> rxn=AllChem.ReactionFromSmarts(rsmarts) >>> smiles="c1c1C" >>> mol=Chem.MolFromSmiles(smiles) >>> prods = rxn.RunReactants((mol,)) >>> prod = prods[0][0] >>> Chem.SanitizeMol(prod) # Now get copies of the reactant and product templates from the reaction: >>> rtmpl = rxn.GetReactantTemplate(0) >>> ptmpl = rxn.GetProductTemplate(0) # Find the mapping of the reactant template onto the reactant molecule: >>> match = mol.GetSubstructMatch(rtmpl) # copy the atom mapping information from the template to the reactant molecule: >>> for tmplId,molId in enumerate(match): ...if rtmpl.GetAtomWithIdx(tmplId).HasProp('molAtomMapNumber'): ... at = mol.GetAtomWithIdx(molId) ... at.SetProp('molAtomMapNumber',rtmpl.GetAtomWithIdx(tmplId).GetProp('molAtomMapNumber')) ... # this is a hack. we need a query associated with an atom to see the mapping info in SMARTS : ... Chem.AddRecursiveQuery(mol,Chem.MolFromSmarts('*'),molId) # repeat that process for the product: >>> match = prod.GetSubstructMatch(ptmpl) # copy the atom mapping information from the template to the reactant molecule: >>> for tmplId,molId in enumerate(match): ...if ptmpl.GetAtomWithIdx(tmplId).HasProp('molAtomMapNumber'): ... at = prod.GetAtomWithIdx(molId) ... at.SetProp('molAtomMapNumber',ptmpl.GetAtomWithIdx(tmplId).GetProp('molAtomMapNumber')) ... # this is a hack. we need a query associated with an atom to see the mapping info in SMARTS : ... Chem.AddRecursiveQuery(prod,Chem.MolFromSmarts('*'),molId) # test that the SMARTS for those are ok: >>> print Chem.MolToSmarts(mol) # output is: [#6]1:[#6]:[#6]:[#6]:[#6]:[#6&$(*):2]:1-[#6&$(*):1] >>> print Chem.MolToSmarts(prod) # output is: [#6&$(*):1](-[#35])-[#6&$(*):2]1:[#6]:[#6]:[#6]:[#6]:[#6]:1 # now build a new reaction: >>> nrxn = AllChem.ChemicalReaction() >>> nrxn.AddReactantTemplate(mol) >>> nrxn.AddProductTemplate(prod) # we can now get the SMARTS for the reaction: >>> print AllChem.ReactionToSmarts(nrxn) # output is: [#6]1:[#6]:[#6]:[#6]:[#6]:[#6&$(*):2]:1-[#6&$(*):1]>>[#6&$(*):1](-[#35])-[#6&$(*):2]1:[#6]:[#6]:[#6]:[#6]:[#6]:1 # let's test the reaction to make sure it works. # due to a (already reported) bug in the way atom properties are handled, nrxn cannot be directly used, # so we use a hack and reparse it: >>> nrxn = AllChem.ReactionFromSmarts(AllChem.ReactionToSmarts(nrxn)) >>> nrxn.Validate() # now we can run a molecule through to make sure it works: >>> nmol = Chem.MolFromSmiles('c1c1C') >>> nps = nrxn.RunReactants((nmol,)) >>> print Chem.MolToSmiles(nps[0][0]) # output is: BrCc1c1 Is that what you're looking for? > PS I should say that I may have made a mistake with this example(?) because > I kept getting an error when trying to get the MolBlock of the product that > was rectified if I returned the Smiles first: > prod = prods[0][0] print