Re: [Rdkit-discuss] Sample RD Files?

2011-06-08 Thread Greg Landrum
[Replying to the list since this may be of general interest]

On Wed, Jun 8, 2011 at 10:00 AM, James Davidson j.david...@vernalis.com wrote:

 Apologies for the confusion - I will try to explain by example:

 from rdkit import Chem
 from rdkit.Chem import AllChem
 rsmarts=[C;X4;!H0:1][a:2][*:1](Br)[*:2]
 rxn=AllChem.ReactionFromSmarts(rsmarts)
 print AllChem.ReactionToRxnBlock(rxn)
 $RXN

   RDKit

   1  1
 $MOL

  RDKit

   2  1  0  0  0  0  0  0  0  0999 V2000
     0.    0.    0. *   0  0  0  0  0  0  0  0  0  1  0  0
     0.    0.    0. *   0  0  0  0  0  0  0  0  0  2  0  0
   1  2  1  0
 V    1 [CX4!H0:1]
 V    2 [a:2]
 M  END
 $MOL

  RDKit

   3  2  0  0  0  0  0  0  0  0999 V2000
     0.    0.    0. *   0  0  0  0  0  0  0  0  0  1  0  0
     0.    0.    0. Br  0  0  0  0  0  0  0  0  0  0  0  0
     0.    0.    0. *   0  0  0  0  0  0  0  0  0  2  0  0
   2  1  1  0
   1  3  1  0
 V    1 [*:1]
 V    3 [*:2]
 M  END



 So, we have a rxn that can be outputted as an atom-mapped reaction block
 (carrying query features).  What I want is the corresponding atom-mapping,
 but applied to the reactant and product of cases where the reactant is a
 match:

 smiles=c1c1C
 mol=Chem.MolFromSmiles(smiles)
 prods = rxn.RunReactants((mol,))
 prod = prods[0][0]

 So in this case I would like to get at the following (pseudo reaction for
 illustrative purposes):

 c1c{1}1C{2}  c1c{1}1C{2}Br

 (ie the 'specific' reaction, but carrying the atom numbers from the template
 - {n} denoting atom-map)

 As I said before, I am a bit rusty on this - so I may very well be missing
 something obvious!

This isn't super straightforward, but it is do-able. I'll show it for
this specific example, if you have problems generalizing, we can
iterate here.

# First get everything setup as you did above, including getting a product:
 from rdkit import Chem
 from rdkit.Chem import AllChem
 rsmarts=[C;X4;!H0:1][a:2][*:1](Br)[*:2]
 rxn=AllChem.ReactionFromSmarts(rsmarts)
 smiles=c1c1C
 mol=Chem.MolFromSmiles(smiles)
 prods = rxn.RunReactants((mol,))
 prod = prods[0][0]
 Chem.SanitizeMol(prod)

# Now get copies of the reactant and product templates from the reaction:
 rtmpl = rxn.GetReactantTemplate(0)
 ptmpl = rxn.GetProductTemplate(0)

# Find the mapping of the reactant template onto the reactant molecule:
 match = mol.GetSubstructMatch(rtmpl)

# copy the atom mapping information from the template to the reactant molecule:
 for tmplId,molId in enumerate(match):
...if rtmpl.GetAtomWithIdx(tmplId).HasProp('molAtomMapNumber'):
...   at = mol.GetAtomWithIdx(molId)
...   
at.SetProp('molAtomMapNumber',rtmpl.GetAtomWithIdx(tmplId).GetProp('molAtomMapNumber'))
...   # this is a hack. we need a query associated with an atom to
see the mapping info in SMARTS :
...   Chem.AddRecursiveQuery(mol,Chem.MolFromSmarts('*'),molId)

# repeat that process for the product:
 match = prod.GetSubstructMatch(ptmpl)

# copy the atom mapping information from the template to the reactant molecule:
 for tmplId,molId in enumerate(match):
...if ptmpl.GetAtomWithIdx(tmplId).HasProp('molAtomMapNumber'):
...   at = prod.GetAtomWithIdx(molId)
...   
at.SetProp('molAtomMapNumber',ptmpl.GetAtomWithIdx(tmplId).GetProp('molAtomMapNumber'))
...   # this is a hack. we need a query associated with an atom to
see the mapping info in SMARTS :
...   Chem.AddRecursiveQuery(prod,Chem.MolFromSmarts('*'),molId)

# test that the SMARTS for those are ok:
 print Chem.MolToSmarts(mol)
# output is: [#6]1:[#6]:[#6]:[#6]:[#6]:[#6$(*):2]:1-[#6$(*):1]
 print Chem.MolToSmarts(prod)
# output is: [#6$(*):1](-[#35])-[#6$(*):2]1:[#6]:[#6]:[#6]:[#6]:[#6]:1

# now build a new reaction:
 nrxn = AllChem.ChemicalReaction()
 nrxn.AddReactantTemplate(mol)
 nrxn.AddProductTemplate(prod)

# we can now get the SMARTS for the reaction:
 print AllChem.ReactionToSmarts(nrxn)
# output is: 
[#6]1:[#6]:[#6]:[#6]:[#6]:[#6$(*):2]:1-[#6$(*):1][#6$(*):1](-[#35])-[#6$(*):2]1:[#6]:[#6]:[#6]:[#6]:[#6]:1

# let's test the reaction to make sure it works.
# due to a (already reported) bug in the way atom properties are
handled, nrxn cannot be directly used,
# so we use a hack and reparse it:
 nrxn = AllChem.ReactionFromSmarts(AllChem.ReactionToSmarts(nrxn))
 nrxn.Validate()
# now we can run a molecule through to make sure it works:
 nmol = Chem.MolFromSmiles('c1c1C')
 nps = nrxn.RunReactants((nmol,))
 print Chem.MolToSmiles(nps[0][0])
# output is: BrCc1c1

Is that what you're looking for?

 PS  I should say that I may have made a mistake with this example(?) because
 I kept getting an error when trying to get the MolBlock of the product that
 was rectified if I returned the Smiles first:

 prod = prods[0][0]
 print Chem.MolToMolBlock(prod)
 Traceback (most recent call last):
   File pyshell#35, line 1, in module
     print Chem.MolToMolBlock(prod)
 RuntimeError: Pre-condition Violation

The 

Re: [Rdkit-discuss] ML, rdkit, 3class model

2011-06-08 Thread Paul . Czodrowski
Dear all,

 Dear Paul,

 On Tue, Jun 7, 2011 at 4:54 PM,  paul.czodrow...@merck.de wrote:
 
  Dear folks,
 
  finally, I updated the Wiki entry for the 3class model:
  http://code.google.com/p/rdkit/wiki/TrainAThreeClassSolubilityModel
 
 
  Do you have any explanation for the bad statistics? [see at the end
The
  output]
 
  Of course, this is not a simple question. But maybe I did stupid
mistake..
 

 Aside from what looks like a mis-labeling of the data points (it looks
 like you've labeled the high solubility points low and vice versa), I
 don't see anything obviously wrong. I don't think that I would expect
 a fingerprint-based model to be able to do a particularly good job of
 being able to model solubility, so I'm not particularly surprised that
 you're getting bad stats.

Regarding the classification, this should be correct. The solubility data
is given in the molar range. Therefore, very low logS numbers ( -3)
correspond to very insoluble compounds.
Or am I wrong?



 I tried your dataset using a descriptor-based model based on the
 example in this wiki page:
 http://code.google.com/p/rdkit/wiki/BuildingModelsUsingDescriptors1
 The descriptor calculation stays the same, but I had to tell the
 model-building code to use three classes, which involved changing the
 line:
 nPossible = [0]+[2]*ndescrs+[2]
 to:
 nPossible = [0]+[2]*ndescrs+[3]
 and the call to ShowVoteResults from:
 res = ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 2, 0,
 errorEstimate=True)
 to:
 res = ScreenComposite.ShowVoteResults(range(len(pts)), pts, cmp, 3, 0,
 errorEstimate=True)

 Doing that I get the following results:

*** Vote Results ***
 misclassified: 232/1025 (%22.63)   232/1025 (%22.63)

 average correct confidence:0.9485
 average incorrect confidence:  0.8434

Results Table:

  132  40   0  |  63.77
   73 313  69  |  77.67
1  49 348  |  83.25
  --- --- ---
64.08   77.86   83.45


The statistics looks indeed better.


Thanks so far,

Paul



 Which shows that something something reasonable is happening. It's not
 (IMO) a bad start for a solubility model, certainly from here one
 could start doing parameter tuning to try and improve the model.

 -greg

This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://disclaimer.merck.de to access the German, French, Spanish and
Portuguese versions of this disclaimer.


--
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] ML, rdkit, 3class model

2011-06-08 Thread Greg Landrum
On Wed, Jun 8, 2011 at 1:35 PM,  paul.czodrow...@merck.de wrote:

 Aside from what looks like a mis-labeling of the data points (it looks
 like you've labeled the high solubility points low and vice versa), I
 don't see anything obviously wrong. I don't think that I would expect
 a fingerprint-based model to be able to do a particularly good job of
 being able to model solubility, so I'm not particularly surprised that
 you're getting bad stats.

 Regarding the classification, this should be correct. The solubility data
 is given in the molar range. Therefore, very low logS numbers ( -3)
 correspond to very insoluble compounds.
 Or am I wrong?

There's just a disconnect in the text on the wiki page, where you say
that there are 417 compounds in class A (low):

(A) low 417 compounds
(B) medium  402 compounds
(C) high206 compounds

 and the confusion matrix at the bottom that shows 417 compounds in class C:

calc
 (A) (B)  (C)
e (A)  4  16  186
x (B)  0 100  302
p (C)  0  23  394

-greg

--
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Sample RD Files?

2011-06-08 Thread James Davidson
Hi Greg,

Thanks for the python-full reply!

 # let's test the reaction to make sure it works.
 # due to a (already reported) bug in the way atom properties 
 are handled, nrxn cannot be directly used, # so we use a hack 
 and reparse it:
  nrxn = AllChem.ReactionFromSmarts(AllChem.ReactionToSmarts(nrxn))
  nrxn.Validate()
 # now we can run a molecule through to make sure it works:
  nmol = Chem.MolFromSmiles('c1c1C') nps = 
  nrxn.RunReactants((nmol,)) print Chem.MolToSmiles(nps[0][0])
 # output is: BrCc1c1
 
 Is that what you're looking for?

It certainly allows me to do what I want - which is get a mapped RXN
out.  And this can even be done with coordinates - which I have added
below as a reminder to anyone (which included me until about 10 mins
ago!) who had forgotten:

AllChem.Compute2DCoordsForReaction(nrxn)
rxnBlock = AllChem.ReactionToRxnBlock(nrxn)


So Thanks very much!  : )

 - and thanks for the reminder about sanitizing products from
reactions...

 The molecules that come back from reactions have not been 
 sanitized, so all you need to do is add a call to 
 Chem.SanitizeMol first:
 
  Chem.SanitizeMol(prod)

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
Oakdene Court
613 Reading Road
Winnersh, Berkshire
RG41 5UA.
Tel: +44 118 977 3133

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the Company address and 
registration details link at the bottom of the page..
__

--
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] access to properties block information from sd file

2011-06-08 Thread Donald Keidel

Eddie,

Thank you for your response and help.

I also have an SDF that looks like the following:

  -ISIS-  01241116242D

 14 15  0  0  0  0  0  0  0  0999 V2000
7.4170   -5.11770. N   0  0  3  0  0  0  0  0  0  0  0  0
6.7037   -5.52920. C   0  0  0  0  0  0  0  0  0  0  0  0
7.4199   -4.29580. C   0  0  0  0  0  0  0  0  0  0  0  0
6.7037   -3.87920. C   0  0  0  0  0  0  0  0  0  0  0  0
5.9916   -5.12080. C   0  0  0  0  0  0  0  0  0  0  0  0
5.9876   -4.29550. N   0  0  3  0  0  0  0  0  0  0  0  0
5.2069   -4.03900. N   0  0  0  0  0  0  0  0  0  0  0  0
4.7234   -4.70180. C   0  0  0  0  0  0  0  0  0  0  0  0
5.2068   -5.37290. C   0  0  0  0  0  0  0  0  0  0  0  0
6.7037   -6.35420. O   0  0  0  0  0  0  0  0  0  0  0  0
4.9519   -6.16170. C   0  0  0  0  0  0  0  0  0  0  0  0
4.1491   -6.33740. O   0  0  0  0  0  0  0  0  0  0  0  0
8.1405   -5.53120. *   0  0  0  0  0  0  0  0  0  0  0  0
3.8984   -4.69770. *   0  0  0  0  0  0  0  0  0  0  0  0
  2 10  2  0  0  0  0
  5  6  1  0  0  0  0
  9 11  1  0  0  0  0
 11 12  1  0  0  0  0
  6  4  1  0  0  0  0
  5  2  1  0  0  0  0
  2  1  1  0  0  0  0
  1  3  1  0  0  0  0
  6  7  1  0  0  0  0
  7  8  2  0  0  0  0
  1 13  1  0  0  0  0
  8  9  1  0  0  0  0
  8 14  1  0  0  0  0
  3  4  2  0  0  0  0
G   13  1
R2
G   14  8
R1
M  STY  2   1 SUP   2 SUP
M  SLB  2   1   1   2   2
M  SAL   1  1  13
M  SBL   1  1  11
M  SMT   1 R2
M  SBV   1  11   -0.72000.4100
M  SAL   2  1  14
M  SBL   2  1  14
M  SMT   2 R1
M  SBV   2  140.8300   -0.0100
M  END
 ID (1)
1

 CATALOG_NUMBER
CM0215





As you can see this one has S Group properties.  Do you know if RDKit 
parses this also?  I have tried to look into this by doing the following 
per your example:


 from rdkit import Chem
 m = Chem.MolFromMolFile('ChemDiv_templates_1_first_one_only.sdf')
[13:33:27]  deprecated group abbreviation ignored
[13:33:27]  deprecated group abbreviation ignored
 a = m.GetAtoms()[12]
 prop = a.GetPropNames()
 prop[0]
'_CIPRank'
 prop[1]
'__computedProps'

As you know you cannot GetProp yet due to potential bug, but I do not 
see S Group Property at all.  Is this correct?


Thank you for your help.

Don



On 06/07/2011 05:29 PM, Eddie Cao wrote:

prop = a.GetPropNames()
--
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] access to properties block information from sd file

2011-06-08 Thread Eddie Cao
Hi,

AFAIK, these properties are not parsed. The following list showing supported 
properties is extracted from function ParseMolBlockProperties in 
$RDBASE/Code/GraphMol/FileParsers/MolFileParser.cpp

M  ALS
M  ISO
M  RGP
M  RBC
M  SUB
M  UNS
M  CHG
M  RAD

If you are comfortable about coding in C++, you should be able to quickly add 
support of other properties by duplicating one of the parsing functions (such 
as ParseIsotopeLine) and adapting from there.

Regards,
Eddie

On Jun 8, 2011, at 1:40 PM, Donald Keidel wrote:

 Eddie,
 
 Thank you for your response and help.
 
 I also have an SDF that looks like the following:
 
   -ISIS-  01241116242D
 
  14 15  0  0  0  0  0  0  0  0999 V2000
 7.4170   -5.11770. N   0  0  3  0  0  0  0  0  0  0  0  0
 6.7037   -5.52920. C   0  0  0  0  0  0  0  0  0  0  0  0
 7.4199   -4.29580. C   0  0  0  0  0  0  0  0  0  0  0  0
 6.7037   -3.87920. C   0  0  0  0  0  0  0  0  0  0  0  0
 5.9916   -5.12080. C   0  0  0  0  0  0  0  0  0  0  0  0
 5.9876   -4.29550. N   0  0  3  0  0  0  0  0  0  0  0  0
 5.2069   -4.03900. N   0  0  0  0  0  0  0  0  0  0  0  0
 4.7234   -4.70180. C   0  0  0  0  0  0  0  0  0  0  0  0
 5.2068   -5.37290. C   0  0  0  0  0  0  0  0  0  0  0  0
 6.7037   -6.35420. O   0  0  0  0  0  0  0  0  0  0  0  0
 4.9519   -6.16170. C   0  0  0  0  0  0  0  0  0  0  0  0
 4.1491   -6.33740. O   0  0  0  0  0  0  0  0  0  0  0  0
 8.1405   -5.53120. *   0  0  0  0  0  0  0  0  0  0  0  0
 3.8984   -4.69770. *   0  0  0  0  0  0  0  0  0  0  0  0
   2 10  2  0  0  0  0
   5  6  1  0  0  0  0
   9 11  1  0  0  0  0
  11 12  1  0  0  0  0
   6  4  1  0  0  0  0
   5  2  1  0  0  0  0
   2  1  1  0  0  0  0
   1  3  1  0  0  0  0
   6  7  1  0  0  0  0
   7  8  2  0  0  0  0
   1 13  1  0  0  0  0
   8  9  1  0  0  0  0
   8 14  1  0  0  0  0
   3  4  2  0  0  0  0
 G   13  1
 R2
 G   14  8
 R1
 M  STY  2   1 SUP   2 SUP
 M  SLB  2   1   1   2   2
 M  SAL   1  1  13
 M  SBL   1  1  11
 M  SMT   1 R2
 M  SBV   1  11   -0.72000.4100
 M  SAL   2  1  14
 M  SBL   2  1  14
 M  SMT   2 R1
 M  SBV   2  140.8300   -0.0100
 M  END
   ID (1)
 1
 
   CATALOG_NUMBER
 CM0215
 
 
 
 
 
 As you can see this one has S Group properties.  Do you know if RDKit parses 
 this also?  I have tried to look into this by doing the following per your 
 example:
 
  from rdkit import Chem
  m = Chem.MolFromMolFile('ChemDiv_templates_1_first_one_only.sdf')
 [13:33:27]  deprecated group abbreviation ignored
 [13:33:27]  deprecated group abbreviation ignored
  a = m.GetAtoms()[12]
  prop = a.GetPropNames()
  prop[0]
 '_CIPRank'
  prop[1]
 '__computedProps'
 
 As you know you cannot GetProp yet due to potential bug, but I do not see S 
 Group Property at all.  Is this correct?
 
 Thank you for your help.
 
 Don
 
 
 
 On 06/07/2011 05:29 PM, Eddie Cao wrote:
 
 prop = a.GetPropNames()

--
EditLive Enterprise is the world's most technically advanced content
authoring tool. Experience the power of Track Changes, Inline Image
Editing and ensure content is compliant with Accessibility Checking.
http://p.sf.net/sfu/ephox-dev2dev___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss