Re: [Rdkit-discuss] canonical SMILES of a fragment

2017-08-02 Thread Pavel Polishchuk

Thanks Greg!

  I found an alternative solution, which is also not so straightforward:
I set an isotope label on the aromatic atoms, generate isomeric SMILES, and
remove the label with a regex replacement.


  But your suggestion to switch off implicit hydrogens is important, since
they can cause other ambiguities.



import re

from rdkit import Chem
from rdkit.Chem import Atom, RWMol

m = RWMol()

# atoms 0-2: carbons without implicit Hs
for i in range(3):
    a = Atom(6)
    a.SetNoImplicit(True)  # remove implicit Hs
    m.AddAtom(a)

# atom 3: dummy atom used as the attachment point
a = Atom(0)
m.AddAtom(a)

m.GetAtomWithIdx(0).SetIsAromatic(True)  # set aromatic
m.GetAtomWithIdx(0).SetIsotope(42)   # set isotope

m.GetAtomWithIdx(3).SetAtomMapNum(1)

m.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

s = Chem.MolToSmiles(m, isomericSmiles=True)

# remove isotope in output SMILES
re.sub(r'\[[0-9]+([a-z]+)H?[0-9]?\]', r'\1', s)


OUTPUT: 'CC(c)[*:1]'
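
To see what the regular expression does in isolation, here is a small sketch on hand-written strings. The sample SMILES are made up for illustration (they are not RDKit output), and `strip_isotope_labels` is a hypothetical helper name. The pattern only touches bracket atoms that start with a number and contain a lowercase (aromatic) symbol, so map-numbered dummies such as [*:1] are left alone:

```python
import re

# Same pattern as in the message: an isotope number, a lowercase (aromatic)
# element symbol, and an optional hydrogen count inside square brackets.
ISOTOPE_LABEL = re.compile(r'\[[0-9]+([a-z]+)H?[0-9]?\]')


def strip_isotope_labels(smiles):
    """Replace e.g. '[42c]' or '[42cH]' by the bare aromatic symbol 'c'."""
    return ISOTOPE_LABEL.sub(r'\1', smiles)


print(strip_isotope_labels('CC([42c])[*:1]'))  # CC(c)[*:1]
print(strip_isotope_labels('C[42cH]'))         # Cc
print(strip_isotope_labels('[13CH3]C[*:1]'))   # unchanged, 'C' is uppercase
```

Note that a genuine isotope label on an aromatic atom would be stripped as well, so this trick is only safe when the input fragments carry no real isotope information.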

Pavel.






--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/rdkit-discuss







Re: [Rdkit-discuss] canonical SMILES of a fragment

2017-08-01 Thread Greg Landrum
Hi Pavel,

It is, unfortunately, not that easy.
The canonicalization algorithm does not use atomic aromaticity when
determining atom ordering, so as far as it is concerned there is no
difference between atoms 0 and 2 in either of your examples. What does get
used is the number of hydrogens, so you need to use that in order to get
the results you are looking for.[1] For technical reasons, you also need to
tell the RDKit that the atoms should not have implicit Hs attached. Here's
a gist that works for me:
https://gist.github.com/greglandrum/f4e2f2f2ad311560d8ab36874d503843

Two notes:
 1) I don't set the number of Hs on atom 1 in that gist, but I would
suggest doing that too.
 2) If atoms 0 and 2 have the same number of Hs attached, this still is not
going to work if you're building things from fragments. The
canonicalization code was not really designed to be used in situations like
this.

-greg
[1] The details of the canonicalization algorithm, including the contents
of the atom invariants, are described here:
http://dx.doi.org/10.1021/acs.jcim.5b00543
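
The gist itself is not reproduced here, but the idea described above can be sketched as follows. This is an illustration under assumptions, not Greg's code: the helper name `build_fragment` and the hydrogen counts (1 on the aromatic atom, 1 on the central atom, 3 on the methyl) are chosen for the example. Every atom gets an explicit hydrogen count and has implicit Hs switched off, so the two carbons attached to atom 1 no longer look identical to the canonicalizer, and both atom orderings should give the same string:

```python
from rdkit import Chem
from rdkit.Chem import Atom, RWMol


def build_fragment(aromatic_idx):
    """Build c-C(-C)-[*:1] with the aromatic carbon at index 0 or 2."""
    methyl_idx = 2 - aromatic_idx
    num_hs = {aromatic_idx: 1, 1: 1, methyl_idx: 3}  # assumed H counts
    m = RWMol()
    for i in range(3):
        a = Atom(6)
        a.SetNumExplicitHs(num_hs[i])
        a.SetNoImplicit(True)  # do not let RDKit add implicit Hs
        m.AddAtom(a)
    dummy = Atom(0)  # attachment point
    dummy.SetNoImplicit(True)
    dummy.SetAtomMapNum(1)
    m.AddAtom(dummy)
    m.GetAtomWithIdx(aromatic_idx).SetIsAromatic(True)
    for begin, end in ((0, 1), (1, 2), (1, 3)):
        m.AddBond(begin, end, Chem.rdchem.BondType.SINGLE)
    m.UpdatePropertyCache(strict=False)
    return m


s1 = Chem.MolToSmiles(build_fragment(0))
s2 = Chem.MolToSmiles(build_fragment(2))
print(s1, s2, s1 == s2)
```

As note 2 above says, this only helps when the hydrogen counts of the two atoms actually differ.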




[Rdkit-discuss] canonical SMILES of a fragment

2017-08-01 Thread Pavel Polishchuk

Hi all,

  canonicalization of fragment SMILES does not work properly. Below
are two examples of identical fragments. The only difference is
the order of atoms (indices). However, it seems that RDKit
canonicalization does not take into account atom types.


  Does anyone have an idea how to solve this issue at little cost?

#1 ===

from rdkit import Chem
from rdkit.Chem import Atom, RWMol

m = RWMol()

for i in range(3):
    a = Atom(6)
    m.AddAtom(a)
a = Atom(0)  # dummy atom (attachment point)
m.AddAtom(a)

m.GetAtomWithIdx(0).SetIsAromatic(True)  # set atom 0 as aromatic
m.GetAtomWithIdx(3).SetAtomMapNum(1)


m.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

Chem.MolToSmiles(m)

OUTPUT: 'cC(C)[*:1]'

#2 ===

m2 = RWMol()

for i in range(3):
    a = Atom(6)
    m2.AddAtom(a)
a = Atom(0)  # dummy atom (attachment point)
m2.AddAtom(a)

m2.GetAtomWithIdx(2).SetIsAromatic(True)  # set atom 2 as aromatic
m2.GetAtomWithIdx(3).SetAtomMapNum(1)


m2.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m2.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m2.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

Chem.MolToSmiles(m2)

OUTPUT: 'CC(c)[*:1]'


Pavel.



Re: [Rdkit-discuss] canonical smiles for fragments with map numbers

2017-05-27 Thread Pavel Polishchuk

Thank you, Brian!

Actually what I expected as output:

S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]
S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]
and so on

You pointed me in the right direction. I can store the old-to-new map
numbers in a dict, and after relabeling and generating the canonical SMILES
it will be easy to relabel the attachment points back.

Thank you again!

Pavel.




Re: [Rdkit-discuss] canonical smiles for fragments with map numbers

2017-05-27 Thread Brian Kelley
Pavel, this isn't exactly trivial so I went ahead and made an example.  The
basics are that atomMaps are canonicalized, i.e. their value is used in the
generation of smiles.

To solve this problem:
1) backup the atom maps and remove them
2) canonicalize *without* atom maps but figure out the order in which the
atoms in the molecule are output
3) using the atom output order, relabel the atom maps based on output order.

That's a mouthful, but here's some code that should do the trick:

from rdkit import Chem

smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
       "ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
       "ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
       "ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
       "ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
       "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]


def CanonicalizeMaps(m, *a, **kw):
    # atom maps are canonicalized, so rename them
    #  figure out where they would have gone
    #  and relabel from 1...N based on output order
    atomMap = "molAtomMapNumber"
    backupAtomMap = "oldMolAtomMapNumber"

    for atom in m.GetAtoms():
        if atom.HasProp(atomMap):
            atomNum = atom.GetProp(atomMap)
            atom.SetProp(backupAtomMap, atomNum)
            atom.ClearProp(atomMap)

    # canonicalize
    smi = Chem.MolToSmiles(m, *a, **kw)
    # where did the atoms end up in the output string?
    atoms = [(pos, atom_idx) for atom_idx, pos in enumerate(
        eval(m.GetProp("_smilesAtomOutputOrder")))]
    atommap = 1
    atoms.sort()

    # set the new atommap based on output position
    for pos, atom_idx in atoms:
        atom = m.GetAtomWithIdx(atom_idx)
        if atom.HasProp(backupAtomMap):
            atom.SetProp(atomMap, str(atommap))
            atommap += 1

    return Chem.MolToSmiles(m)


for s in smi:
    m = Chem.MolFromSmiles(s)
    print(CanonicalizeMaps(m, True))



Output:

S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]

Now, if you want the atomMaps in 1...2...3 output order, we could do that
as well, but it is even trickier.
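
A Python 3 variant of the recipe above, written as a sketch under a few assumptions: `canonicalize_maps` is a hypothetical name, the map numbers are read and written with `GetAtomMapNum`/`SetAtomMapNum` instead of the raw property, and the `_smilesAtomOutputOrder` property is assumed to list atom indices in the order they were written (its string form is parsed with a regular expression rather than `eval`). The relabeled SMILES may differ from the output shown above, so none is quoted here:

```python
import re

from rdkit import Chem


def canonicalize_maps(mol):
    """Renumber atom maps 1..N in the order the unmapped atoms are written."""
    mol = Chem.Mol(mol)  # work on a copy
    mapped = set()
    for atom in mol.GetAtoms():
        if atom.GetAtomMapNum():
            mapped.add(atom.GetIdx())
            atom.SetAtomMapNum(0)  # 0 clears the map number

    # Canonicalize without maps; RDKit records the atom output order.
    Chem.MolToSmiles(mol)
    order = [int(i) for i in
             re.findall(r'\d+', mol.GetProp('_smilesAtomOutputOrder'))]

    # Hand out new map numbers following that output order.
    next_map = 1
    for atom_idx in order:
        if atom_idx in mapped:
            mol.GetAtomWithIdx(atom_idx).SetAtomMapNum(next_map)
            next_map += 1
    return Chem.MolToSmiles(mol)


smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
       "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]
results = {canonicalize_maps(Chem.MolFromSmiles(s)) for s in smi}
print(results)
```

The original labels are discarded here; to restore them afterwards, the old and new numbers would have to be recorded while relabeling.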

Enjoy,
 Brian

On Sat, May 27, 2017 at 8:36 AM, Pavel Polishchuk 
wrote:

> Hi,
>
>   I cannot solve an issue and would like to ask for advice.
>   If the attachment points of the same fragment carry different map
> numbers, different canonical SMILES are generated.
>   I observed such behavior only for fragments with 3 attachment points.
> Below is an example.
>   I'm looking for a solution/workaround to produce the "same" SMILES
> strings irrespective of the mapping, so that after removal of the map
> numbers the SMILES become identical.
>   Any advice would be appreciated.
>
> smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
>        "ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
>        "ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
>        "ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
>        "ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
>        "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]
>
> for s in smi:
>     print(Chem.MolToSmiles(Chem.MolFromSmiles(s)))
>
> output:
> S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
> S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]
> S=c1c([*:1])c([*:3])[nH]c(Cl)c1[*:2]
> S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]
> S=c1c([*:1])c([*:2])[nH]c(Cl)c1[*:3]
> S=c1c([*:2])c([*:1])[nH]c(Cl)c1[*:3]
>
> Kind regards,
> Pavel.


Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2011-01-04 Thread James Davidson
Hi Greg,

 On Sat, Dec 18, 2010 at 6:27 AM, Greg Landrum 
 greg.land...@gmail.com wrote:
 
 I just checked in a set of changes that should get this 
 (mostly) working correctly. Here's a demonstration with Geldanamycin:
 
 In [7]: 
 smi=r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C
 )C\C2=C(/OC)C(=O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'
 
 In [8]: print Chem.CanonSmiles(smi)
 COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)/C=C(\C)[...@h](OC(N
 )=O)[C@@H](OC)/C=C\C=C(/C)C(=O)NC(=CC1=O)C2=O

Thanks for looking into this so quickly!

 It would be *really* useful to have some more real-world 
 cases like this one to use as tests. So if you happen to have 
 others you can send I would be quite happy to have them.

On that note, I have added a comment to the bug tracker
(https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=160139&atid=814650)
but was not sure how to attach a file (e.g. an SDF) there, so apologies
for it ending up on more lines than I intended...  Also, I logged in with
my Google account, but it looks like it may not be clear who it is!

The first two examples are two marine natural products that only differ
in the geometry of the double bond in the medium ring.  The final
example is a cis- analogue that I synthesised during my PhD for which a
crystal structure was also obtained.  The stereochemistry in these
systems is 'challenging' to say the least, so I thought they would make
reasonable test cases.  I should say that even for the cis- double bond
cases, RDKit does a rather ugly job of the 2D depiction - but I am not
sure if other depictors will perform much better...

On a related note, I was keen to manually double-check the
stereochemistry that had been assigned to each of the chiral centres
(particularly the ones involving the 9-5 ring connections - as these are
potentially troublesome), and found myself wishing there was a way to
easily label a 2D depiction of the molecules with the atom ID.  What I
ended-up doing was the following:

1.  Getting the R/S info + atomIdx back from RDKit (example output):
    >>> Chem.FindMolChiralCenters(mol)
    [(3, 'R'), (7, 'R'), (8, 'S'), (9, 'R'), (11, 'R'), (18, 'R'), (24, 'R')]
2.  Opening the molfile in a program where I know how to label with atom
    IDs (PyMOL)
3.  Checking which atom is which manually (I had to add 1 to the RDKit
    atomIdx values as they start at 0), then double-checking against
    reference values.
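
One way to do this labelling inside RDKit, sketched here on L-alanine rather than on the molecules discussed above (the SMILES is an example chosen for illustration): list the stereocentres with `Chem.FindMolChiralCenters` and then write each atom's index into its atom map number so that it shows up in the SMILES. The indices are shifted by one to match the 1-based numbering used by other programs:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('C[C@H](N)C(=O)O')  # L-alanine, example input

# (atom index, CIP label) pairs; the indices are 0-based
centers = Chem.FindMolChiralCenters(mol)
print(centers)

# Label every atom with index + 1 via its atom map number.
labelled = Chem.Mol(mol)
for atom in labelled.GetAtoms():
    atom.SetAtomMapNum(atom.GetIdx() + 1)
print(Chem.MolToSmiles(labelled))
```

The map numbers also influence the atom order in the output SMILES, so the labelled string is meant for inspection only, not as a canonical identifier.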

RDKit performed admirably - but I presume this is dependent on the
quality of the wedge info coming in from the SDF(?)

Kind regards

James

__
PLEASE READ: This email is confidential and may be privileged. It is intended 
for the named addressee(s) only and access to it by anyone else is 
unauthorised. If you are not an addressee, any disclosure or copying of the 
contents of this email or any action taken (or not taken) in reliance on it is 
unauthorised and may be unlawful. If you have received this email in error, 
please notify the sender or postmas...@vernalis.com. Email is not a secure 
method of communication and the Company cannot accept responsibility for the 
accuracy or completeness of this message or any attachment(s). Please check 
this email for virus infection for which the Company accepts no responsibility. 
If verification of this email is sought then please request a hard copy. Unless 
otherwise stated, any views or opinions presented are solely those of the 
author and do not represent those of the Company.

The Vernalis Group of Companies
Oakdene Court
613 Reading Road
Winnersh, Berkshire
RG41 5UA.
Tel: +44 118 977 3133

To access trading company registration and address details, please go to the 
Vernalis website at www.vernalis.com and click on the Company address and 
registration details link at the bottom of the page..
__



Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2010-12-28 Thread Greg Landrum
On Sat, Dec 18, 2010 at 6:27 AM, Greg Landrum greg.land...@gmail.com wrote:

   For 'classic' aliphatic systems, double-bonds in
  3-7-membered rings can only sensibly exist in the cis orientation, so
  'ignoring' them would be ok.  However, for 8-membered and above, cis or
  trans are certainly both possible, so it becomes more important to keep
  track - particularly if canonical smiles are being used to check for
  unique structures, as my colleague was doing with the geldanamycin
  example above.

 yeah, that's clear: for larger ring systems the information should be
 preserved. That's very easy to do. The more difficult part is going to
 be making sure the output is actually canonical. I've entered a bug
 for this
 (https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=160139&atid=814650)
 and I'll take a look to try and get it fixed (and correct).

I just checked in a set of changes that should get this (mostly)
working correctly. Here's a demonstration with Geldanamycin:

In [7]: 
smi=r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C)C\C2=C(/OC)C(=O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'

In [8]: print Chem.CanonSmiles(smi)
COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)/C=C(\C)[...@h](OC(N)=O)[C@@H](OC)/C=C\C=C(/C)C(=O)NC(=CC1=O)C2=O

At least according to Marvin, those two structures are the same.

One very important caveat: I have not modified the depiction code to
generate correct coordinates for trans bonds in cycles. All
coordinates for ring systems still have all cis bonds. This has an
impact if you write an SD (or mol) file : the stereochemistry captured
in that file will be incorrect. I've entered a bug report for this
(https://sourceforge.net/tracker/?func=detail&aid=3147014&group_id=160139&atid=814650)
so that it doesn't get lost, but I suspect this is going to be a tough
one to fix and I'm not at all sure when it will be done.

It would be *really* useful to have some more real-world cases like
this one to use as tests. So if you happen to have others you can send
I would be quite happy to have them.

Best Regards,
-greg



[Rdkit-discuss] Canonical smiles for medium and large rings?

2010-12-17 Thread James Davidson
Dear All,
 
I have been investigating an issue that a colleague of mine identified.
He was working with the RDKit Canon Smiles node in Knime, and found that
for the natural product, Geldanamycin, the double-bond geometry
information was being lost during canonicalisation.  I repeated this
result outside of knime:
 
from rdkit import Chem
from rdkit.Chem import AllChem

 smi =
r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C)C\C2=C(/OC)C(
=O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'
 AllChem.CanonSmiles(smi)

'COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)C=C(C)[...@h](OC(N)=O)[C@@H](
OC)C=CC=C(C)C(=O)NC(=CC1=O)C2=O'


The simpler example below may be better:

 smi1 = r'O1CC/C=C\1' # cyclic ether
 smi2 = r'OCC/C=C\' # corresponding acyclic alcohol

 AllChem.CanonSmiles(smi1)
'C1C=CCCOCCC1' - stereochemistry lost
 AllChem.CanonSmiles(smi2)
'/C=C\\CCO' - stereochemistry retained


So, I am guessing that double-bonds in rings are being 'ignored'(?) by
the canonicaliser?  For 'classic' aliphatic systems, double-bonds in
3-7-membered rings can only sensibly exist in the cis orientation, so
'ignoring' them would be ok.  However, for 8-membered and above, cis or
trans are certainly both possible, so it becomes more important to keep
track - particularly if canonical smiles are being used to check for
unique structures, as my colleague was doing with the geldanamycin
example above.
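
A small way to check this behaviour on a given RDKit installation, sketched with cyclooctene and cyclohexene (the SMILES below are illustrative inputs, not taken from this thread). The expectation, which depends on the RDKit version and postdates this report, is that the cis and trans forms of the eight-membered ring give different canonical SMILES, while the bond markers are dropped for the six-membered ring:

```python
from rdkit import Chem

cis_8 = Chem.CanonSmiles(r'C1CCC/C=C\CC1')    # cis-cyclooctene
trans_8 = Chem.CanonSmiles(r'C1CCC/C=C/CC1')  # trans-cyclooctene
ring_6 = Chem.CanonSmiles(r'C1CC/C=C\C1')     # cyclohexene

print(cis_8, trans_8, cis_8 != trans_8)
print(ring_6)
```

If the first line prints two identical strings, the installed version still discards double-bond geometry in rings and canonical SMILES should not be used to deduplicate macrocycles.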
 
Any thoughts / suggestions are much appreciated as always!

Kind regards

James




Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread George Oakman

Hi,

 

Thank you all very much for all the detailed information, the link to the Dr. 
Dobb's article might become very useful.

 

Does someone know if I can assume that the canonical SMILES of RDKit are the 
same as the Open Babel ones?

 

Am I doing something wrong when responding to the mailing list? It looks like
all my answers are logged as separate messages as opposed to being logged in
the same thread. Please let me know, I don't want to make it all untidy!

 

Thanks.

 
 From: da...@dalkescientific.com
 Date: Fri, 13 Feb 2009 23:21:01 +0100
 To: rdkit-discuss@lists.sourceforge.net
 Subject: Re: [Rdkit-discuss] Canonical SMILES
 
 On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
  Yes, InChI is unique across different packages. This is because
  there is one definitive source for the code and algorithm. This was
  a design goal of InChI.
 
 
 Or to twist TJ's words around .. it's exactly the same as with 
 canonical SMILES - every implementation of InChI does it a different 
 way. It's just that there's only one InChI implementation.
 
  The book I was referring to is An Introduction to 
  Chemoinformatics from A.R. Leach and V.J. Gillet. Yes, they refer 
  to the CANGEN algorithm and to the Weininger paper you mentioned.
  It doesn't matter, as long as I'm aware of the scope of 
  'uniqueness'.
 
 Then it's an eerie coincidence that Schneider and Baringhaus use 
 exactly the same example, with exactly the same SMILES. ;)
 
 http://books.google.com/books?id=feNn-JcC1KgC&pg=PA25&lpg=PA25&dq=canonical+SMILES&source=web&ots=CeTadvKPxA&sig=46za2byYVjkOtYM1cs5-xs6Bch0&hl=en&ei=ia2VSbf1FMyL-gbbguWQCQ&sa=X&oi=book_result&resnum=6&ct=result
 
 
  in this case probably to do with which branch to deal with first)
 
 
 As I recall when trying to implement the algorithm, the ambiguity is 
 in dealing with ties. The algorithm assigns a unique ordering to the 
 atoms, up to symmetry, but it's defined at the atom level. Given an 
 atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be 
 in the same symmetry class, but with different bond types going to B1 
 and B2.
 
 I asked Weininger about it and he said choose the highest order bond 
 first, which mostly works but I think can be ambiguous for a few 
 rare cases.
 
 There may be other under-specified aspects. I haven't looked at the 
 paper in 10 years.
 
 Brian Kelley wrote an article about canonicalization, with code, for 
 Dr. Dobb's magazine. It's online at
 http://www.ddj.com/architect/184405341
 
 The algorithm isn't that hard to implement, and it can be useful (at 
 very rare times) for doing things like canonicalizing SMARTS.
 
 
 Andrew
 da...@dalkescientific.com
 
 
 

Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Andrew Dalke

On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
Does someone know if I can assume that the canonical SMILES of  
RDKit are the same as the Open Babel ones?


I wouldn't assume that without a lot of testing. My assumption
is that canonical SMILES generation is so implementation
sensitive that it's very unlikely two systems would do it the
same way unless that was a deliberate goal.

Which I know wasn't the case with those two implementations.

I think also that RDKit pays more attention to handling
stereochemistry than OpenBabel.

Am I doing something wrong in responding to the mailing list, it
looks like all my answers are logged as a separate message as
opposed to being logged in the same thread - please let me know, I
don't want to make it all untidy!


I don't use a threaded mail reader so I can't tell.

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Noel O'Boyle
2009/2/17 Andrew Dalke da...@dalkescientific.com:
 On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
 Does someone know if I can assume that the canonical SMILES of
 RDKit are the same as the Open Babel ones?

You can assume they are not the same. No attempt has been made to make
them consistent.

 I wouldn't assume that without a lot of testing. My assumption
 is that canonical SMILES generation is so implementation
 sensitive that it's very unlikely two systems would do it the
 same way unless that was a deliberate goal.

 Which I know wasn't the case with those two implementations.

 I think also that RDKit pays more attention to handling
 stereochemistry than OpenBabel.

 Am I doing something wrong in responding to the mailing list, it
 looks like all my answers are logged as a separate message as
 opposed to being logged in the same thread - please let me know, I
 don't want to make it all untidy!

 I don't use a threaded mail reader so I can't tell.
I use Gmail and everything is nicely threaded.

Andrew
da...@dalkescientific.com







Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Greg Landrum
On Fri, Feb 13, 2009 at 11:21 PM, Andrew Dalke
da...@dalkescientific.com wrote:
 On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
 Yes, InChI is unique across different packages.  This is because
 there is one definitive source for the code and algorithm.  This was
 a design goal of InChI.


 Or to twist TJ's words around .. it's exactly the same as with
 canonical SMILES - every implementation of InChI does it a different
 way. It's just that there's only one InChI implementation.

And since IUPAC has not only done an open implementation with a
reasonable license, but also trademarked the name and placed the
restriction on its use that you can't call it InChI unless you pass
their validation suite, InChI will hopefully remain a portable
canonical identifier.

 in this case probably to do with which branch to deal with first)


 As I recall when trying to implement the algorithm, the ambiguity is
 in dealing with ties. The algorithm assigns a unique ordering to the
 atoms, up to symmetry, but it's defined at the atom level. Given an
 atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be
 in the same symmetry class, but with different bond types going to B1
 and B2.

 I asked Weininger about it and he said choose the highest order bond
 first, which mostly works but I think can be ambiguous for a few
 rare cases.

I don't recall any. The decision about which bond to follow first at a
branch is really the big one.
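
To make the branch-ordering point concrete, here is a minimal sketch in Python using only the standard library. It is a toy depth-first writer for acyclic molecules, not the RDKit, Daylight, or OpenBabel algorithm; the function write_smiles and its branch_key parameter are made up for illustration. Both orderings below are deterministic, yet they produce different strings for acetic acid, which is the same kind of difference as CC(O)=O versus CC(=O)O in the aspirin strings from the original question.

```python
# Toy acyclic SMILES writer (illustrative assumption, not any toolkit's algorithm).
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}


def write_smiles(symbols, bonds, start, branch_key):
    """Write a SMILES-like string by depth-first traversal.

    symbols:    list of element symbols, indexed by atom
    bonds:      dict mapping (i, j) -> bond order
    start:      index of the atom to start from
    branch_key: sort key on (neighbour, bond_order) deciding visit order
    """
    neighbours = {i: [] for i in range(len(symbols))}
    for (i, j), order in bonds.items():
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    def visit(atom, parent):
        out = symbols[atom]
        children = [(n, o) for n, o in neighbours[atom] if n != parent]
        children.sort(key=branch_key)
        for idx, (nbr, order) in enumerate(children):
            piece = BOND_SYMBOL[order] + visit(nbr, atom)
            # Every child except the last is written as a parenthesised branch.
            if idx == len(children) - 1:
                out += piece
            else:
                out += "(" + piece + ")"
        return out

    return visit(start, None)


# Acetic acid, hydrogens implicit: atom 1 is the carboxyl carbon.
symbols = ["C", "C", "O", "O"]
bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1}

# Rule A: follow the highest-order bond first. Rule B: lowest-order first.
high_first = write_smiles(symbols, bonds, 0, lambda c: -c[1])
low_first = write_smiles(symbols, bonds, 0, lambda c: c[1])

print(high_first)  # CC(=O)O
print(low_first)   # CC(O)=O
```

Either rule gives a repeatable answer for a given molecule, so each is "canonical" within its own implementation; they just do not agree with each other.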

 There may be other under-specified aspects. I haven't looked at the
 paper in 10 years.

Stereochemistry is one that immediately comes to mind.

-greg



[Rdkit-discuss] Canonical SMILES

2009-02-13 Thread George Oakman

Hi all,
 
I am very new to the RDKit and am in the process of running a few tests to
understand how things are working.

One of the first examples I have been playing with is the canonical SMILES for
Aspirin. This is the piece of code I put together:
 
  RWMol *mol=new RWMol();
  //Atoms for Aspirin
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(8));
  mol->addAtom(new Atom(8));
  mol->addAtom(new Atom(8));
  mol->addAtom(new Atom(6));
  mol->addAtom(new Atom(8));
  mol->addAtom(new Atom(6));
  //Bonds for Aspirin
  mol->addBond(0,1,Bond::DOUBLE);
  mol->addBond(1,2,Bond::SINGLE);
  mol->addBond(2,3,Bond::DOUBLE);
  mol->addBond(3,4,Bond::SINGLE);
  mol->addBond(4,5,Bond::DOUBLE);
  mol->addBond(5,0,Bond::SINGLE);
  mol->addBond(5,6,Bond::SINGLE);
  mol->addBond(6,7,Bond::SINGLE);
  mol->addBond(6,8,Bond::DOUBLE);
  mol->addBond(4,9,Bond::SINGLE);
  mol->addBond(9,10,Bond::SINGLE);
  mol->addBond(10,11,Bond::DOUBLE);
  mol->addBond(10,12,Bond::SINGLE);
  RDKit::MolOps::sanitizeMol(*mol);
  std::string smiles;
  smiles = MolToSmiles(*(static_cast<ROMol *>(mol)),true);
  BOOST_LOG(rdInfoLog) << "CANONICAL SMILES FOR ASPIRIN: " << smiles << std::endl;

This gave me the following result:
 
  CC(Oc1c1C(O)=O)=O
 
But I was expecting 
 
  CC(=O)Oc1c1C(=O)O)
 
In addition to being new to the RDKit, I'm also new to Cheminformatics in 
general, so my question may be silly, but I assumed the canonical SMILES for a 
given molecule is unique and was surprised to get a different SMILES to the one 
given in my textbook.
 
I would be very grateful if someone could help me understand why, as I am sure 
there's a very good explanation for this.
 
Many thanks for your help,
 
George.
 
 
 
 
 

Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread Andrew Dalke

On Feb 13, 2009, at 6:20 PM, George Oakman wrote:
One of the first examples I have been playing with is the canonical
SMILES for Aspirin.

..


This gave me the following result:

  CC(Oc1c1C(O)=O)=O

But I was expecting

  CC(=O)Oc1c1C(=O)O)


The canonical SMILES is canonical only in the context of an
algorithm. The Daylight algorithm is different from the RDKit one, which is
different from the OpenBabel one, and so on. In fact, the
Daylight algorithm has changed over time to fix various problems.


When that happens, the molecules need to be re-canonicalized.

Even if you go back to the original Weininger paper, there are
ambiguities in the description which make the result
implementation-specific.


Is the book you're using Molecular Design by Gisbert Schneider and
Karl-Heinz Baringhaus? That came up when I searched for canonical
SMILES and I see it has an example of aspirin with your expected SMILES.



Andrew
da...@dalkescientific.com