Re: [Rdkit-discuss] canonical SMILES of a fragment

2017-08-01 Thread Pavel Polishchuk

Thanks Greg!

  I found an alternative solution, which is also not so straightforward. 
I set an isotope label on the aromatic atoms, generate isomeric SMILES, and 
apply a regex replacement.


  But your suggestion to switch off implicit hydrogens is important, since 
they can cause other ambiguities.



import re

from rdkit import Chem
from rdkit.Chem import Atom, RWMol

m = RWMol()

# three carbons without implicit Hs, then one dummy atom
for i in range(3):
    a = Atom(6)
    a.SetNoImplicit(True)  # remove implicit Hs
    m.AddAtom(a)
a = Atom(0)
m.AddAtom(a)

m.GetAtomWithIdx(0).SetIsAromatic(True)  # set aromatic
m.GetAtomWithIdx(0).SetIsotope(42)   # set isotope

m.GetAtomWithIdx(3).SetAtomMapNum(1)

m.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

s = Chem.MolToSmiles(m, isomericSmiles=True)

re.sub(r'\[[0-9]+([a-z]+)H?[0-9]?\]', r'\1', s)  # remove isotope in output SMILES


OUTPUT: 'CC(c)[*:1]'
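
The regex step can be checked on its own, without building a molecule. The sketch below is only an illustration: the helper name `strip_aromatic_isotopes` and the input strings are made up for this example and are not RDKit output. Note that the pattern also drops any H count written inside the bracket.

```python
import re

# Bracket atom with an isotope label and a lowercase (aromatic) symbol,
# optionally followed by H and a single digit, e.g. [42c] or [42cH].
ISOTOPE_AROMATIC = re.compile(r'\[[0-9]+([a-z]+)H?[0-9]?\]')


def strip_aromatic_isotopes(smiles):
    """Replace isotope-labelled aromatic atoms by the bare atom symbol."""
    return ISOTOPE_AROMATIC.sub(r'\1', smiles)


# Hypothetical inputs, written by hand for illustration.
print(strip_aromatic_isotopes('CC([42c])[*:1]'))
# Aliphatic (uppercase) isotope labels and map numbers are left alone.
print(strip_aromatic_isotopes('[13CH3]C([42cH])[*:1]'))
```

Because only lowercase symbols are matched, a genuine isotope label on an aliphatic atom survives, which is presumably why the label is put on the aromatic atoms only.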

Pavel.




On 08/02/2017 06:24 AM, Greg Landrum wrote:

Hi Pavel,

It is, unfortunately, not that easy.
The canonicalization algorithm does not use atomic aromaticity when 
determining atom ordering, so as far as it is concerned there is no 
difference between atoms 0 and 2 in either of your examples. What does 
get used is the number of hydrogens, so you need to use that in order 
to get the results you are looking for.[1] For technical reasons, you 
also need to tell the RDKit that the atoms should not have implicit Hs 
attached. Here's a gist that works for me: 
https://gist.github.com/greglandrum/f4e2f2f2ad311560d8ab36874d503843


Two notes:
 1) I don't set the number of Hs on atom 1 in that gist, but I would 
suggest doing that too.
 2) If atoms 0 and 2 have the same number of Hs attached, this still 
is not going to work if you're building things from fragments. The 
canonicalization code was not really designed to be used in situations 
like this.


-greg
[1] The details of the canonicalization algorithm, including the 
contents of the atom invariants, are described here: 
http://dx.doi.org/10.1021/acs.jcim.5b00543



On Tue, Aug 1, 2017 at 2:53 PM, Pavel Polishchuk wrote:


Hi all,

  canonicalization of fragment SMILES does not work properly.
Below there are two examples of identical fragments. The only
difference is the order of atoms (indices). However, it seems that
RDKit canonicalization does not take into account atom types.

  Does anyone have an idea how to solve this issue at little cost?

#1 ===

from rdkit import Chem
from rdkit.Chem import Atom, RWMol

m = RWMol()

for i in range(3):
    a = Atom(6)
    m.AddAtom(a)
a = Atom(0)
m.AddAtom(a)

m.GetAtomWithIdx(0).SetIsAromatic(True)  # set atom 0 as aromatic
m.GetAtomWithIdx(3).SetAtomMapNum(1)


m.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

Chem.MolToSmiles(m)

OUTPUT: 'cC(C)[*:1]'

#2 ===

m2 = RWMol()

for i in range(3):
    a = Atom(6)
    m2.AddAtom(a)
a = Atom(0)
m2.AddAtom(a)

m2.GetAtomWithIdx(2).SetIsAromatic(True) # set atom 2 as aromatic
m2.GetAtomWithIdx(3).SetAtomMapNum(1)


m2.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
m2.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
m2.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)

Chem.MolToSmiles(m2)

OUTPUT: 'CC(c)[*:1]'


Pavel.


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net

https://lists.sourceforge.net/lists/listinfo/rdkit-discuss







Re: [Rdkit-discuss] canonical SMILES of a fragment

2017-08-01 Thread Greg Landrum
Hi Pavel,

It is, unfortunately, not that easy.
The canonicalization algorithm does not use atomic aromaticity when
determining atom ordering, so as far as it is concerned there is no
difference between atoms 0 and 2 in either of your examples. What does get
used is the number of hydrogens, so you need to use that in order to get
the results you are looking for.[1] For technical reasons, you also need to
tell the RDKit that the atoms should not have implicit Hs attached. Here's
a gist that works for me:
https://gist.github.com/greglandrum/f4e2f2f2ad311560d8ab36874d503843

Two notes:
 1) I don't set the number of Hs on atom 1 in that gist, but I would
suggest doing that too.
 2) If atoms 0 and 2 have the same number of Hs attached, this still is not
going to work if you're building things from fragments. The
canonicalization code was not really designed to be used in situations like
this.
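
As a toy illustration of the point about invariants (this is not RDKit's implementation; the invariant tuple and the `ToyAtom` type are simplified assumptions made up for this sketch), two terminal carbons that differ only in an aromatic flag tie when the flag is not part of the invariant, and stop tying once they carry different hydrogen counts:

```python
from collections import namedtuple

# Simplified stand-in for an atom; the fields are chosen for illustration only.
ToyAtom = namedtuple('ToyAtom', 'atomic_num degree num_hs is_aromatic')


def toy_invariant(atom):
    """Toy atom invariant that ignores the aromatic flag, as described above."""
    return (atom.degree, atom.atomic_num, atom.num_hs)


# Atoms 0 and 2 of the fragment: both terminal carbons, one flagged aromatic.
without_hs = [ToyAtom(6, 1, 0, True), ToyAtom(6, 1, 0, False)]
tie = toy_invariant(without_hs[0]) == toy_invariant(without_hs[1])
print(tie)  # the two atoms cannot be told apart

# Hypothetical H counts: 0 on the aromatic carbon, 3 on the methyl carbon.
with_hs = [ToyAtom(6, 1, 0, True), ToyAtom(6, 1, 3, False)]
still_tie = toy_invariant(with_hs[0]) == toy_invariant(with_hs[1])
print(still_tie)  # the H count now distinguishes them
```

The real invariants are described in the paper cited in [1]; the sketch only shows why setting the H counts changes the outcome while setting the aromatic flag does not.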

-greg
[1] The details of the canonicalization algorithm, including the contents
of the atom invariants, are described here:
http://dx.doi.org/10.1021/acs.jcim.5b00543


On Tue, Aug 1, 2017 at 2:53 PM, Pavel Polishchuk 
wrote:

> Hi all,
>
>   canonicalization of fragment SMILES does not work properly. Below there
> are two examples of identical fragments. The only difference is the order
> of atoms (indices). However, it seems that RDKit canonicalization does not
> take into account atom types.
>
>   Does someone have an idea how to solve this issue with small losses?
>
> #1 ===
>
> m = RWMol()
>
> for i in range(3):
>     a = Atom(6)
>     m.AddAtom(a)
> a = Atom(0)
> m.AddAtom(a)
>
> m.GetAtomWithIdx(0).SetIsAromatic(True)  # set atom 0 as aromatic
> m.GetAtomWithIdx(3).SetAtomMapNum(1)
>
>
> m.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
> m.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
> m.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)
>
> Chem.MolToSmiles(m)
>
> OUTPUT: 'cC(C)[*:1]'
>
> #2 ===
>
> m2 = RWMol()
>
> for i in range(3):
>     a = Atom(6)
>     m2.AddAtom(a)
> a = Atom(0)
> m2.AddAtom(a)
>
> m2.GetAtomWithIdx(2).SetIsAromatic(True) # set atom 2 as aromatic
> m2.GetAtomWithIdx(3).SetAtomMapNum(1)
>
>
> m2.AddBond(0, 1, Chem.rdchem.BondType.SINGLE)
> m2.AddBond(1, 2, Chem.rdchem.BondType.SINGLE)
> m2.AddBond(1, 3, Chem.rdchem.BondType.SINGLE)
>
> Chem.MolToSmiles(m2)
>
> OUTPUT: 'CC(c)[*:1]'
>
>
> Pavel.
>
> 
>


Re: [Rdkit-discuss] canonical smiles for fragments with map numbers

2017-05-27 Thread Pavel Polishchuk

Thank you, Brian!

Actually what I expected as output:

S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]
S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]
and so on

You gave me the right direction. I can store the old-to-new map numbers in 
a dict, and after relabeling and generating the canonical SMILES it will be 
easy to relabel the attachment points back.
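
The dict idea can be sketched at the string level as follows. This is only an illustration under stated assumptions: the helper names `relabel_maps` and `restore_maps` are hypothetical, each map number is assumed to occur once, and the relabeling here follows the order of appearance in the string rather than anything RDKit computes.

```python
import re

# Matches the ':N]' tail of a mapped bracket atom such as [*:3].
MAP_NUM = re.compile(r':(\d+)\]')


def relabel_maps(smiles):
    """Renumber map labels 1..N by order of appearance in the string.

    Returns the relabelled string and a dict {new_label: old_label}.
    """
    new_to_old = {}

    def _renumber(match):
        new = len(new_to_old) + 1
        new_to_old[new] = int(match.group(1))
        return ':%d]' % new

    return MAP_NUM.sub(_renumber, smiles), new_to_old


def restore_maps(smiles, new_to_old):
    """Put the original labels back using the stored dict."""
    return MAP_NUM.sub(lambda m: ':%d]' % new_to_old[int(m.group(1))], smiles)


original = 'S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]'
relabelled, mapping = relabel_maps(original)
print(relabelled)
print(mapping)
print(restore_maps(relabelled, mapping) == original)
```

All labels are swapped in a single regex pass, so exchanging 1 and 2 does not overwrite one label with the other halfway through.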

Thank you again!

Pavel.

On 05/27/2017 03:03 PM, Brian Kelley wrote:
Pavel, this isn't exactly trivial so I went ahead and made an 
example.  The basics are that atomMaps are canonicalized, i.e. their 
value is used in the generation of smiles.


To solve this problem:
1) backup the atom maps and remove them
2) canonicalize *without* atom maps but figure out the order in which 
the atoms in the molecule are output
3) using the atom output order, relabel the atom maps based on output 
order.


That's a mouthful, but here's some code that should do the trick:

from rdkit import Chem

smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
   "ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
   "ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
   "ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
   "ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
   "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]


def CanonicalizeMaps(m, *a, **kw):
    # atom maps are canonicalized, so rename them
    #  figure out where they would have gone
    #  and relabel from 1...N based on output order
    atomMap = "molAtomMapNumber"
    backupAtomMap = "oldMolAtomMapNumber"
    for atom in m.GetAtoms():
        if atom.HasProp(atomMap):
            atomNum = atom.GetProp(atomMap)
            atom.SetProp(backupAtomMap, atomNum)
            atom.ClearProp(atomMap)

    # canonicalize
    smi = Chem.MolToSmiles(m, *a, **kw)
    # where did the atoms end up in the output string?
    atoms = [(pos, atom_idx) for atom_idx, pos in enumerate(
        eval(m.GetProp("_smilesAtomOutputOrder")))]
    atommap = 1
    atoms.sort()

    # set the new atommap based on output position
    for pos, atom_idx in atoms:
        atom = m.GetAtomWithIdx(atom_idx)
        if atom.HasProp(backupAtomMap):
            atom.SetProp(atomMap, str(atommap))
            atommap += 1
    return Chem.MolToSmiles(m)


for s in smi:
    m = Chem.MolFromSmiles(s)
    print(CanonicalizeMaps(m, True))



Output:

S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]

Now, if you want the atomMaps in 1...2...3 output order, we could do 
that as well, but it is even trickier.


Enjoy,
 Brian

On Sat, May 27, 2017 at 8:36 AM, Pavel Polishchuk wrote:


Hi,

  I cannot solve an issue and would like to ask for advice.
  If the attachment points of the same fragment carry different map
numbers, different canonical SMILES are generated.
  I have observed this behavior only for fragments with 3 attachment
points. Below is an example.
  I'm looking for a solution or workaround that produces the "same"
SMILES strings irrespective of the mapping, so that after removal of
the map numbers the SMILES become identical.
  Any advice would be appreciated.

smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
   "ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
   "ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
   "ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
   "ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
   "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]

for s in smi:
print(Chem.MolToSmiles(Chem.MolFromSmiles(s)))

output:
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]
S=c1c([*:1])c([*:3])[nH]c(Cl)c1[*:2]
S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]
S=c1c([*:1])c([*:2])[nH]c(Cl)c1[*:3]
S=c1c([*:2])c([*:1])[nH]c(Cl)c1[*:3]
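
The inconsistency can be made visible by stripping the map numbers from the six output strings above and counting what is left. This is a small stdlib-only sketch; the helper name `strip_map_numbers` is made up for this example.

```python
import re

# Output strings copied from the message above.
outputs = [
    "S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]",
    "S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]",
    "S=c1c([*:1])c([*:3])[nH]c(Cl)c1[*:2]",
    "S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]",
    "S=c1c([*:1])c([*:2])[nH]c(Cl)c1[*:3]",
    "S=c1c([*:2])c([*:1])[nH]c(Cl)c1[*:3]",
]


def strip_map_numbers(smiles):
    """Turn [*:1] into [*] by deleting the ':N' part of bracket atoms."""
    return re.sub(r':\d+\]', ']', smiles)


unique = sorted(set(strip_map_numbers(s) for s in outputs))
print(len(unique))
for s in unique:
    print(s)
```

Two different strings remain for what is meant to be one fragment, so a plain string comparison after removing the map numbers would treat it as two.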

Kind regards,
Pavel.









Re: [Rdkit-discuss] canonical smiles for fragments with map numbers

2017-05-27 Thread Brian Kelley
Pavel, this isn't exactly trivial so I went ahead and made an example.  The
basics are that atomMaps are canonicalized, i.e. their value is used in the
generation of smiles.

To solve this problem:
1) backup the atom maps and remove them
2) canonicalize *without* atom maps but figure out the order in which the
atoms in the molecule are output
3) using the atom output order, relabel the atom maps based on output order.

That's a mouthful, but here's some code that should do the trick:

from rdkit import Chem

smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
   "ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
   "ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
   "ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
   "ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
   "ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]


def CanonicalizeMaps(m, *a, **kw):
    # atom maps are canonicalized, so rename them
    #  figure out where they would have gone
    #  and relabel from 1...N based on output order
    atomMap = "molAtomMapNumber"
    backupAtomMap = "oldMolAtomMapNumber"

    for atom in m.GetAtoms():
        if atom.HasProp(atomMap):
            atomNum = atom.GetProp(atomMap)
            atom.SetProp(backupAtomMap, atomNum)
            atom.ClearProp(atomMap)

    # canonicalize
    smi = Chem.MolToSmiles(m, *a, **kw)
    # where did the atoms end up in the output string?
    atoms = [(pos, atom_idx) for atom_idx, pos in enumerate(
        eval(m.GetProp("_smilesAtomOutputOrder")))]
    atommap = 1
    atoms.sort()

    # set the new atommap based on output position
    for pos, atom_idx in atoms:
        atom = m.GetAtomWithIdx(atom_idx)
        if atom.HasProp(backupAtomMap):
            atom.SetProp(atomMap, str(atommap))
            atommap += 1

    return Chem.MolToSmiles(m)


for s in smi:
    m = Chem.MolFromSmiles(s)
    print(CanonicalizeMaps(m, True))



Output:

S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]

Now, if you want the atomMaps in 1...2...3 output order, we could do that
as well, but it is even trickier.

Enjoy,
 Brian

On Sat, May 27, 2017 at 8:36 AM, Pavel Polishchuk 
wrote:

> Hi,
>
>   I cannot solve an issue and would like to ask for an advice.
>   If there are different map numbers for attachment points for the same
> fragment different canonical smiles are generated.
>   I observed such behavior only for fragments with 3 attachment points.
> Below is an example.
>   I'm looking for a solution/workaround how to produce the "same" smiles
> strings irrespectively of mapping that after removal of map numbers smiles
> will become identical.
>   Any advice would be appreciated.
>
> smi = ["ClC1=C([*:1])C(=S)C([*:2])=C([*:3])N1",
>"ClC1=C([*:1])C(=S)C([*:3])=C([*:2])N1",
>"ClC1=C([*:2])C(=S)C([*:1])=C([*:3])N1",
>"ClC1=C([*:2])C(=S)C([*:3])=C([*:1])N1",
>"ClC1=C([*:3])C(=S)C([*:1])=C([*:2])N1",
>"ClC1=C([*:3])C(=S)C([*:2])=C([*:1])N1"]
>
> for s in smi:
> print(Chem.MolToSmiles(Chem.MolFromSmiles(s)))
>
> output:
> S=c1c([*:1])c(Cl)[nH]c([*:3])c1[*:2]
> S=c1c([*:1])c(Cl)[nH]c([*:2])c1[*:3]
> S=c1c([*:1])c([*:3])[nH]c(Cl)c1[*:2]
> S=c1c([*:2])c(Cl)[nH]c([*:1])c1[*:3]
> S=c1c([*:1])c([*:2])[nH]c(Cl)c1[*:3]
> S=c1c([*:2])c([*:1])[nH]c(Cl)c1[*:3]
>
> Kind regards,
> Pavel.
>
> 
>


Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2011-01-04 Thread Greg Landrum
James,

On Tue, Jan 4, 2011 at 6:29 PM, James Davidson  wrote:
>
>> It would be *really* useful to have some more real-world
>> cases like this one to use as tests. So if you happen to have
>> others you can send I would be quite happy to have them.
>
> On that note, I have added a comment to the bug tracker
> (https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=16013
> 9&atid=814650) - but was not sure how to attach a file (eg sdf) there,
> so apologies for it ending up on more lines than I intended...  Also, I
> logged in with my google account, but it looks like it may not be clear
> who it is!

Thanks for these. I just added a couple of initial tests based on
them. I will try to find the time to make them a bit more
comprehensive in the next couple of days.

> The first two examples are two marine natural products that only differ
> in the geometry of the double bond in the medium ring.  The final
> example is a cis- analogue that I synthesised during my PhD for which a
> crystal structure was also obtained.  The stereochemistry in these
> systems is 'challenging' to say the least, so I thought they would make
> reasonable test cases.  I should say that even for the cis- double bond
> cases, RDKit does a rather ugly job of the 2D depiction - but I am not
> sure if other depictors will perform much better...

Yeah, I'm afraid it's not going to do a reasonable job with the
depiction of natural products. Most depictors (including many human
ones) have trouble getting these rendered well.

> On a related note, I was keen to manually double-check the
> stereochemistry that had been assigned to each of the chiral centres
> (particularly the ones involving the 9-5 ring connections - as these are
> potentially troublesome), and found myself wishing there was a way to
> easily label a 2D depiction of the molecules with the atom ID.  What I
> ended-up doing was the following:
>
> 1.  Getting the R/S info + atomIdx back from RDKit (example output):
> >>> Chem.FindMolChiralCenters(mol)
> [(3, 'R'), (7, 'R'), (8, 'S'), (9, 'R'), (11, 'R'), (18, 'R'), (24,
> 'R')]
> 2.  Opening the molfile in a program where I know how to label with atom
> IDs (pymol)
> 3.  Check which atom is which manually (had to add 1 to the RDKit
> atomIdx values as they start at 0) then double-check with reference
> values.
>
> RDKit performed admirably - but I presume this is dependent on the
> quality of the wedge info coming in from the SDF(?)

If the data are read from an SDF, yes: the initial stereochem
information comes from the SDF. If you have a 3D SD file, you can also
have the RDKit ignore bond wedging and assign chirality based purely
on coordinates.
R/S assignments are done in a later step; it's always nice to hear
that those are correct.

For what it's worth: I tend to use Marvin Sketch for the "drawing
molecules with atom indices to check up on stereochemistry" task. It
will also assign absolute stereochem to atoms and bonds (usually
correctly), so it's a useful check there too.

Best regards,
-greg



Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2011-01-04 Thread James Davidson
Hi Greg,

> On Sat, Dec 18, 2010 at 6:27 AM, Greg Landrum 
>  wrote:
> 
> I just checked in a set of changes that should get this 
> (mostly) working correctly. Here's a demonstration with Geldanamycin:
> 
> In [7]: 
> smi=r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C
> )C\C2=C(/OC)C(=O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'
> 
> In [8]: print Chem.CanonSmiles(smi)
> COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)/C=C(\C)[...@h](OC(N
> )=O)[C@@H](OC)/C=C\C=C(/C)C(=O)NC(=CC1=O)C2=O

Thanks for looking into this so quickly!

> It would be *really* useful to have some more real-world 
> cases like this one to use as tests. So if you happen to have 
> others you can send I would be quite happy to have them.

On that note, I have added a comment to the bug tracker
(https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=16013
9&atid=814650) - but was not sure how to attach a file (eg sdf) there,
so apologies for it ending up on more lines than I intended...  Also, I
logged in with my google account, but it looks like it may not be clear
who it is!

The first two examples are two marine natural products that only differ
in the geometry of the double bond in the medium ring.  The final
example is a cis- analogue that I synthesised during my PhD for which a
crystal structure was also obtained.  The stereochemistry in these
systems is 'challenging' to say the least, so I thought they would make
reasonable test cases.  I should say that even for the cis- double bond
cases, RDKit does a rather ugly job of the 2D depiction - but I am not
sure if other depictors will perform much better...

On a related note, I was keen to manually double-check the
stereochemistry that had been assigned to each of the chiral centres
(particularly the ones involving the 9-5 ring connections - as these are
potentially troublesome), and found myself wishing there was a way to
easily label a 2D depiction of the molecules with the atom ID.  What I
ended-up doing was the following:

1.  Getting the R/S info + atomIdx back from RDKit (example output):
>>> Chem.FindMolChiralCenters(mol)
[(3, 'R'), (7, 'R'), (8, 'S'), (9, 'R'), (11, 'R'), (18, 'R'), (24,
'R')]
2.  Opening the molfile in a program where I know how to label with atom
IDs (pymol)
3.  Check which atom is which manually (had to add 1 to the RDKit
atomIdx values as they start at 0) then double-check with reference
values.
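
The index shift in step 3 can be done in code rather than by hand. A minimal sketch, using the example output quoted in step 1; the helper name `to_one_based` is made up for this illustration, and the 1-based numbering is the one described above for the viewer.

```python
# Example output copied from step 1: (0-based RDKit atom index, R/S label).
centers = [(3, 'R'), (7, 'R'), (8, 'S'), (9, 'R'), (11, 'R'), (18, 'R'),
           (24, 'R')]


def to_one_based(chiral_centers):
    """Shift 0-based atom indices to the 1-based IDs shown in the viewer."""
    return [(idx + 1, label) for idx, label in chiral_centers]


for atom_id, label in to_one_based(centers):
    print(atom_id, label)
```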

RDKit performed admirably - but I presume this is dependent on the
quality of the wedge info coming in from the SDF(?)

Kind regards

James



Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2010-12-28 Thread Greg Landrum
On Sat, Dec 18, 2010 at 6:27 AM, Greg Landrum  wrote:
>
> >  For 'classic' aliphatic systems, double-bonds in
> > 3-7-membered rings can only sensibly exist in the cis orientation, so
> > 'ignoring' them would be ok.  However, for 8-membered and above, cis or
> > trans are certainly both possible, so it becomes more important to keep
> > track - particularly if canonical smiles are being used to check for
> > unique structures, as my colleague was doing with the geldanamycin
> > example above.
>
> yeah, that's clear: for larger ring systems the information should be
> preserved. That's very easy to do. The more difficult part is going to
> be making sure the output is actually canonical. I've entered a bug
> for this 
> (https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=160139&atid=814650)
> and I'll take a look to try and get it fixed (and correct).

I just checked in a set of changes that should get this (mostly)
working correctly. Here's a demonstration with Geldanamycin:

In [7]: 
smi=r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C)C\C2=C(/OC)C(=O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'

In [8]: print Chem.CanonSmiles(smi)
COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)/C=C(\C)[...@h](OC(N)=O)[C@@H](OC)/C=C\C=C(/C)C(=O)NC(=CC1=O)C2=O

At least according to Marvin, those two structures are the same.

One very important caveat: I have not modified the depiction code to
generate correct coordinates for trans bonds in cycles. All
coordinates for ring systems still have all cis bonds. This has an
impact if you write an SD (or mol) file : the stereochemistry captured
in that file will be incorrect. I've entered a bug report for this
(https://sourceforge.net/tracker/?func=detail&aid=3147014&group_id=160139&atid=814650)
so that it doesn't get lost, but I suspect this is going to be a tough
one to fix, and I am not at all sure when it will be done.

It would be *really* useful to have some more real-world cases like
this one to use as tests. So if you happen to have others you can send
I would be quite happy to have them.

Best Regards,
-greg



Re: [Rdkit-discuss] Canonical smiles for medium and large rings?

2010-12-17 Thread Greg Landrum
Dear James,

On Fri, Dec 17, 2010 at 5:35 PM, James Davidson  wrote:
>
> I have been investigating an issue that a colleague of mine identified.
> He was working with the RDKit Canon Smiles node in Knime, and found that
> for the natural product, Geldanamycin, the double-bond geometry
> information was being lost during canonicalisation.  I repeated this
> result outside of knime:
>
> from rdkit import Chem
> from rdkit.Chem import AllChem
>
> >>> smi =
> r'NC(=O)o...@h]1c(/C)=C/[...@h](C)[C@@H](O)[C@@H](OC)c...@h](C)C\C2=C(/OC)C(
> =O)\C=C(\NC(=O)C(\C)=C\C=C/[C@@H]1OC)C2=O'
> >>> AllChem.CanonSmiles(smi)
>
> 'COC1=C2C[C@@H](C)c...@h](OC)[...@h](O)[C@@H](C)C=C(C)[...@h](OC(N)=O)[C@@H](
> OC)C=CC=C(C)C(=O)NC(=CC1=O)C2=O'
>
>
> The simpler example below may be better:
>
> >>> smi1 = r'O1CC/C=C\1' # cyclic ether
> >>> smi2 = r'OCC/C=C\' # corresponding acyclic alcohol
>
> >>> AllChem.CanonSmiles(smi1)
> 'C1C=CCCOCCC1' -> stereochemistry lost
> >>> AllChem.CanonSmiles(smi2)
> '/C=C\\CCO' -> stereochemistry retained
>
> So, I am guessing that double-bonds in rings are being 'ignored'(?) by
> the canonicaliser?

It's actually being done by the molecule cleanup code that is run when
a molecule is read. The result is, as far as you're concerned, the
same though: there's no stereochemistry on ring double bonds.

>  For 'classic' aliphatic systems, double-bonds in
> 3-7-membered rings can only sensibly exist in the cis orientation, so
> 'ignoring' them would be ok.  However, for 8-membered and above, cis or
> trans are certainly both possible, so it becomes more important to keep
> track - particularly if canonical smiles are being used to check for
> unique structures, as my colleague was doing with the geldanamycin
> example above.

yeah, that's clear: for larger ring systems the information should be
preserved. That's very easy to do. The more difficult part is going to
be making sure the output is actually canonical. I've entered a bug
for this 
(https://sourceforge.net/tracker/?func=detail&aid=3139534&group_id=160139&atid=814650)
and I'll take a look to try and get it fixed (and correct).
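
The ring-size rule from the quoted paragraph can be written down as a small predicate. This is a sketch of the rule as stated in the discussion, not of what the RDKit cleanup code does: the threshold of 8 comes from James's description, and the function name and the choice to look at the smallest ring containing the bond are assumptions made for illustration.

```python
# Threshold taken from the discussion above: double bonds in rings of 3-7
# atoms are treated as cis-only, rings of 8 or more atoms allow cis or trans.
MIN_RING_SIZE_FOR_TRANS = 8


def keep_double_bond_stereo(ring_sizes):
    """Decide whether cis/trans information on a double bond should be kept.

    ring_sizes: sizes of all rings that contain the bond (empty if acyclic).
    """
    if not ring_sizes:
        return True  # acyclic double bond: stereo always matters
    # Assumption: the smallest ring containing the bond is the limiting one.
    return min(ring_sizes) >= MIN_RING_SIZE_FOR_TRANS


print(keep_double_bond_stereo([]))     # acyclic
print(keep_double_bond_stereo([6]))    # cyclohexene-like
print(keep_double_bond_stereo([19]))   # macrocycle
```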

It would be helpful to have some additional test cases; I will
generate some, but if you have some examples you could send (or attach
to the bug report) it would be quite helpful.

Thanks for the report,
-greg



Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Greg Landrum
On Tue, Feb 17, 2009 at 11:17 AM, Noel O'Boyle  wrote:
> 2009/2/17 Andrew Dalke :
>> On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
>>> Does someone know if I can assume that the canonical SMILES of
>>> RDKit are the same as the Open Babel ones?
>
> You can assume they are not the same. No attempt has been made to make
> them consistent.

Correct, it would require an extremely long series of coincidences to
end up with two canonical SMILES implementations that produce
identical output.

>>
>>> Am I doing something wrong in responding to the mailing list, it
>>> looks like all my answers are logged as a separate message as
>>> oposed to being logged in the same thread - please let me know, I
>>> don't want to make it all untidy!
>>
>> I don't use a threaded mail reader so I can't tell.
> I use Gmail and everything is nicely threaded.

ditto: things look fine in gmail.

-greg



Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Greg Landrum
On Fri, Feb 13, 2009 at 11:21 PM, Andrew Dalke
 wrote:
> On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
>> Yes, InChI is unique across different packages.  This is because
>> there is one definitive source for the code and algorithm.  This was
>> a design goal of InChI.
>
>
> Or to twist TJ's words around .. it's exactly the same as with
> canonical SMILES - every implementation of InChI does it a different
> way. It's just that there's only one InChI implementation.

And since IUPAC has not only done an open implementation with a
reasonable license, but also trademarked the name and placed the
restriction on its use that you can't call it InChI unless you pass
their validation suite, InChI will hopefully remain a "portable"
canonical identifier.

>> in this case probably to do with which branch to deal with first)
>
>
> As I recall when trying to implement the algorithm, the ambiguity is
> in dealing with ties. The algorithm assigns a unique ordering to the
> atoms, up to symmetry, but it's defined at the atom level. Given an
> atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be
> in the same symmetry class, but with different bond types going to B1
> and B2.
>
> I asked Weininger about it and he said "choose the highest order bond
> first", which mostly works but I think can be ambiguous for a few
> rare cases.

I don't recall any. The decision about which bond to follow first at a
branch is really the big one.
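
The tie-breaking rule quoted above ("choose the highest order bond first") can be sketched as a sort key. This is only an illustration of that one rule, not the Daylight or RDKit algorithm; the tuple layout and the function name `order_branches` are assumptions made up for this example.

```python
def order_branches(neighbors):
    """Order the neighbors of a branch atom for traversal.

    neighbors: list of (symmetry_class, bond_order, atom_index) tuples.
    Lower symmetry class goes first; within a class, the higher bond order
    goes first. The atom index is deliberately not part of the key.
    """
    return sorted(neighbors, key=lambda n: (n[0], -n[1]))


# B1 and B2 share a symmetry class but are reached by different bond types.
print(order_branches([(5, 1, 1), (5, 2, 2)]))

# With equal bond orders the rule says nothing: Python's stable sort keeps
# the input order, so the result depends on how the atoms were numbered.
print(order_branches([(5, 1, 2), (5, 1, 1)]))
```

The second call shows the kind of remaining ambiguity the thread is talking about: once class and bond order are equal, something else has to decide, or the output is no longer independent of the input order.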

> There may be other under-specified aspects. I haven't looked at the
> paper in 10 years.

Stereochemistry is one that immediately comes to mind.

-greg



Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Noel O'Boyle
2009/2/17 Andrew Dalke :
> On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
>> Does someone know if I can assume that the canonical SMILES of
>> RDKit are the same as the Open Babel ones?

You can assume they are not the same. No attempt has been made to make
them consistent.

> I wouldn't assume that without a lot of testing. My assumption
> is that canonical SMILES generation is so implementation
> sensitive that it's very unlikely two systems would do it the
> same way unless that was a deliberate goal.
>
> Which I know wasn't the case with those two implementations.
>
> I think also that RDKit pays more attention to handling
> stereochemistry than OpenBabel.
>
>> Am I doing something wrong in responding to the mailing list, it
>> looks like all my answers are logged as a separate message as
>> opposed to being logged in the same thread - please let me know, I
>> don't want to make it all untidy!
>
> I don't use a threaded mail reader so I can't tell.
I use Gmail and everything is nicely threaded.

>Andrew
>da...@dalkescientific.com
>
>
>
>



Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Andrew Dalke

On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
> Does someone know if I can assume that the canonical SMILES of
> RDKit are the same as the Open Babel ones?


I wouldn't assume that without a lot of testing. My assumption
is that canonical SMILES generation is so implementation
sensitive that it's very unlikely two systems would do it the
same way unless that was a deliberate goal.

Which I know wasn't the case with those two implementations.

I think also that RDKit pays more attention to handling
stereochemistry than OpenBabel.

> Am I doing something wrong in responding to the mailing list? It
> looks like all my answers are logged as separate messages as
> opposed to being logged in the same thread - please let me know, I
> don't want to make it all untidy!


I don't use a threaded mail reader so I can't tell.

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread George Oakman

Hi,

 

Thank you all very much for all the detailed information, the link to the Dr. 
Dobb's article might become very useful.

 

Does someone know if I can assume that the canonical SMILES of RDKit are the 
same as the Open Babel ones?

 

Am I doing something wrong in responding to the mailing list? It looks like all
my answers are logged as separate messages as opposed to being logged in the
same thread - please let me know, I don't want to make it all untidy!

 

Thanks.

 
> From: da...@dalkescientific.com
> Date: Fri, 13 Feb 2009 23:21:01 +0100
> To: rdkit-discuss@lists.sourceforge.net
> Subject: Re: [Rdkit-discuss] Canonical SMILES
> 
> On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
> > > Yes, InChI is unique across different packages. This is because
> > there is one definitive source for the code and algorithm. This was
> > a design goal of InChI.
> 
> 
> Or to twist TJ's words around .. it's exactly the same as with 
> canonical SMILES - every implementation of InChI does it a different 
> way. It's just that there's only one InChI implementation.
> 
> >> The book I was referring to is An Introduction to 
> >> Chemoinformatics from A.R. Leach and V.J. Gillet. Yes, they refer 
> >> to the CANGEN algorithm and to the Weininger paper you mentioned.
> >> It doesn't matter, as long as I'm aware of the scope of 
> >> 'uniqueness'.
> 
> Then it's an eerie coincidence that Schneider and Baringhaus use 
> exactly the same example, with exactly the same SMILES. ;)
> 
> http://books.google.com/books?id=feNn-JcC1KgC&pg=PA25&lpg=PA25&dq=canonical+SMILES&source=web&ots=CeTadvKPxA&sig=46za2byYVjkOtYM1cs5-xs6Bch0&hl=en&ei=ia2VSbf1FMyL-gbbguWQCQ&sa=X&oi=book_result&resnum=6&ct=result
> 
> 
> > in this case probably to do with which branch to deal with first)
> 
> 
> As I recall when trying to implement the algorithm, the ambiguity is 
> in dealing with ties. The algorithm assigns a unique ordering to the 
> atoms, up to symmetry, but it's defined at the atom level. Given an 
> atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be 
> in the same symmetry class, but with different bond types going to B1 
> and B2.
> 
> I asked Weininger about it and he said "choose the highest order bond 
> first", which mostly works but I think can be ambiguous for a few 
> rare cases.
> 
> There may be other under-specified aspects. I haven't looked at the 
> paper in 10 years.
> 
> Brian Kelley wrote an article about canonicalization, with code, for 
> Dr. Dobb's magazine. It's online at
> http://www.ddj.com/architect/184405341
> 
> The algorithm isn't that hard to implement, and it can be useful (at 
> very rare times) for doing things like canonicalizing SMARTS.
> 
> 
> Andrew
> da...@dalkescientific.com
> 
> 
> 

Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread Andrew Dalke

On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:

> Yes, InChI is unique across different packages.  This is because
> there is one definitive source for the code and algorithm.  This was
> a design goal of InChI.



Or to twist TJ's words around .. it's exactly the same as with  
canonical SMILES - every implementation of InChI does it a different  
way. It's just that there's only one InChI implementation.


>> The book I was referring to is An Introduction to
>> Chemoinformatics from A.R. Leach and V.J. Gillet. Yes, they refer
>> to the CANGEN algorithm and to the Weininger paper you mentioned.
>> It doesn't matter, as long as I'm aware of the scope of
>> 'uniqueness'.


Then it's an eerie coincidence that Schneider and Baringhaus use  
exactly the same example, with exactly the same SMILES. ;)


http://books.google.com/books?id=feNn-JcC1KgC&pg=PA25&lpg=PA25&dq=canonical+SMILES&source=web&ots=CeTadvKPxA&sig=46za2byYVjkOtYM1cs5-xs6Bch0&hl=en&ei=ia2VSbf1FMyL-gbbguWQCQ&sa=X&oi=book_result&resnum=6&ct=result




> in this case probably to do with which branch to deal with first)



As I recall when trying to implement the algorithm, the ambiguity is  
in dealing with ties. The algorithm assigns a unique ordering to the  
atoms, up to symmetry, but it's defined at the atom level. Given an  
atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be  
in the same symmetry class, but with different bond types going to B1  
and B2.


I asked Weininger about it and he said "choose the highest order bond  
first", which mostly works but I think can be ambiguous for a few  
rare cases.
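
To make the tie concrete, here is a minimal sketch in plain Python. It is a toy, not the CANGEN procedure from the Weininger paper and not RDKit's code. It assumes a Kekule benzene ring and a hypothetical helper, symmetry_classes, that refines atom ranks from neighbor ranks while deliberately ignoring bond orders. All six atoms end up in one class, yet atom 0 has a double bond to one neighbor and a single bond to the other, so the atom ordering alone cannot say which one to visit first.

```python
# Toy illustration only: bond orders of a Kekule benzene ring, keyed by atom pair.
BONDS = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (0, 5): 1}


def build_neighbors(bonds):
    """Return {atom: [neighbor, ...]} from a bond table."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    return neighbors


def bond_order(bonds, a, b):
    """Look up a bond order regardless of the order of the two atoms."""
    return bonds.get((a, b)) or bonds[(b, a)]


def symmetry_classes(neighbors):
    """Refine atom ranks from neighbor ranks until no class splits further.

    Hypothetical helper: the starting invariant is just the atom degree,
    and bond orders are deliberately ignored.
    """
    ranks = {atom: len(nbrs) for atom, nbrs in neighbors.items()}
    while True:
        signature = {
            atom: (ranks[atom], tuple(sorted(ranks[n] for n in nbrs)))
            for atom, nbrs in neighbors.items()
        }
        ordered = sorted(set(signature.values()))
        refined = {atom: ordered.index(sig) for atom, sig in signature.items()}
        if len(set(refined.values())) == len(set(ranks.values())):
            return refined
        ranks = refined


neighbors = build_neighbors(BONDS)
classes = symmetry_classes(neighbors)
b1, b2 = sorted(neighbors[0])  # the two neighbors of atom 0

# The tie-break quoted above: follow the highest-order bond first.
first = max((b1, b2), key=lambda n: bond_order(BONDS, 0, n))
```

Here the "highest order bond first" rule resolves the tie cleanly. The sketch says nothing about the rarer cases where that rule is itself ambiguous.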


There may be other under-specified aspects. I haven't looked at the  
paper in 10 years.


Brian Kelley wrote an article about canonicalization, with code, for  
Dr. Dobb's magazine. It's online at

  http://www.ddj.com/architect/184405341

The algorithm isn't that hard to implement, and it can be useful (at  
very rare times) for doing things like canonicalizing SMARTS.



Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread TJ O'Donnell

Hi George

Yes, InChI is unique across different packages.  This is because
there is one definitive source for the code and algorithm.  This was
a design goal of InChI.

TJ O'Donnell

George Oakman wrote:

Hi,
 
Thanks a lot for the speedy response.
 
Yes, this is what I was suspecting - slightly different conventions (in 
this case probably to do with which branch to deal with first) will lead 
to different results.
 
The book I was referring to is An Introduction to Chemoinformatics from 
A.R. Leach and V.J. Gillet. Yes, they refer to the CANGEN algorithm and 
to the Weininger paper you mentioned.
 
It doesn't matter, as long as I'm aware of the scope of 'uniqueness'.
 
Just out of interest, is the InChI representation more 'unique' across
different packages than canonical SMILES?
 
Thanks again,
 
George.
 


 > From: da...@dalkescientific.com
 > Date: Fri, 13 Feb 2009 18:38:21 +0100
 > To: rdkit-discuss@lists.sourceforge.net
 > Subject: Re: [Rdkit-discuss] Canonical SMILES
 >
 > On Feb 13, 2009, at 6:20 PM, George Oakman wrote:
 > > One of the first example I have been playing with is the canonical
 > > SMILES for Aspirin.
 > ..
 > >
 > > This gave me the following result:
 > >
 > > CC(Oc1ccccc1C(O)=O)=O
 > >
 > > But I was expecting
 > >
 > > CC(=O)Oc1ccccc1C(=O)O
 >
 > The canonical SMILES is canonical only on the context of an
 > algorithm. The Daylight algorithm is different than the RDKit one is
 > different from the OpenBabel one is different ... . In fact, the
 > Daylight algorithm has changed over time to fix various problems.
 >
 > When that happens, the molecules need to be re-canonicalized.
 >
 > Even if you go back to the original Weininger paper, there are
 > ambiguities in the description which make the result implementation-
 > specific.
 >
 > Is the book you're using "Molecular Design" by Gisbert Schneider and
 > Karl-Heinz Baringhaus? That came up when I searched for "canonical
 > SMILES" and I see it has example of aspirin with your expected SMILES.
 >
 >
 > Andrew
 > da...@dalkescientific.com
 >
 >
 >




Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread Noel O'Boyle
2009/2/13 George Oakman :
> Hi,
>
> Thanks a lot for the speedy response.
>
> Yes, this is what I was suspecting - slightly different conventions (in this
> case probably to do with which branch to deal with first) will lead to
> different results.
>
> The book I was referring to is An Introduction to Chemoinformatics from A.R.
> Leach and V.J. Gillet. Yes, they refer to the CANGEN algorithm and to the
> Weininger paper you mentioned.
>
> It doesn't matter, as long as I'm aware of the scope of 'uniqueness'.
>
> Just out of interest, is the InChI representation more 'unique' across
> different packages than canonical SMILES?

I like that - "more unique". :-) You got it in one. With InChI, IUPAC
provide the actual code and we (speaking for OpenBabel here) compile
it. That said, you can pass various options to the InChI code to get
different results, and only recently has InChI provided a standard set
of options. (I'm sure I'm glossing over some details here, but this is
how I understand the general picture.) So, in short, InChIs should be
the same across different packages as long as they use the same code
and the same options.

> Thanks again,
>
> George.
>
>
>> From: da...@dalkescientific.com
>> Date: Fri, 13 Feb 2009 18:38:21 +0100
>> To: rdkit-discuss@lists.sourceforge.net
>> Subject: Re: [Rdkit-discuss] Canonical SMILES
>>
>> On Feb 13, 2009, at 6:20 PM, George Oakman wrote:
>> > One of the first example I have been playing with is the canonical
>> > SMILES for Aspirin.
>> ..
>> >
>> > This gave me the following result:
>> >
>> > CC(Oc1ccccc1C(O)=O)=O
>> >
>> > But I was expecting
>> >
>> > CC(=O)Oc1ccccc1C(=O)O
>>
>> The canonical SMILES is canonical only on the context of an
>> algorithm. The Daylight algorithm is different than the RDKit one is
>> different from the OpenBabel one is different ... . In fact, the
>> Daylight algorithm has changed over time to fix various problems.
>>
>> When that happens, the molecules need to be re-canonicalized.
>>
>> Even if you go back to the original Weininger paper, there are
>> ambiguities in the description which make the result implementation-
>> specific.
>>
>> Is the book you're using "Molecular Design" by Gisbert Schneider and
>> Karl-Heinz Baringhaus? That came up when I searched for "canonical
>> SMILES" and I see it has example of aspirin with your expected SMILES.
>>
>>
>> Andrew
>> da...@dalkescientific.com
>>
>>
>>
>>



Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread George Oakman

Hi,
 
Thanks a lot for the speedy response.
 
Yes, this is what I was suspecting - slightly different conventions (in this 
case probably to do with which branch to deal with first) will lead to 
different results.
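
As a rough illustration of how a branch-ordering convention alone changes the string, here is a toy sketch in plain Python. It is not how RDKit, Open Babel or Daylight write SMILES: the ACETIC_ACID table and the write_smiles helper are made up for this example, and it only handles acyclic molecules. The same molecule is written twice, once visiting the higher-order bond first and once visiting it last.

```python
# Toy acyclic molecule (acetic acid). Each atom index maps to
# (symbol, [(bond_symbol, bond_order, child_index), ...]).
ACETIC_ACID = {
    0: ("C", [("", 1, 1)]),
    1: ("C", [("=", 2, 2), ("", 1, 3)]),
    2: ("O", []),
    3: ("O", []),
}


def write_smiles(tree, idx, high_order_first):
    """Depth-first writer for a tree; every child but the last becomes a (branch)."""
    symbol, children = tree[idx]
    # The only difference between the two conventions is this sort direction.
    ordered = sorted(children, key=lambda c: c[1], reverse=high_order_first)
    parts = [bond + write_smiles(tree, child, high_order_first)
             for bond, _order, child in ordered]
    if not parts:
        return symbol
    branches = "".join("(" + part + ")" for part in parts[:-1])
    return symbol + branches + parts[-1]


convention_a = write_smiles(ACETIC_ACID, 0, high_order_first=True)
convention_b = write_smiles(ACETIC_ACID, 0, high_order_first=False)
print(convention_a, convention_b)  # CC(=O)O CC(O)=O
```

Both strings are valid SMILES for the same molecule, and each convention is perfectly reproducible on its own. They just do not agree with each other, which is the same kind of difference as between the two aspirin strings.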
 
The book I was referring to is An Introduction to Chemoinformatics from A.R. 
Leach and V.J. Gillet. Yes, they refer to the CANGEN algorithm and to the 
Weininger paper you mentioned.
 
It doesn't matter, as long as I'm aware of the scope of 'uniqueness'.
 
Just out of interest, is the InChI representation more 'unique' across
different packages than canonical SMILES?
 
Thanks again,
 
George.
 
> From: da...@dalkescientific.com
> Date: Fri, 13 Feb 2009 18:38:21 +0100
> To: rdkit-discuss@lists.sourceforge.net
> Subject: Re: [Rdkit-discuss] Canonical SMILES
>
> On Feb 13, 2009, at 6:20 PM, George Oakman wrote:
> > One of the first examples I have been playing with is the canonical
> > SMILES for Aspirin.
> ..
> >
> > This gave me the following result:
> >
> > CC(Oc1ccccc1C(O)=O)=O
> >
> > But I was expecting
> >
> > CC(=O)Oc1ccccc1C(=O)O
>
> The canonical SMILES is canonical only in the context of an
> algorithm. The Daylight algorithm is different than the RDKit one is
> different from the OpenBabel one is different ... . In fact, the
> Daylight algorithm has changed over time to fix various problems.
>
> When that happens, the molecules need to be re-canonicalized.
>
> Even if you go back to the original Weininger paper, there are
> ambiguities in the description which make the result
> implementation-specific.
>
> Is the book you're using "Molecular Design" by Gisbert Schneider and
> Karl-Heinz Baringhaus? That came up when I searched for "canonical
> SMILES" and I see it has an example of aspirin with your expected SMILES.
>
> Andrew
> da...@dalkescientific.com

Re: [Rdkit-discuss] Canonical SMILES

2009-02-13 Thread Andrew Dalke

On Feb 13, 2009, at 6:20 PM, George Oakman wrote:
> One of the first examples I have been playing with is the canonical
> SMILES for Aspirin.

..

> This gave me the following result:
>
>   CC(Oc1ccccc1C(O)=O)=O
>
> But I was expecting
>
>   CC(=O)Oc1ccccc1C(=O)O


The canonical SMILES is canonical only in the context of an
algorithm. The Daylight algorithm is different from the RDKit one,
which is different from the OpenBabel one, and so on. In fact, the
Daylight algorithm has changed over time to fix various problems.


When that happens, the molecules need to be re-canonicalized.

Even if you go back to the original Weininger paper, there are  
ambiguities in the description which make the result implementation- 
specific.
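
One practical consequence, sketched in plain Python under my own assumptions: if canonical SMILES are stored as lookup keys, it may help to record which canonicalizer produced them, so that strings from different toolkits or versions are never compared directly. The CanonicalKey class, the same_structure helper and the toolkit names below are hypothetical and not part of any toolkit's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CanonicalKey:
    """A canonical SMILES tagged with the (hypothetical) canonicalizer that wrote it."""
    toolkit: str
    version: str
    smiles: str


def same_structure(a: CanonicalKey, b: CanonicalKey) -> Optional[bool]:
    """Compare two keys; return None when they come from different canonicalizers.

    A None result means "re-canonicalize both with one toolkit first",
    not "the molecules differ".
    """
    if (a.toolkit, a.version) != (b.toolkit, b.version):
        return None
    return a.smiles == b.smiles


# Made-up toolkit names and versions, for illustration only.
key_a = CanonicalKey("toolkit-a", "1.0", "CC(=O)O")
key_b = CanonicalKey("toolkit-b", "2.3", "CC(O)=O")
key_c = CanonicalKey("toolkit-a", "1.0", "CC(=O)O")

comparable = same_structure(key_a, key_c)      # same canonicalizer: string equality is meaningful
not_comparable = same_structure(key_a, key_b)  # different canonicalizers: undecided
```

The version is part of the tag because a canonicalization algorithm can change between releases, at which point the stored strings need to be regenerated.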


Is the book you're using "Molecular Design" by Gisbert Schneider and
Karl-Heinz Baringhaus? That came up when I searched for "canonical
SMILES" and I see it has an example of aspirin with your expected SMILES.



Andrew
da...@dalkescientific.com