Hi Cheminformaticians,

This is an extreme subtlety in the interpretation of SMILES atom
stereochemistry and I think a bug in RDKit. Specifically, I think the
following SMILES should be the same molecule:

>>> rdkit.__version__
'2017.09.1'
>>> Chem.CanonSmiles('F[C@@]1(C)CCO1')
'C[C@]1(F)CCO1'
>>> Chem.CanonSmiles('[C@@](F)1(C)CCO1')
'C[C@@]1(F)CCO1'

Since there is no hydrogen inside the stereo carbon atom block the bond
being 'looked down' should be the first atom encountered. In both cases
above, that should be the Florine, therefore the molecules should be
equivalent.

Though it could be argued the 2nd one is not strict SMILES as Andrew
describes here: https://github.com/rdkit/rdkit/issues/786

It is useful when recombining fragments with ring closure digits for these
to be equivalent:
[*][C@]1(C)CCO1
[C@]([*])1(C)CCO1

Also, every other tool I can get my hands on agrees they're the same:
OEChem, OpenBabel, indigo, and ChemAxon. (CDK lacks a simple enough
canonicalization example for me to work from.)

Sure wish there was a SMILES validation test suite we could all run
against. And so I'm attaching the examples I used to verify the above so
whatever poor soul assigned that task later can find this on Google. (I'm
hopeful :-)

Thanks,
Brian

PS: the current output from the script:

$ python stereo_handling_first_atom.py
RDKit = 2017.09.1
OEChem = 2.1.2
OpenBabel = 2.4.1
indigo = 1.2.3.r0-g98188eb mac10.7
RDKit failed to recognize these as the same:
[*:1][C@]1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@]1([*:1])[*:2]
[C@]([*:1])1([*:2])CC1(Cl)Cl -> ClC1(Cl)C[C@@]1([*:1])[*:2]
OpenBabel failed to recognize these as the same:
Cl[S@](C)=O -> C[S@](=O)Cl
[S@](Cl)(C)=O -> C[S@@](=O)Cl
Indigo failed to recognize these as the same:
Cl[S@](C)=O -> C[S@](=O)Cl
[S@](Cl)(C)=O -> C[S@@](=O)Cl
OpenBabel failed to recognize these as the same:
Cl[S@](C)=CCCC -> CCCC=[S@](Cl)C
[S@](Cl)(C)=CCCC -> CCCC=[S@@](Cl)C
Indigo failed to recognize these as the same:
Cl[S@](C)=CCCC -> CCCC=[S@@](C)Cl
[S@](Cl)(C)=CCCC -> CCCC=[S@](C)Cl
RDKit failed to recognize these as the same:
Cl[C@](F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@](F)(Cl)CC1
[C@](Cl)(F)1CC[C@H](F)CC1 -> F[C@H]1CC[C@@](F)(Cl)CC1
RDKit failed to recognize these as the same:
Cl[C@]1(c2ccccc2)NCCCS1 -> Cl[C@]1(c2ccccc2)NCCCS1
[C@](Cl)1(c2ccccc2)NCCCS1 -> Cl[C@@]1(c2ccccc2)NCCCS1
RDKit failed to recognize these as the same:
Cl3.[C@]31(c2ccccc2)NCCCS1 -> Cl[C@]1(c2ccccc2)NCCCS1
[C@](Cl)1(c2ccccc2)NCCCS1 -> Cl[C@@]1(c2ccccc2)NCCCS1
RDKit failed to recognize these as the same:
Cl[C@](F)1C2C(C1)CNC2 -> F[C@@]1(Cl)CC2CNCC21
[C@](Cl)(F)1C2C(C1)CNC2 -> F[C@]1(Cl)CC2CNCC21
RDKit failed to recognize these as the same:
[*][C@@H]1CO1 -> [*][C@@H]1CO1
[C@H]([*])1CO1 -> [*][C@H]1CO1
RDKit failed to recognize these as the same:
[*][C@@]1(C)CCO1 -> [*][C@@]1(C)CCO1
[C@@]([*])1(C)CCO1 -> [*][C@]1(C)CCO1
RDKit failed to recognize these as the same:
F[C@@]1(C)CCO1 -> C[C@]1(F)CCO1
[C@@](F)1(C)CCO1 -> C[C@@]1(F)CCO1
RDKit failed to recognize these as the same:
Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@H](Cl)[C@H]1Cl
[C@H](Cl)1[C@@H](Cl)C(Cl)CCN1 -> ClC1CCN[C@@H](Cl)[C@H]1Cl
from __future__ import print_function
import sys
sys.path.append('/Users/coleb/indigo-python')

import rdkit
from rdkit import Chem

from openeye.oechem import *

import pybel
import openbabel

import indigo

print("RDKit =", rdkit.__version__)
print("OEChem =", OEChemGetRelease())
print("OpenBabel =", openbabel.OBReleaseVersion())
print("indigo =", indigo.Indigo().version())

def indigoCanSmi(smi):
    return indigo.Indigo().loadMolecule(smi).canonicalSmiles()

def OBCanSmi(smi):
    return pybel.readstring("smi", smi).write("can").strip()

def OECanSmi(smi):
    mol = OEGraphMol()
    OESmilesToMol(mol, smi)
    return OEMolToSmiles(mol)

SMILES_THAT_ARE_THE_SAME = [
    # examples from OpenSMILES spec, every toolkits agrees on these at least
    ('N[C@H](O)C',       '[C@@H](N)(O)C'),
    ('N[C@@](Br)(C)O',   'N[C@](Br)(O)C'),
    ('Br[C@](N)(C)O',    'O[C@](Br)(C)N'),

    # examples with attachment points every toolkit agrees on
    ('[*][C@@H](C)N',    '[C@H]([*])(C)N'),
    ('[*][C@@](F)(C)N',  '[C@@]([*])(F)(C)N'),

    # examples from Dalke 2017 UGM talk
    ('[*][C@](N)(O)S', '[C@]([*])(N)(O)S'),
    ('[*][C@H](O)S',   '[C@@H]([*])(O)S'),
    ('[*:1][C@]1([*:2])CC1(Cl)Cl', '[C@]([*:1])1([*:2])CC1(Cl)Cl'), # RDKit thinks these are different, and ChemAxon removes this stereochemistry entirely

    # example from Dalke report here: https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg05296.html
    ('CN[S@](c1ccccc1)=O', 'CN[S@]2=O.c12ccccc1'),

    # non-ring cases with no hydrogen property
    ('Cl[C@](F)(Br)(O)', '[C@](Cl)(F)(Br)(O)'),

    # OpenBabel and Indigo have issues with sulfur chirality
    ('Cl[S@](C)=O',      '[S@](Cl)(C)=O'),    # ChemAxon flips this chirality
    ('Cl[S@](C)=CCCC',   '[S@](Cl)(C)=CCCC'), # ChemAxon removes this chirality entirely

    # ring cases, RDKit thinks all of these are different isomers
    ('Cl[C@](F)1CC[C@H](F)CC1',    '[C@](Cl)(F)1CC[C@H](F)CC1'),
    ('Cl[C@]1(c2ccccc2)NCCCS1',    '[C@](Cl)1(c2ccccc2)NCCCS1'),
    ('Cl3.[C@]31(c2ccccc2)NCCCS1', '[C@](Cl)1(c2ccccc2)NCCCS1'),
    ('Cl[C@](F)1C2C(C1)CNC2',      '[C@](Cl)(F)1C2C(C1)CNC2'),
    ('[*][C@@H]1CO1',              '[C@H]([*])1CO1'),     # with hydrogen property
    ('[*][C@@]1(C)CCO1',           '[C@@]([*])1(C)CCO1'), # without hydrogen property
    ('F[C@@]1(C)CCO1',             '[C@@](F)1(C)CCO1'), # with a real atom to be sure

    # bridge-head ring case, RDKit agrees with all the other toolkits on this case
    ('[*][C@@]12CNC[C@H]1C2',   '[C@@]([*])12CNC[C@H]1C2'),

    # ring cases from https://github.com/rdkit/rdkit/commit/a2fe13f85a89cd66903ca63edeec0f99b61e8185#diff-6c7d3f98ef92f3b9cfd041f772c442f9R29
    ('C1CN[C@](O)(N)1',  'C1CN[C@]1(O)(N)'), # all toolkits agree
    ('C1CN[C@H]1O',      'C1CN[C@@H](O)1'),
    ('C1CN[C@]12(O).N2', 'C1CN[C@](O)12.N2'),
    ('C[C@]1(O)NCC1',    'C[C@@](O)1NCC1'),
    ('C[C@]1(NCC1)O',    'C[C@](NCC1)(O)1'),

    # ring case from OpenSMILES spec
    ('N1[C@H](Cl)[C@@H](Cl)C(Cl)CC1', 'ClC1CCN[C@H](Cl)[C@H]1Cl'),     # all toolkits agree
    ('Cl[C@@H]1[C@@H](Cl)C(Cl)CCN1', '[C@H](Cl)1[C@@H](Cl)C(Cl)CCN1'), # RDKit thinks these are different

    # dot disconnected stereo case all toolkits agree on
    ('Cl3.[C@]31(c2ccccc2)NCCCS1', 'Cl[C@]1(c2ccccc2)NCCCS1'),
]
for left, rght in SMILES_THAT_ARE_THE_SAME:
    rdleft = Chem.CanonSmiles(left)
    rdrght = Chem.CanonSmiles(rght)
    if rdleft != rdrght:
        print("RDKit failed to recognize these as the same:")
        print(left, '\t->\t', rdleft)
        print(rght, '\t->\t', rdrght)

    oeleft = OECanSmi(left)
    oerght = OECanSmi(rght)
    if oeleft != oerght:
        print("OEChem failed to recognize these as the same:")
        print(left, '\t->\t', oeleft)
        print(rght, '\t->\t', oerght)

    obleft = OBCanSmi(left)
    obrght = OBCanSmi(rght)
    if obleft != obrght:
        print("OpenBabel failed to recognize these as the same:")
        print(left, '\t->\t', obleft)
        print(rght, '\t->\t', obrght)

    inleft = indigoCanSmi(left)
    inrght = indigoCanSmi(rght)
    if inleft != inrght:
        print("Indigo failed to recognize these as the same:")
        print(left, '\t->\t', inleft)
        print(rght, '\t->\t', inrght)
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Reply via email to