Re: [Rdkit-discuss] RDKit appears to be parsing SMILES stereochemistry differently

Andrew Dalke Thu, 09 Nov 2017 11:15:56 -0800

On Nov 9, 2017, at 16:09, Brian Cole <[email protected]> wrote:
> Here's an example of why this is useful at maintaining molecular 
> fragmentation inside your molecular representation:
> 
>  >>> from rdkit import Chem
>  >>> smiles = 'F9.[C@]91(C)CCO1'
>  >>> fluorine, core = smiles.split('.')
>  >>> fluorine
>  'F9'
>  >>> fragment = core.replace('9', '([*:9])')

Somehow you got the code to generate a "9" for that ring closure, which is not 
something RDKit does on its own, so we are only seeing one step of a larger 
workflow.

The steps you gave go through a number of transformations to convert:

  [C@]91(C)CCO1

so that the 4th atom carries an '8' ring-closure label as an attachment point, 
that is:

  [C@]91(C)CC8O1

Since you are already comfortable editing the SMILES string, a faster solution 
is to bypass the toolkit and do the whole transformation lexically, as in:

########
import re

# Match the SMILES term for one atom (its ring closures are not part of the match)
atom_pattern = re.compile(r"""
(
 Cl? |             # C and Cl, and
 Br? |             # B and Br, are part of the organic subset
 [NOSPFIbcnosp*] |  # as are these single-letter elements
 \[[^]]*\]         # everything else must be in []s
)
""", re.X)

smiles = 'F9.[C@]91(C)CCO1'
fluorine, core = smiles.split('.')
matches = list(atom_pattern.finditer(core))
m = matches[3]
new_core = core[:m.end()] + "8" + core[m.end():]
print(new_core)
########
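
As a rough sketch, the same idea can be wrapped in a small function. The name 
add_closure and its arguments are made up for illustration; they are not part 
of RDKit or mmpdb. It puts the new label directly after the atom term, in front 
of any closures that atom already has. That choice is an assumption on my part: 
the position of a closure relative to the others sets the neighbor order that 
@/@@ is read against, so treat the placement as something to check whenever the 
target atom is a stereocenter.

```python
import re

# Same pattern as in the snippet above: one atom term, without its closures.
atom_pattern = re.compile(r"""
(
 Cl? |              # C or Cl
 Br? |              # B or Br
 [NOSPFIbcnosp*] |  # other organic-subset atoms and '*'
 \[[^]]*\]          # bracket atoms
)
""", re.X)


def add_closure(smiles, atom_index, label):
    """Insert the closure string `label` right after the atom term at the
    0-based position `atom_index` in `smiles` (hypothetical helper)."""
    matches = list(atom_pattern.finditer(smiles))
    if not 0 <= atom_index < len(matches):
        raise IndexError("no atom term at position %d" % atom_index)
    end = matches[atom_index].end()
    return smiles[:end] + label + smiles[end:]


print(add_closure("[C@]91(C)CCO1", 3, "8"))
print(add_closure("[C@]91(C)C(C)O1", 3, "8"))
```

The first call reproduces the result of the snippet above; the second is the 
branched case, where the label lands inside the branch.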

Also, this:

  >>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)

is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee that 
the nth atom term in the input SMILES ends up as the atom with index n. It's 
close, but, for example, explicit '[H]' atoms are usually turned into implicit 
hydrogen counts, which shifts the indices of the atoms that follow.
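
To make the index issue concrete, here is a purely lexical sketch that lists 
the atom terms of a SMILES and estimates where a term would end up if every 
plain '[H]' term were dropped. Both function names are hypothetical, and the 
"drop every [H], change nothing else" model is a simplifying assumption, not a 
description of what RDKit is guaranteed to do. It is only meant as a quick 
check on whether a hard-coded index is safe.

```python
import re

# One atom term: organic-subset atom, '*', or a bracket atom.
atom_pattern = re.compile(r"Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]")


def atom_terms(smiles):
    """Return the atom terms of `smiles` in the order they are written."""
    return [m.group() for m in atom_pattern.finditer(smiles)]


def index_without_explicit_h(smiles, term_index):
    """Estimate the index of the term at `term_index` after all plain '[H]'
    terms are removed. Returns None if that term is itself '[H]'.
    ASSUMPTION: nothing other than '[H]' removal changes the order."""
    terms = atom_terms(smiles)
    if terms[term_index] == "[H]":
        return None
    return term_index - terms[:term_index].count("[H]")


smiles = "[H]C(F)(Cl)Br"
print(atom_terms(smiles))
print(index_without_explicit_h(smiles, 3))
```

In this example the 'Cl' term is at position 3 in the string but is estimated 
to sit at position 2 once the leading '[H]' is folded away.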

Finally, there's another assumption in:

  >>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')

Sometimes the dummy atom will not be inside parentheses, so the replace() never 
matches. For example, the same transformation on:

  F9.[C@]91(C)C(C)O1

produces a new_core of:

  C[C@@]19OC1C[*:8]

when you want it to produce:

  C[C@@]19OC1C8

For what it's worth, the re-based version generates:

  [C@]91(C)C(C8)O1
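
One way to relax the ()s assumption is to match the dummy atom both with and 
without its parentheses. The following is a minimal sketch under some loud 
assumptions: the function names are made up, the dummy atoms are written 
exactly as [*:N], each one is a terminal atom joined to the preceding atom, and 
the input string in the example is my guess at what the intermediate SMILES 
looked like. A dummy atom at the start of a component, or in the middle of a 
chain, needs a different lexical rewrite and is rejected here.

```python
import re

# A mapped dummy atom, either as a whole branch "([*:8])" or bare "[*:8]".
dummy_pattern = re.compile(r"\(\[\*:(\d+)\]\)|\[\*:(\d+)\]")


def closure_token(n):
    """Format n as a SMILES ring-closure label: one digit, or %nn."""
    if 0 <= n <= 9:
        return str(n)
    if 10 <= n <= 99:
        return "%{:02d}".format(n)
    raise ValueError("closure label out of range: %d" % n)


def dummies_to_closures(smiles):
    """Replace terminal [*:N] dummy atoms with closure label N
    (hypothetical helper, see the assumptions in the text)."""
    def replace(m):
        start, end = m.start(), m.end()
        # A dummy that opens a component has no preceding atom to label.
        if start == 0 or smiles[start - 1] == ".":
            raise ValueError("leading dummy atom is not handled")
        # A bare dummy must end the string, a branch, or a component.
        if m.group(2) is not None and end < len(smiles) and smiles[end] not in ").":
            raise ValueError("non-terminal dummy atom is not handled")
        return closure_token(int(m.group(1) or m.group(2)))
    return dummy_pattern.sub(replace, smiles)


print(dummies_to_closures("C[C@@]1([*:9])OC1C[*:8]"))
```

The rejected cases are exactly the ones where further lexical transformations 
are needed to keep the SMILES valid and the stereochemistry intact.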

On Nov 9, 2017, at 16:27, Chris Earnshaw <[email protected]> wrote:
> Trouble is, you're mixing chemical operations and lexical ones.

Agreed.

> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.

This is exactly what mmpdb does, though in Python rather than Perl. If anyone 
is interested, see 
https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .

Cheers,

                                Andrew
                                [email protected]
