On Nov 9, 2017, at 16:09, Brian Cole <[email protected]> wrote:
> Here's an example of why this is useful at maintaining molecular
> fragmentation inside your molecular representation:
>
> >>> from rdkit import Chem
> >>> smiles = 'F9.[C@]91(C)CCO1'
> >>> fluorine, core = smiles.split('.')
> >>> fluorine
> 'F9'
> >>> fragment = core.replace('9', '([*:9])')
Somehow you got the code to generate a "9" for that ring closure, which is not
something that RDKit does naturally, so we are only seeing one step toward your
larger goal.
The step you gave does a number of transformations to convert:
[C@]91(C)CCO1
so the 4th atom has an '8' as an attachment point, that is:
[C@]91(C)CC8O1
Since you are already comfortable manipulating the SMILES string directly, a
faster solution is to bypass the toolkit and insert the closure into the text
yourself, as in:
########
import re
# Match the SMILES for a single atom term. Any ring closures that
# follow the atom are not part of the match.
atom_pattern = re.compile(r"""
(
Cl? | # Cl and Br are part of the organic subset
Br? |
[NOSPFIbcnosp*] | # as are these single-letter elements
\[[^]]*\] # everything else must be in []s
)
""", re.X)
smiles = 'F9.[C@]91(C)CCO1'
fluorine, core = smiles.split('.')
matches = list(atom_pattern.finditer(core))
m = matches[3]
new_core = core[:m.end()] + "8" + core[m.end():]
print(new_core)
########
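The same idea can be wrapped in a small helper. This is only a sketch: the name add_closure and its arguments are made up for this email, it handles single-digit closures only, and it does not check whether the new closure changes the neighbor order around a chiral atom.

```python
import re

# Same atom pattern as above: organic-subset atoms, '*', and bracket atoms.
atom_pattern = re.compile(r"""
(
 Cl? |              # C or Cl
 Br? |              # B or Br
 [NOSPFIbcnosp*] |  # the other single-letter atoms, plus the wildcard
 \[[^]]*\]          # everything else must be in []s
)
""", re.X)

def add_closure(smiles, atom_index, label):
    """Insert a ring-closure digit right after the atom_index-th atom term.

    atom_index counts atom terms in the text, starting from 0.
    (Hypothetical helper; not part of RDKit.)
    """
    if label not in range(10):
        raise ValueError("this sketch only handles closures 0-9")
    matches = list(atom_pattern.finditer(smiles))
    m = matches[atom_index]
    return smiles[:m.end()] + str(label) + smiles[m.end():]

print(add_closure('[C@]91(C)CCO1', 3, 8))
```

Because the index refers to the position of the atom term in the text, it does not depend on how a toolkit numbers the atoms after parsing.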
Also, this:
>>> mol.AddBond(idx, 4, Chem.rdchem.BondType.SINGLE)
is a piece of magic. Where does the 4 come from? RDKit doesn't guarantee that
the nth atom term in the input SMILES ends up as the nth atom of the molecule.
It's close, but, for example, explicit '[H]' atoms are usually turned into
implicit hydrogen counts.
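To make the bookkeeping problem concrete, here is a text-only sketch. It does not call RDKit. It simply assumes that explicit '[H]' terms are dropped and that the remaining atoms keep their input order, which is exactly the kind of assumption that is not guaranteed. The function name term_to_heavy_index is made up for illustration.

```python
import re

# Atom terms: organic-subset atoms, '*', and bracket atoms.
atom_pattern = re.compile(r"Cl?|Br?|[NOSPFIbcnosp*]|\[[^]]*\]")

def term_to_heavy_index(smiles):
    """Map each atom term position to its index once '[H]' terms are dropped.

    ASSUMPTION: every '[H]' term disappears and all other atoms keep
    their input order. A real toolkit may behave differently.
    """
    mapping = {}
    heavy = 0
    for term_index, m in enumerate(atom_pattern.finditer(smiles)):
        if m.group() == '[H]':
            mapping[term_index] = None   # no atom of its own afterwards
        else:
            mapping[term_index] = heavy
            heavy += 1
    return mapping

print(term_to_heavy_index('[H]C(F)(Cl)Br'))
```

Under that assumption the carbon is atom term 1 in the text but atom 0 afterwards, so a hard-coded index is off by one as soon as an explicit hydrogen comes first.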
Finally, there's another assumption in:
>>> new_core = new_core.replace('([*:9])', '9').replace('([*:8])', '8')
Sometimes the dummy atom in the result will not be inside of ()s. For example,
the same transformation on:
F9.[C@]91(C)C(C)O1
produces a new_core of:
C[C@@]19OC1C[*:8]
when you want it to produce:
C[C@@]19OC1C8
For what it's worth, the re-based version generates:
[C@]91(C)C(C8)O1
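If you do want to keep the dummy-atom route, the replacement step can be made to accept both forms. The following is a minimal sketch, not a complete solution: the name dummies_to_closures is made up, it handles single-digit labels only, it assumes each dummy atom has exactly one neighbor written before it, and it does not check stereochemistry.

```python
import re

# A labeled dummy atom such as [*:8], with or without enclosing ()s.
dummy_pattern = re.compile(r"\(\[\*:(\d)\]\)|\[\*:(\d)\]")

def dummies_to_closures(smiles):
    """Replace '([*:n])' or '[*:n]' with the ring-closure digit n.

    (Hypothetical helper; a dummy at the start of the string needs
    more lexical work than this sketch does, so it is rejected.)
    """
    def replace(m):
        if m.start() == 0:
            raise ValueError("dummy atom at the start is not handled")
        # Exactly one of the two groups matched; return its digit.
        return m.group(1) or m.group(2)
    return dummy_pattern.sub(replace, smiles)

print(dummies_to_closures('C[C@@]19OC1C[*:8]'))
```

A dummy atom at the start of the SMILES, or next to a chiral atom, is where the additional lexical transformations come in.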
On Nov 9, 2017, at 16:27, Chris Earnshaw <[email protected]> wrote:
> Trouble is, you're mixing chemical operations and lexical ones.
Agreed.
> I've written code in the past to do this kind of thing for virtual
> library building, using dummy atoms to mark link positions in the
> fragments, and using Perl code to transform between the dummy atoms
> and bond-closure numbers to give text strings which could be assembled
> to give valid dot-disconnected SMILES. This required additional
> lexical transformations in order to maintain valid SMILES depending on
> where the dummy atom was, and to make sure that stereochemistry worked
> properly. If you want to do this kind of thing I don't think you can
> expect to avoid these additional lexical operations.
This is exactly what mmpdb does, although in Python code. If anyone is
interested, see
https://github.com/rdkit/mmpdb/blob/master/mmpdblib/smiles_syntax.py .
Cheers,
Andrew
[email protected]