Re: [Cdk-user] preserve aromaticity on SMILES output

Andrew Dalke Wed, 25 Jun 2025 07:05:43 -0700

Hi John,

> On Jun 25, 2025, at 10:15, John Mayfield <john.wilkinson...@gmail.com> wrote:
> Even if you don't add any listeners there is an overhead of dispatching the 
> edit events so it is better to avoid this.


I will, to use language I learned from WWII submarine fiction, rig for silent 
running.

> Molecule Standard Form
> 
> We (CDK) try to impose very little automation/sanitisation by default, rather 
> than Daylight's dt_mod on/off and RDKit's sanitization it is more similar to 
> OEChem in that the molecule comes out of the readers as they were described 
> in the input.

I can appreciate that. As I recall (it's been years since I looked at the 
OEChem docs), the OEChem docs listed the recommended set of operations for 
those using the low-level API.

For example, my code used to do

  OEParseSmiles(mol, content, canon, strict)
  OEAssignAromaticFlags(mol, aromaticity_model)

They later added a single function call variant:

  OEReadMolFromBytes(mol, oeformat, flavor, gzip, content)

which handles the appropriate steps. This simplified my code as I don't need 
that flexibility.

> We go a little further and don't even do ring perception (is in ring: 
> true/false). Most common formats (SMILES/MOLfile/InChI/CML) will set the 
> hydrogen counts for you but some older formats (PDB/XYZ) will not. 

Is there documentation for the needed steps? I want to make sure I support the 
primary formats correctly.

As for the less common formats, when I added CDK support back in 2021 I tried 
to support the XYZ format, but ended up noting "I can't figure out how to read 
an XYZ file and assign the correct bond types (RebondTool only assigns single 
bonds and FixBondOrdersTool doesn't add them)." I also noted "can't get mol2 to 
create a SMILES so only do basic tests".

That said, I don't think mol2 or XYZ format support is all that useful. I 
haven't come across anyone using either format for a long time. As I recall, Greg 
Landrum's viewpoint is that people should use Open Babel to convert to a more 
mainstream format.

There are also readers I don't even touch, like Mopac7Reader or ShelXReader. :)

> A Pattern for matching a single SMARTS query against multiple target 
> compounds. The class can be used for efficiently matching many queries 
> against a single target if setPrepare(boolean) is disabled 
> (prepare(IAtomContainer)) should be called manually once for each molecule.

Yes, now that I know what I'm looking at, I can see that getBitFingerprint() 
for both PubchemFingerprinter and MACCSFingerprinter call:

        SmartsPattern.prepare(container);

If I follow the code correctly this means SMARTS-based fingerprinting always 
triggers aromaticity re-perception.

For example, if I use the same molecule to generate both MACCS and Pubchem 
fingerprints then both will do:

  Cycles.markRingAtomsAndBonds(target);
  Aromaticity.apply(Aromaticity.Model.Daylight, target);

even if input processing has already done this step.

It also means that if input processing uses a different model, like 
Aromaticity.Model.Mdl (picking one available from that class), then I need to 
pass a copy to the fingerprinter if I don't want the assignments to possibly 
change.
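
A minimal sketch of the "pass a copy" idea, assuming `mol` is a CDK
IAtomContainer reached through JPype and `fingerprinter` is any CDK
IFingerprinter. The helper name `fingerprint_on_copy` is hypothetical; only
clone() and getBitFingerprint() are CDK calls.

```python
def fingerprint_on_copy(fingerprinter, mol):
    """Fingerprint a clone so the caller's aromaticity flags stay untouched."""
    # IAtomContainer.clone() copies the atoms and bonds, so any flags the
    # fingerprinter rewrites during its internal prepare step land on the
    # scratch copy rather than on the molecule the caller still holds.
    scratch = mol.clone()
    return fingerprinter.getBitFingerprint(scratch)
```

The cost is one clone per fingerprint call, which may or may not matter
compared to the re-perception itself.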

> If you have multiple patterns to match what you want to do is something like 
> this:
> 
> 0. patterns <- load SMARTS/prepare patterns, set prepare false
> 1. Read Molecule (mol)
> 2. Set ring flags
> 3. Set aromaticity 
> 4. for pat in patterns: pat.match(mol)
> 
> Steps 2/3 can be replaced with prepare, if you have pre-calculated and store 
> aromaticity (e.g. in SMILES) then you can skip step 3 as the input 
> aromaticity flags will be preserved.
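
The quoted steps might be sketched in Python roughly as follows. This is an
illustration under assumptions, not code from CDK or chemfp: `load_patterns`
and `match_all` are hypothetical names, and the SmartsPattern class is passed
in as an argument (with JPype it would come from something like
"from org.openscience.cdk.smarts import SmartsPattern", depending on the CDK
version). It uses the static SmartsPattern.prepare() in place of steps 2 and 3.

```python
def load_patterns(SmartsPattern, smarts_list):
    """Step 0: compile each SMARTS once and disable per-match preparation."""
    patterns = []
    for smarts in smarts_list:
        pat = SmartsPattern.create(smarts)
        pat.setPrepare(False)  # the caller promises to prepare each molecule
        patterns.append(pat)
    return patterns


def match_all(SmartsPattern, patterns, mol):
    """Steps 2-4: prepare the molecule once, then test every pattern."""
    # One call sets the ring flags and the (Daylight) aromaticity flags.
    SmartsPattern.prepare(mol)
    return [bool(pat.matches(mol)) for pat in patterns]
```

The point of the split is that perception runs once per molecule instead of
once per (pattern, molecule) pair.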

Because of the chemfp design, my input reader doesn't know if the created 
molecules will be used for fingerprinting or for format conversion, so I need 
to always do steps 2 and 3.

I also don't have a way to distinguish between the built-in CDK fingerprint 
types which always prepare, and my own fingerprint types which expect prepared 
molecules.

I think this means, at least for chemfp, that I should always prepare the 
molecules as I read them, using the Daylight model, so that my own fingerprint 
types can assume the inputs are always properly prepared.
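
A sketch of what "always prepare on read" could look like, assuming a CDK
SmilesParser plus the Cycles and Aromaticity classes obtained through JPype.
The generator name `read_prepared` is hypothetical and is not part of chemfp
or CDK; the classes are passed in so the function does not depend on a
running JVM at import time.

```python
def read_prepared(smiles_parser, smiles_records, Cycles, Aromaticity):
    """Yield molecules with ring and Daylight aromaticity flags already set."""
    for smiles in smiles_records:
        mol = smiles_parser.parseSmiles(smiles)
        # Step 2: ring membership flags.
        Cycles.markRingAtomsAndBonds(mol)
        # Step 3: aromaticity, using the same model the SMARTS matcher uses.
        Aromaticity.apply(Aromaticity.Model.Daylight, mol)
        yield mol
```

Downstream code, whether fingerprinting or format conversion, then sees the
same flags regardless of which path it takes.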

 
> Sorry I meant if you knew the steps to reproduce/which aromaticity model did 
> you use..? The standard Daylight model used by the SMARTS matcher would find 
> the externeral porphyrin ring aromatic hence I'm not sure how you would get 
> that unless you used a different aromaticity model (e.g. tighter ring set) 
> before writing to SMILES.

The problem is that I didn't use any explicit aromaticity perception.

Here's my reproducer:

=============
import jpype  # Must install JPype to interface to the CDK jar
import jpype.imports # configure the import hooks
import jpype.nio

jpype.startJVM(None, '-Djava.awt.headless=true')

from org.openscience import cdk
from org.openscience.cdk.smiles import (
    SmilesParser,
    SmilesGenerator,
    SmiFlavor)

smiles = (
    "OCCO[P+]1(OCCO)n2c3ccc2/C(c2ccccc2)=C2/C=CC(=N2)/C(c2ccccc2)"
    # Raw string: "\c" is not a valid Python escape sequence.
    r"=c2/cc/c(n21)=C(\c1ccccc1)C1=NC(=C3c2ccccc2)C=C1 CHEMBL2369103")

_default_builder = cdk.DefaultChemObjectBuilder.getInstance()

smiles_parser = SmilesParser(_default_builder)
mol = smiles_parser.parseSmiles(smiles)

if 0:
    # Missing perception
    from org.openscience.cdk.graph import Cycles
    from org.openscience.cdk.aromaticity import Aromaticity
    Cycles.markRingAtomsAndBonds(mol)
    Aromaticity.apply(Aromaticity.Model.Daylight, mol)

for flavor_name, flavor in (
        ("Default", SmiFlavor.Default),
        ("Default|UseAromaticSymbols",
         SmiFlavor.Default | SmiFlavor.UseAromaticSymbols),
        ):
    smiles_generator = SmilesGenerator(flavor)
    out_smiles = str(smiles_generator.create(mol))
    print(f"-- {flavor_name}:")
    print(out_smiles)
    print()
=============

The above prints

-- Default:
OCCO[P+]1(OCCO)N2C3=CC=C2/C(/C4=CC=CC=C4)=C\5/C=CC(=N5)C(C6=CC=CC=C6)=C7C=CC(N71)=C(C8=CC=CC=C8)C9=NC(=C3C%10=CC=CC=C%10)C=C9

-- Default|UseAromaticSymbols:
OCCO[P+]1(OCCO)n2c3ccc2/C(/c4ccccc4)=C\5/C=CC(=N5)C(c6ccccc6)=c7ccc(n71)=C(c8ccccc8)C9=NC(=C3c%10ccccc%10)C=C9

With the missing perception step enabled (change the "if 0:" to "if 1:"), I 
get what I expected from using CDK Depict:

-- Default:
OCCO[P+]1(OCCO)N2C3=CC=C2/C(/C4=CC=CC=C4)=C\5/C=CC(=N5)C(C6=CC=CC=C6)=C7C=CC(N71)=C(C8=CC=CC=C8)C9=NC(=C3C%10=CC=CC=C%10)C=C9

-- Default|UseAromaticSymbols:
OCCO[P+]1(OCCO)n2c3ccc2c(-c4ccccc4)c5C=Cc(n5)c(-c6ccccc6)c7ccc(n71)c(-c8ccccc8)c9nc(c3-c%10ccccc%10)C=C9



> Hopefully that covers everything but let me know if you have any more 
> questions/thoughts.

I think it does. Thanks!

                                        Andrew
                                        da...@dalkescientific.com




_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user
