On May 25, 2023, at 21:13, Tom Hubbard <[email protected]> wrote:
> An InChI could not be generated and used to canonise SMILES: null
>
> Could not generate InChI Numbers: Too many atoms [did you forget
> 'LargeMolecules' switch?]
CDK uses InChI to generate absolute SMILES. Here's a comment from the code:
* Create a absolute SMILES generator. Unique SMILES uses the InChI to
* canonise SMILES and encodes isotope or stereo-chemistry. The InChI
* module is not a dependency of the SMILES module but should be present
* on the classpath when generation absolute SMILES.
If you remove either the SmiFlavor.Canonical or the SmiFlavor.Isomeric bit flag
from your output flavor then you'll get a SMILES, though it won't be an
absolute SMILES.
More specifically, CDK uses InChI to generate the atom labels used during
canonical SMILES generation. In cdk/smiles/SmilesGenerator.java there's a code
path which looks like:
    // apply the canonical labelling
    if (SmiFlavor.isSet(flavour, SmiFlavor.Canonical)) {
        // determine the output order
        int[] labels = labels(flavour, molecule);
where labels() is:
    private static int[] labels(int flavour, final IAtomContainer molecule)
            throws CDKException {
        // FIXME: use SmiOpt.InChiLabelling
        long[] labels = SmiFlavor.isSet(flavour, SmiFlavor.Isomeric) ?
                inchiNumbers(molecule)
                : Canon.label(molecule,
                              GraphUtil.toAdjList(molecule),
                              createComparator(molecule, flavour));
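To make that branch concrete, here is a minimal Python sketch of the decision
the two excerpts above describe. It is a model, not CDK code: the CANONICAL and
ISOMERIC bit values and the is_set() and labelling_path() helpers are made up
for illustration and are not taken from CDK's SmiFlavor. Only the choice of
labelling route is shown, not the labelling itself.

```python
# Illustrative bit values only; these are NOT taken from CDK's SmiFlavor.
CANONICAL = 0x1
ISOMERIC = 0x2
ABSOLUTE = CANONICAL | ISOMERIC


def is_set(flavor, flag):
    """Return True if any bit of `flag` is present in `flavor`."""
    return (flavor & flag) != 0


def labelling_path(flavor):
    """Name the atom-labelling route the quoted code would take."""
    if not is_set(flavor, CANONICAL):
        return "none"   # no canonical labelling is applied
    if is_set(flavor, ISOMERIC):
        return "inchi"  # inchiNumbers(): goes through InChI
    return "canon"      # Canon.label(): CDK's own labelling


print(labelling_path(ABSOLUTE))               # inchi
print(labelling_path(ABSOLUTE & ~CANONICAL))  # none
print(labelling_path(ABSOLUTE & ~ISOMERIC))   # canon
```

Only the "inchi" route involves InChI, which is why clearing either flag
sidesteps the InChI error quoted at the top.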
Thus, if both SmiFlavor.Canonical and SmiFlavor.Isomeric are set, it ends up
using code in cdk/graph/invariant/InChINumbersTools.java, which configures
InChI to do the atom order assignments via the 'auxiliary information':
    public static long[] getNumbers(IAtomContainer atomContainer)
            throws CDKException {
        String aux = auxInfo(atomContainer, new InchiFlag[0]);
        ...
    static String auxInfo(IAtomContainer container, InchiFlag... flags)
            throws CDKException {
        InChIGeneratorFactory factory = InChIGeneratorFactory.getInstance();
        boolean org = factory.getIgnoreAromaticBonds();
        factory.setIgnoreAromaticBonds(true);
        InChIGenerator gen = factory.getInChIGenerator(container, flags);
        // an option on the singleton so we should reset for others
        factory.setIgnoreAromaticBonds(org);
        if (gen.getStatus() == InchiStatus.ERROR)
            throw new CDKException("Could not generate InChI Numbers: " +
                                   gen.getMessage());
        return gen.getAuxInfo();
That calls into the InChI library, which has this check (actually, it's in a
few places, all with the same idea):
    max_num_at = ip->bLargeMolecules ? MAX_ATOMS :
                                       NORMALLY_ALLOWED_INP_MAX_ATOMS;
    if (nNumAtoms >= max_num_at)
    {
        TREAT_ERR( *err, 0, "Too many atoms [did you forget 'LargeMolecules' switch?]" );
        *err = 70;
        orig_inp_data->num_inp_atoms = -1;
        goto err_exit;
    }
where
    #define MAX_ATOMS 32766
    #define NORMALLY_ALLOWED_INP_MAX_ATOMS 1024
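As a worked example of that test, here is a small Python sketch that mirrors
the comparison using the two constants above. The too_many_atoms() helper is my
own naming, not anything in InChI, and 1079 is the atom count of the test
molecule used later in this message.

```python
MAX_ATOMS = 32766                      # from the #define above
NORMALLY_ALLOWED_INP_MAX_ATOMS = 1024  # from the #define above


def too_many_atoms(num_atoms, large_molecules=False):
    """Mirror the quoted test: True means the input would be rejected."""
    limit = MAX_ATOMS if large_molecules else NORMALLY_ALLOWED_INP_MAX_ATOMS
    return num_atoms >= limit


print(too_many_atoms(1023))                        # False
print(too_many_atoms(1024))                        # True, the test is >=
print(too_many_atoms(1079))                        # True
print(too_many_atoms(1079, large_molecules=True))  # False
```

Because the comparison is >=, an input with exactly 1024 atoms is already
rejected unless the large-molecule option is on.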
The higher limit is enabled with the InChI flag 'LargeMolecules', documented in
https://github.com/dan2097/jna-inchi/blob/master/jna-inchi-api/src/main/java/io/github/dan2097/jnainchi/InchiFlag.java#L47
as:
    /** Allows input of molecules up to 32767 atoms [Produces 'InChI=1B'
        indicating beta status of resulting identifiers]*/
so it appears that changing cdk/graph/invariant/InChINumbersTools.java line 49
from:
    String aux = auxInfo(atomContainer, new InchiFlag[0]);
to pass InchiFlag.LargeMolecules in that array, instead of the empty
'new InchiFlag[0]', would make this work.
However, I'm not a Java developer and don't know how to make this change or
test it. I can say it does not seem to be user-configurable.
I am a Python developer, and I can reproduce the error using my 'chemfp
translate' tool, which uses a Java/Python bridge to work with the CDK. The
following uses RDKit to translate a FASTA sequence to an SDF with 1079 atoms:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta --out sdf | head -6
megatryp
RDKit
0 0 0 0 0 0 0 0 0 0999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1079 1232 0 0 0
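The 1079 and 1232 in the COUNTS line can be checked by hand. This sketch
assumes (my assumption, though it agrees with the counts above) that the SDF
holds only heavy atoms: free tryptophan, C11H12N2O2, has 15 heavy atoms and 16
bonds between them, and each peptide bond drops one oxygen, removing one C-O
bond and adding one C-N bond. The helper name is made up.

```python
TRP_HEAVY_ATOMS = 15  # C11 N2 O2
TRP_HEAVY_BONDS = 16  # 15 atoms, 2 rings: 15 - 1 + 2


def poly_trp_counts(n):
    """Heavy-atom and bond counts for a linear chain of n tryptophans."""
    atoms = TRP_HEAVY_ATOMS * n - (n - 1)  # one O lost per peptide bond
    bonds = TRP_HEAVY_BONDS * n            # C-O lost, C-N gained: net zero
    return atoms, bonds


print(poly_trp_counts(77))  # (1079, 1232)
print(poly_trp_counts(73))  # (1023, 1168)
print(poly_trp_counts(74))  # (1037, 1184)
```

So 77 residues is comfortably over InChI's default 1024-atom limit. By this
count, and assuming InChI sees the same number of atoms, 74 residues would be
the shortest poly-Trp chain that trips it.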
I can have it go from FASTA to SDF using RDKit then have CDK read the SDF to
produce the SMILES generation failure:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta --via sdf -U cdk --out smi
Error: CDK cannot create the SMILES string (input title='megatryp'): An InChI
could not be generated and used to canonise SMILES: null, file '<stdin>', line
1, record #1: first line is '>megatryp'. Skipping.
(The --via option defaults to 'sdf', so I'll omit it in the rest.)
I can configure the CDK SMILES writer to use the Default flavor, but without
the 'Canonical' option, to show that this work-around gives a (non-canonical)
SMILES:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta -U cdk --out smi \
    -W flavor=Default,-Canonical | fold | head -2
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)NC(C(=O)
Here I'll disable Isomeric instead, so it should be canonical but not isomeric,
which might be okay for you:
% python -c 'print(">megatryp\n" + "W"*77 + "\n")' | \
    chemfp translate -T rdkit --in fasta -U cdk --out smi \
    -W flavor=Default,-Isomeric | fold | head -2
O=C(O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(NC(=O)C(
The flavor set with -W here is the same flavor you pass into SmilesGenerator().
Cheers,
Andrew
[email protected]
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user