On Mon, Nov 28, 2016 at 11:31 AM, Christos Kannas <chriskan...@gmail.com>
wrote:

> I think it would be better to use a similarity metric based on fingerprints.

Hi Christos,

Fingerprints will only work if the fingerprint method you use captures all
of the salient information you're interested in. For example, most
fingerprint methods in common use have spotty or non-existent encoding of
chirality, so if you want to consider two enantiomers to be different,
fingerprint similarity will not work for you (unless you pick a fingerprint
method that encodes the particular chirality information you're interested
in).

E.g.

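>>> from rdkit import Chem
>>> from rdkit.DataStructs import FingerprintSimilarity
>>> from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
>>> m1 = Chem.MolFromSmiles("CC1=CC[C@](Cl)(CC1)C(=C)C")
>>> m2 = Chem.MolFromSmiles("CC1=CC[C@@](Cl)(CC1)C(=C)C")
>>> FingerprintSimilarity(FingerprintMol(m1), FingerprintMol(m2))
1.0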
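
As a rough illustration of the caveat about picking a fingerprint that does
encode chirality, the sketch below compares the same two enantiomers with
Morgan fingerprints, with and without useChirality=True. The choice of Morgan
fingerprints, the radius of 2 and the 2048-bit length are assumptions made
for this example, not something from this thread, and newer RDKit releases
may print a deprecation warning for this call.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

m1 = Chem.MolFromSmiles("CC1=CC[C@](Cl)(CC1)C(=C)C")
m2 = Chem.MolFromSmiles("CC1=CC[C@@](Cl)(CC1)C(=C)C")


def morgan_similarity(mol_a, mol_b, use_chirality):
    """Tanimoto similarity of Morgan bit vectors (radius 2, 2048 bits: assumed settings)."""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(
        mol_a, 2, nBits=2048, useChirality=use_chirality)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(
        mol_b, 2, nBits=2048, useChirality=use_chirality)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)


# Without chirality the enantiomers are indistinguishable
sim_plain = morgan_similarity(m1, m2, use_chirality=False)
# With chirality the stereocenter contributes to the atom environments
sim_chiral = morgan_similarity(m1, m2, use_chirality=True)
print(sim_plain, sim_chiral)
```

The chirality-aware similarity should come out below 1.0, though how far
below depends on the radius and on how much of the molecule sits near the
stereocenter.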

Even regioisomers can fool a fingerprint-based method in some cases:

>>> m1 = Chem.MolFromSmiles("N(CCCCCCC[Br])CCCCCCCCO")
>>> m2 = Chem.MolFromSmiles("N(CCCCCCCO)CCCCCCCC[Br]")
>>> FingerprintSimilarity(FingerprintMol(m1), FingerprintMol(m2))
1.0

(The two aliphatic chains have 7 and 8 carbons; the Br and OH end groups
swap between them.)

I agree with Rajarshi that a SMILES-based approach will probably work, if
you make sure you properly canonicalize the SMILES.

The default RDKit SMILES output should work for most molecules. RDKit will
canonicalize the SMILES by default (though keep in mind different programs
have different SMILES canonicalization routines, so only compare RDKit
canonical SMILES with other RDKit canonical SMILES). Also, RDKit normally
removes hydrogens on structures it reads in, so passing the molecule
through RDKit will give you a SMILES without (non-critical) hydrogens. By
default it will also output things labeled aromatically, so you don't have
to worry about Kekulization differences.
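
A minimal sketch of those three points (atom ordering, hydrogens and
aromaticity), assuming default settings throughout. The helper name
canonical() and the test molecules are made up for this example.

```python
from rdkit import Chem


def canonical(smiles):
    """Parse a SMILES string and write it back with RDKit's default canonicalization."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse {smiles!r}")
    return Chem.MolToSmiles(mol)


# Ethanol written starting from different atoms
ethanol = {canonical(s) for s in ("OCC", "CCO", "C(O)C")}
# Benzene written in Kekule and in aromatic form
benzene = {canonical(s) for s in ("C1=CC=CC=C1", "c1ccccc1")}
# Methanol with and without explicit hydrogens
methanol = {canonical(s) for s in ("[H]C([H])([H])O[H]", "CO")}

# Each set should collapse to a single canonical string
print(ethanol, benzene, methanol)
```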

If you care about stereoisomer differences, the one thing you probably
will want to change from the defaults is to add "isomericSmiles=True" to
the calls to MolToSmiles(), otherwise you'll lose the chirality information
when you write out your SMILES.
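
Put together, an identity check along these lines might look like the sketch
below. The function name same_molecule() is hypothetical, and the SMILES
strings are built with string repetition only to make the chain lengths easy
to read. Newer RDKit releases reportedly default isomericSmiles to True, in
which case passing it explicitly is harmless.

```python
from rdkit import Chem


def same_molecule(smiles_a, smiles_b):
    """Compare two structures via RDKit canonical SMILES, keeping stereo information."""
    mol_a = Chem.MolFromSmiles(smiles_a)
    mol_b = Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        raise ValueError("could not parse one of the input SMILES")
    return (Chem.MolToSmiles(mol_a, isomericSmiles=True)
            == Chem.MolToSmiles(mol_b, isomericSmiles=True))


# The two enantiomers that a chirality-blind fingerprint scores as identical
enantiomers_equal = same_molecule("CC1=CC[C@](Cl)(CC1)C(=C)C",
                                  "CC1=CC[C@@](Cl)(CC1)C(=C)C")

# Regioisomers: Br and OH swapped between the 7- and 8-carbon chains
regio_a = "N(" + "C" * 7 + "Br)" + "C" * 8 + "O"
regio_b = "N(" + "C" * 7 + "O)" + "C" * 8 + "Br"
regioisomers_equal = same_molecule(regio_a, regio_b)

# The first regioisomer again, written starting from the oxygen
regio_a_reordered = "O" + "C" * 8 + "N" + "C" * 7 + "Br"
reordered_equal = same_molecule(regio_a, regio_a_reordered)

print(enantiomers_equal, regioisomers_equal, reordered_equal)
```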

Tautomers and charged forms are going to be the big drawback here.
Especially with things like imidazole-like rings, RDKit can be particular
about where the hydrogens sit, treating different tautomers as different
molecules.

>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)cn1"))
# Doesn't work: Sanitization error
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)cn1"))
'Clc1cnc[nH]1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)c[nH]1"))
'Clc1c[nH]cn1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)c[nH+]1"))
'Clc1c[nH+]c[nH]1'
>>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH+]c(Cl)c[nH]1"))
'Clc1c[nH]c[nH+]1'

That difference stays even after attempting to remove hydrogens from the
molecule.
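
One possible way around the tautomer part of the problem, sketched here as
an assumption rather than anything proposed in this thread, is to map each
molecule to a canonical tautomer before writing the SMILES. This relies on
the rdMolStandardize module, which is only available in later RDKit releases
(2020.03 or newer, as far as I know), and it does not address the charged
forms. The helper name tautomer_canonical_smiles() is made up for the
example.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def tautomer_canonical_smiles(smiles):
    """Canonical SMILES of RDKit's canonical tautomer for the input structure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse {smiles!r}")
    enumerator = rdMolStandardize.TautomerEnumerator()
    # Canonicalize() enumerates tautomers and returns the highest-scoring one
    return Chem.MolToSmiles(enumerator.Canonicalize(mol))


# The two neutral chloroimidazole tautomers from the example above
taut_a = tautomer_canonical_smiles("c1[nH]c(Cl)cn1")
taut_b = tautomer_canonical_smiles("c1nc(Cl)c[nH]1")
print(taut_a, taut_b)
```

The canonical tautomer is just a consistent representative chosen by a
scoring scheme, not necessarily the form that dominates in solution, so it
is best treated as a comparison key only.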

Regards,
-Rocco