On Mon, Nov 28, 2016 at 11:31 AM, Christos Kannas <chriskan...@gmail.com> wrote:
> I think it would be better to use a similarity metric based on fingerprints.

Hi Christos,

Fingerprints will only work if the fingerprint method you use captures all of the salient information you're interested in. For example, most fingerprint methods in use have spotty or non-existent encoding of chirality, so if you want to consider two enantiomers to be different, fingerprint similarity will not work for you (unless you pick a fingerprint method that encodes the particular chirality information you're interested in). E.g.

    >>> from rdkit import Chem
    >>> from rdkit.DataStructs import FingerprintSimilarity
    >>> from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
    >>> m1 = Chem.MolFromSmiles("CC1=CC[C@](Cl)(CC1)C(=C)C")
    >>> m2 = Chem.MolFromSmiles("CC1=CC[C@@](Cl)(CC1)C(=C)C")
    >>> FingerprintSimilarity(FingerprintMol(m1),FingerprintMol(m2))
    1.0

Certain regioisomers can fool a fingerprint-based method too:

    >>> m1 = Chem.MolFromSmiles("N(CCCCCCC[Br])CCCCCCCCO")
    >>> m2 = Chem.MolFromSmiles("N(CCCCCCCO)CCCCCCCC[Br]")
    >>> FingerprintSimilarity(FingerprintMol(m1),FingerprintMol(m2))
    1.0

(The Br and OH ends swap between the 7-carbon and the 8-carbon aliphatic chain.)

I agree with Rajarshi that a SMILES-based approach will probably work, if you make sure you properly canonicalize the SMILES. The default RDKit SMILES output should work for most molecules. RDKit will canonicalize the SMILES by default (though keep in mind that different programs have different SMILES canonicalization routines, so only compare RDKit canonical SMILES with other RDKit canonical SMILES). Also, RDKit normally removes hydrogens on structures it reads in, so passing the molecule through RDKit will give you a SMILES without (non-critical) hydrogens. By default it will also output things labeled aromatically, so you don't have to worry about Kekulization differences.
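The canonical-SMILES comparison described above could be sketched roughly as follows. This is only a minimal illustration, not a vetted identity check: the helper names canonical_smiles and same_molecule, and the keep_stereo argument, are made up for this example. isomericSmiles is passed explicitly so that stereochemistry is kept or dropped on purpose rather than depending on the default of the installed RDKit version.

```python
from rdkit import Chem


def canonical_smiles(smiles, keep_stereo=True):
    """Return RDKit's canonical SMILES for the input, or None if parsing fails.

    Hypothetical helper for illustration only.
    """
    # MolFromSmiles sanitizes the structure and strips non-critical hydrogens;
    # it returns None when the input cannot be parsed or sanitized.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # MolToSmiles writes canonical, aromatic SMILES by default.
    return Chem.MolToSmiles(mol, isomericSmiles=keep_stereo)


def same_molecule(smiles_a, smiles_b, keep_stereo=True):
    """Compare two structures through their RDKit canonical SMILES."""
    a = canonical_smiles(smiles_a, keep_stereo)
    b = canonical_smiles(smiles_b, keep_stereo)
    return a is not None and a == b


if __name__ == "__main__":
    # Kekule form, aromatic form and explicit-hydrogen form of phenol
    print(same_molecule("C1=CC=CC=C1O", "Oc1ccccc1"))
    print(same_molecule("[H]Oc1ccccc1", "Oc1ccccc1"))
    # The regioisomer pair that the fingerprint comparison could not separate
    print(same_molecule("N(CCCCCCC[Br])CCCCCCCCO", "N(CCCCCCCO)CCCCCCCC[Br]"))
```

Only strings produced by the same toolkit are compared here, since canonical SMILES from different programs are not interchangeable. Tautomers and protonation states are not reconciled by this sketch.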

If you care about stereoisomer differences, the one thing you probably will want to change from the defaults is to add "isomericSmiles=True" to the calls to MolToSmiles(), otherwise you'll lose the chirality information when you write out your SMILES.

Tautomers and charged forms are going to be the big drawback here. Especially with things like imidazole-like rings, RDKit can be particular with hydrogen tautomerization, considering them to be different molecules.

    >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)cn1"))
    # Doesn't work: Sanitization error
    >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)cn1"))
    'Clc1cnc[nH]1'
    >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1nc(Cl)c[nH]1"))
    'Clc1c[nH]cn1'
    >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH]c(Cl)c[nH+]1"))
    'Clc1c[nH+]c[nH]1'
    >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1[nH+]c(Cl)c[nH]1"))
    'Clc1c[nH]c[nH+]1'

That difference stays even after attempting to remove hydrogens from the molecule.

Regards,
-Rocco