On Aug 21, 2019, at 03:42, Francois Berenger <mli...@ligand.eu> wrote:
> Unless rdkit has something, I think graph edit distance is the kind
> of things for which you have to rely on a good graph library.

Do you know of any (non-chemical) graph library which can handle edits
involving the breaking of aromatic bonds in a chemically correct way?
I do not.

> Also, maybe the string edit distance between the two canonical smiles is a
> good enough proxy.

My current attempt to experiment with graph edit distance came out of
a conversation I had last week with someone using string edit distance.
I expressed doubt about how "good" the "good enough" was, but was unable
to give any concrete details.

I earlier wrote:
>> For chain bonds, and non-aromatic bonds, it's easy to delete the bond
>> and add the correct number of hydrogens to either side.

Similarly, for many chain edits, the string edit distance is a decent
proxy, as you say. However, has the goodness ever been characterized,
along with a description of how to minimize the problems with string
edit distance?

Some of the obvious ones are:

1) Chirality and stereochemistry

D-alanine and L-alanine have string edit distances of 4 and 5,
respectively, to alanine with unspecified chirality.

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond
stereochemistry, like F/C=C/F vs. FC=CF.

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

  c1ccccc1C and c1ccccc1Cl have an edit distance of 1
  c1ccccc1C and c1ccccc1Br have an edit distance of 2

There is no chemical sense for those to have different distances.

I can think of ways to mitigate some of the effects of #1-3. In particular,
a substitution matrix (or conversion to pharmacophore reduced graphs)
can improve #3.

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures
have a string edit distance of either 1 or 4, while the graph edit
distance is 1.

  >>> Chem.CanonSmiles("PCCN")
  'NCCP'
  >>> Chem.CanonSmiles("CCN")
  'CCN'

5) Difficulty in handling ring formation in a meaningful way

  >>> Chem.CanonSmiles("C1=CC=CC=C1")
  'c1ccccc1'
  >>> Chem.CanonSmiles("C=CC=CC=C")
  'C=CC=CC=C'

There are no shared string symbols, so the string edit distance is 9,
yet the bond edit distance is only 1.

It is this last issue that I am particularly concerned with, leading
me to ask about how to handle aromatic bonds when computing the graph
edit distance.

				Andrew
				da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
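
For reference, the string edit distances quoted for the chirality,
element-symbol, and ring examples can be reproduced with a few lines of
plain Python. This is only a minimal sketch: it assumes unit-cost
Levenshtein distance over raw SMILES characters, and the levenshtein()
helper is a hypothetical illustration, not an RDKit function. A tokenizer
that treats "Cl" or "Br" as single symbols, or a substitution matrix,
would give different numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost string edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# SMILES pairs taken from the examples in this message.
pairs = [
    ("N[C@H](C)C(=O)O", "NC(C)C(=O)O"),   # chirality, one '@'
    ("N[C@@H](C)C(=O)O", "NC(C)C(=O)O"),  # chirality, two '@'
    ("c1ccccc1C", "c1ccccc1Cl"),          # methyl vs. chloro
    ("c1ccccc1C", "c1ccccc1Br"),          # methyl vs. bromo
    ("c1ccccc1", "C=CC=CC=C"),            # ring closure and aromaticity
]

distances = [levenshtein(x, y) for x, y in pairs]

for (x, y), d in zip(pairs, distances):
    print(f"{x:>18}  {y:<12}  {d}")
```

On these inputs the character-level distances come out as 4, 5, 1, 2,
and 9, matching the figures given for the chirality, element-symbol, and
ring examples.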