On Aug 21, 2019, at 03:42, Francois Berenger <mli...@ligand.eu> wrote:
> Unless rdkit has something, I think graph edit distance is the kind
> of things for which you have to rely on a good graph library.

Do you know of any (non-chemical) graph library which can handle edits 
involving the breaking of aromatic bonds in a chemically correct way? I do not.

> Also, maybe the string edit distance between the two canonical smiles is a 
> good enough proxy.

My current experiment with graph edit distance came out of a conversation I 
had last week with someone using string edit distance. I expressed doubt about 
how "good" the "good enough" was, but was unable to give any concrete details.

I earlier wrote:
>> For chain bonds, and non-aromatic bonds, it's easy to delete the bond
>> and add the correct number of hydrogens to either side.

Similarly, for many chain edits, the string edit distance is a decent proxy, as 
you say.

However, has the quality of that proxy ever been characterized, along with a 
description of how to minimize the problems with string edit distance? Some of 
the obvious ones are:

1) Chirality and stereochemistry

D-alanine and L-alanine have string edit distances of 4 and 5, respectively, to 
alanine with unspecified chirality.

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond 
stereochemistry, like F/C=C/F vs. FC=CF.
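
The numbers above can be reproduced with a plain character-level Levenshtein 
distance. This is only a minimal sketch: the levenshtein() helper is my own 
illustration (not an RDKit function), and it assumes unit cost for every 
insertion, deletion, and substitution.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance with unit-cost insert/delete/substitute."""
    prev = list(range(len(b) + 1))      # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]

unspecified = "NC(C)C(=O)O"
print(levenshtein("N[C@H](C)C(=O)O", unspecified))   # 4
print(levenshtein("N[C@@H](C)C(=O)O", unspecified))  # 5
print(levenshtein("F/C=C/F", "FC=CF"))               # 2
```

The extra unit for "@@" over "@" is purely a notational artifact; both 
differ from the unspecified form by the same single stereo annotation.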

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

c1ccccc1C and c1ccccc1Cl have an edit distance of 1
c1ccccc1C and c1ccccc1Br have an edit distance of 2

It makes no chemical sense for those to have different distances.

I can think of ways to mitigate some of the effects of #1-3. In particular, a 
substitution matrix (or conversion to pharmacophore reduced graphs) can improve 
#3.
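
As a rough sketch of the substitution-matrix idea for #3: tokenize the SMILES 
so that "Cl" and "Br" are single symbols, then run a weighted edit distance 
over tokens. Everything here is an assumption for illustration: the tokenizer 
regex is not a full SMILES grammar, and the 0.5 halogen-swap cost is an 
arbitrary value, not a characterized one.

```python
import re

# Assumed tokenizer: bracket atoms, two-letter halogens, %nn ring closures,
# then any other single character. Not a complete SMILES grammar.
TOKEN = re.compile(r"\[[^\]]*\]|Br|Cl|%\d{2}|.")

HALOGENS = {"F", "Cl", "Br", "I"}

def tokenize(smiles):
    return TOKEN.findall(smiles)

def sub_cost(x, y):
    """Hypothetical substitution costs; the 0.5 is an arbitrary choice."""
    if x == y:
        return 0.0
    if x in HALOGENS and y in HALOGENS:
        return 0.5
    return 1.0

def weighted_distance(a, b):
    """Token-level edit distance with unit indels and sub_cost substitutions."""
    s, t = tokenize(a), tokenize(b)
    prev = [float(j) for j in range(len(t) + 1)]
    for i, x in enumerate(s, start=1):
        curr = [float(i)]
        for j, y in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                 # delete x
                curr[j - 1] + 1.0,             # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute or match
            ))
        prev = curr
    return prev[-1]

print(weighted_distance("c1ccccc1C", "c1ccccc1Cl"))   # 1.0
print(weighted_distance("c1ccccc1C", "c1ccccc1Br"))   # 1.0
print(weighted_distance("c1ccccc1Cl", "c1ccccc1Br"))  # 0.5
```

With tokens instead of characters, toluene is equally far from chlorobenzene 
and bromobenzene. This does nothing for #4 or #5.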

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures have a 
string edit distance of either 1 or 2, while the graph edit distance is 1.

>>> Chem.CanonSmiles("PCCN")
'NCCP'
>>> Chem.CanonSmiles("CCN")
'CCN'


5) Difficulty in handling ring formation in a meaningful way

>>> Chem.CanonSmiles("C1=CC=CC=C1")
'c1ccccc1'
>>> Chem.CanonSmiles("C=CC=CC=C")
'C=CC=CC=C'

There are no shared string symbols, so the string edit distance is 9, yet the 
bond edit distance is only 1.
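
To make the "bond edit distance is only 1" concrete, here is a toy sketch that 
compares the two structures as sets of bonds. It assumes the Kekule form of 
benzene, ignores implicit hydrogens, and uses a fixed atom mapping (atom i 
maps to atom i); a real graph edit distance would have to minimize over 
mappings. The bond() and bond_edits() helpers are hypothetical, not RDKit 
calls.

```python
def bond(i, j, order):
    """Undirected bond as (low atom index, high atom index, bond order)."""
    return (min(i, j), max(i, j), order)

# C1=CC=CC=C1 (Kekule benzene), atoms numbered 0..5 in SMILES order
benzene = {bond(0, 1, 2), bond(1, 2, 1), bond(2, 3, 2),
           bond(3, 4, 1), bond(4, 5, 2), bond(5, 0, 1)}

# C=CC=CC=C (1,3,5-hexatriene), same atom numbering
hexatriene = {bond(0, 1, 2), bond(1, 2, 1), bond(2, 3, 2),
              bond(3, 4, 1), bond(4, 5, 2)}

def bond_edits(a, b):
    """Count bonds added, removed, or changed in order (fixed atom mapping)."""
    orders_a = {(i, j): order for i, j, order in a}
    orders_b = {(i, j): order for i, j, order in b}
    return sum(
        1
        for pair in orders_a.keys() | orders_b.keys()
        if orders_a.get(pair) != orders_b.get(pair)
    )

print(bond_edits(benzene, hexatriene))  # 1
```

This only works because the Kekule form was written by hand. It sidesteps, 
rather than answers, the question of what to do once the ring is perceived as 
aromatic.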

It is this last issue that I am particularly concerned with, leading me to ask 
about how to handle aromatic bonds when computing the graph edit distance.


                                Andrew
                                da...@dalkescientific.com



