Re: [OpenBabel-Devel] SMILES bug

Tim Vandermeersch Thu, 07 Oct 2010 18:06:57 -0700

Hi,

On Thu, Oct 7, 2010 at 12:25 AM, Craig A. James <cja...@emolecules.com> wrote:
> On 10/6/10 9:58 AM, Tim Vandermeersch wrote:
>>>
>>> I was looking through the new canon.cpp code, and it looks like a
>>>  huge improvement over the original, and the tests we've been
>>>  running are confirming this.  This is a big step forward.
>>>
>>> But I noticed a rather alarming lack of comments in the new code.
>>>  There are some unfinished general comments towards the end of the
>>>  file, but the various data structs and methods aren't explained,
>>>  and I couldn't find anything about the overall algorithms you've
>>> implemented.
>>
>> Yes, I'll update the docs tonight.
>
> Great, I'm looking forward to digging in to the new canonicalizer to see how
> it works.


The documentation is updated. All the key elements needed to reproduce
the canonical code are documented I think. The algorithm is very
similar to the original Morgan canonical coding algorithm which
follows the graph invariants atom partitioning step (Extended
Connectivity). This is documented in books that are available on
google books. We also implement most "extensions" including improved
initial graph invariants, faster converging iteration and
stereochemistry.

The canonical labeling part of the algorithm as described by Morgan is
slow since it has to visit each node in the search tree. A simple
optimization (1 in the docs) improves the performance significantly
but it also changes the topology of the search tree (i.e. we don't
detect redundancy but actually make the tree smaller). This is not
really a problem since the end results are still canonical but not
implementing this would lead to different results. Optimization 2 in
the docs is a more general version of 1 but not implemented yet.
(Should I implement this?) I've documented this in case we ever want
to have a canonical OpenSmiles specification although this would also
have to specify parts implemented in smilesformat.cpp.

Another approach to optimizing this kind of algorithm is used by
nauty, saucy and bliss. All these optimizations identify redundant
parts of the search tree to prune. All of these optimizations (there
are several) can be implemented without changing the final result.
Most of these papers are not "easy" to read though. The nauty paper is
referenced in the doc with download link. The good news is that these
optimizations to detect redundant parts nicely complement the smaller
tree optimizations. Implementing the nauty optimizations will probably
be something for 2.3.1.

>>> The reason this came up was because of the "c1(ccccc1)O" or "ugly SMILES"
>>> bug.  I was hoping to be able to look at your new code and make some
>>> suggestions, but after an hour or two of scratching my head I was still
>>> lost.  I hope you'll have some time to write down all of the knowledge you
>>> put into the new canon.cpp code.
>>
>> I think I know how to make the smiles look better again using your
>> suggestions. Right now, the highest symmetry class is used to start
>> labeling which tends to be ring atoms, metals, ... Once I commit the
>> docs, I'll try modify the code and post the resulting smiles.
>
> Sounds good.  I think all that's needed is to change how you weight the
> rules you use to assign the initial graph-invarients so that
>
>  - terminal atoms (single bond)
>  - long chains
>  - low mwt
>  - no charge
>
> tend to get the lowest canonical ordering.  That would mean
>
>  better: Oc1ccccc1
>  worse:  c1(O)cccc1
>
>  better: Oc1ccccc1[O-]
>  worse:  [O-]c1ccccc1O
>
>  better: OCCC(O)C
>  worse:  O(CCCO)C
>
> After looking through the smilesformat.cpp again, I'm pretty sure that all
> canon.cpp needs to do is ensure that the lowest-numbered canonical atom is a
> good one to start on.
>
> After that, the smilesformat.cpp code takes care of choosing a good path.
>  For example, when it hits a ring, it will select a single bond for the ring
> closure over a double or triple, even if the canonical label of the double-
> or triple-bonded atom has a lower canonical numbering.  That way, it favors
> C1=CCCCC1 over C=1CCCCC1.
>
> So smilesformat.cpp does most of the "beautification." The main trick in
> canon.cpp is to give it a good starting place.

There is now a findStartSymmetryClass function where these criteria
can be added. Two minor changes (start with lowest symmetry class, no
formal charge) changes these:

c1(ccccc1)O     
c1(c(cccc1)O)[O-]       
C(CCO)(O)C      

to:

Oc1ccccc1       
Oc1c(cccc1)[O-] 
OCCC(O)C        

These changes are be uncommented in the code.

>> This
>> would change the smiles again but perhaps we should just declare
>> canonical smiles stable when we release 2.3?
>
> Yes, that's a good plan.
>
>> I didn't do very much the last 3 days since I was sick but I should be
>> able to get some work done again.
>
> I hope you feel better soon.

Thanks, it's not very serious though :)

Tim

> Craig
>

------------------------------------------------------------------------------
Beautiful is writing same markup. Internet Explorer 9 supports
standards for HTML5, CSS3, SVG 1.1,  ECMAScript5, and DOM L2 & L3.
Spend less time writing and  rewriting code and more time creating great
experiences on the web. Be a part of the beta today.
http://p.sf.net/sfu/beautyoftheweb
_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel

Re: [OpenBabel-Devel] SMILES bug

Reply via email to