On 10/6/10 9:58 AM, Tim Vandermeersch wrote:
>> I was looking through the new canon.cpp code, and it looks like a
>>  huge improvement over the original, and the tests we've been
>>  running are confirming this.  This is a big step forward.
>>
>> But I noticed a rather alarming lack of comments in the new code.
>>  There are some unfinished general comments towards the end of the
>>  file, but the various data structs and methods aren't explained,
>>  and I couldn't find anything about the overall algorithms you've 
>> implemented.
>
> Yes, I'll update the docs tonight.

Great, I'm looking forward to digging into the new canonicalizer to see how it 
works.

>> The reason this came up was because of the "c1(ccccc1)O" or "ugly SMILES" 
>> bug.  I was hoping to be able to look at your new code and make some 
>> suggestions, but after an hour or two of scratching my head I was still 
>> lost.  I hope you'll have some time to write down all of the knowledge you 
>> put into the new canon.cpp code.
>
> I think I know how to make the smiles look better again using your
> suggestions. Right now, the highest symmetry class is used to start
> labeling which tends to be ring atoms, metals, ... Once I commit the
> docs, I'll try to modify the code and post the resulting smiles.

Sounds good.  I think all that's needed is to change how you weight the rules 
you use to assign the initial graph invariants so that

   - terminal atoms (single bond)
   - long chains
   - low molecular weight
   - no charge

tend to get the lowest canonical labels.  That would mean

   better: Oc1ccccc1
   worse:  c1(O)ccccc1

   better: Oc1ccccc1[O-]
   worse:  [O-]c1ccccc1O

   better: OCCC(O)C
   worse:  O(CCCO)C

After looking through the smilesformat.cpp again, I'm pretty sure that all 
canon.cpp needs to do is ensure that the lowest-numbered canonical atom is a 
good one to start on.
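
To make the idea concrete, here is a rough sketch of how such a weighting 
could pick the starting atom.  It is not the canon.cpp code: AtomInfo, 
startKey and pickStartAtom are made-up names, atomic number stands in for 
"low molecular weight", and the "long chains" rule is left out to keep it 
short.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical per-atom summary for illustration; not an Open Babel type.
struct AtomInfo {
    int heavyDegree;   // number of attached non-hydrogen atoms
    int formalCharge;  // 0 for neutral atoms
    int atomicNum;     // assumed stand-in for "low molecular weight"
};

// Smaller keys mean a better starting atom. Tuples compare
// lexicographically: terminal first, then uncharged, then lighter.
std::tuple<int, int, int> startKey(const AtomInfo& a) {
    int notTerminal = (a.heavyDegree == 1) ? 0 : 1;
    int charged = (a.formalCharge == 0) ? 0 : 1;
    return std::make_tuple(notTerminal, charged, a.atomicNum);
}

// Returns the index of the atom with the smallest key
// (the first such atom if several tie).
std::size_t pickStartAtom(const std::vector<AtomInfo>& atoms) {
    assert(!atoms.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < atoms.size(); ++i) {
        if (startKey(atoms[i]) < startKey(atoms[best])) {
            best = i;
        }
    }
    return best;
}
```

With the atoms of c1(ccccc1)O in input order, this picks the oxygen, and 
for [O-]c1ccccc1O it picks the neutral oxygen rather than the charged one.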

After that, the smilesformat.cpp code takes care of choosing a good path.  For 
example, when it hits a ring, it will select a single bond for the ring closure 
over a double or triple, even if the double- or triple-bonded atom has a lower 
canonical label.  That way, it favors C1=CCCCC1 over C=1CCCCC1.
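
The ring-closure preference described above can be sketched roughly as 
follows.  This is only an illustration of the rule, not the actual 
smilesformat.cpp code; ClosureCandidate and pickRingClosure are invented 
names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical description of one bond that could be used as the ring
// closure at the current atom; not an Open Babel type.
struct ClosureCandidate {
    int bondOrder;        // 1 = single, 2 = double, 3 = triple
    unsigned canonLabel;  // canonical label of the atom at the other end
};

// Prefer the lowest bond order; use the canonical label only to break
// ties between bonds of the same order.
std::size_t pickRingClosure(const std::vector<ClosureCandidate>& bonds) {
    assert(!bonds.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < bonds.size(); ++i) {
        std::pair<int, unsigned> cur(bonds[i].bondOrder, bonds[i].canonLabel);
        std::pair<int, unsigned> top(bonds[best].bondOrder, bonds[best].canonLabel);
        if (cur < top) {
            best = i;
        }
    }
    return best;
}
```

Given a double bond to the atom labeled 1 and a single bond to the atom 
labeled 5, this selects the single bond, which corresponds to writing 
C1=CCCCC1 rather than C=1CCCCC1.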

So smilesformat.cpp does most of the "beautification." The main trick in 
canon.cpp is to give it a good starting place.

> This
> would change the smiles again but perhaps we should just declare
> canonical smiles stable when we release 2.3?

Yes, that's a good plan.

> I didn't do very much the last 3 days since I was sick but I should be
> able to get some work done again.

I hope you feel better soon.

Craig

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
