On 10/6/10 9:58 AM, Tim Vandermeersch wrote:
>> I was looking through the new canon.cpp code, and it looks like a
>> huge improvement over the original, and the tests we've been
>> running are confirming this. This is a big step forward.
>>
>> But I noticed a rather alarming lack of comments in the new code.
>> There are some unfinished general comments towards the end of the
>> file, but the various data structs and methods aren't explained,
>> and I couldn't find anything about the overall algorithms you've
>> implemented.
>
> Yes, I'll update the docs tonight.

Great, I'm looking forward to digging into the new canonicalizer to see
how it works.

>> The reason this came up was because of the "c1(ccccc1)O" or "ugly SMILES"
>> bug. I was hoping to be able to look at your new code and make some
>> suggestions, but after an hour or two of scratching my head I was still
>> lost. I hope you'll have some time to write down all of the knowledge you
>> put into the new canon.cpp code.
>
> I think I know how to make the smiles look better again using your
> suggestions. Right now, the highest symmetry class is used to start
> labeling which tends to be ring atoms, metals, ... Once I commit the
> docs, I'll try modify the code and post the resulting smiles.

Sounds good. I think all that's needed is to change how you weight the
rules you use to assign the initial graph invariants so that

  - terminal atoms (single bond)
  - long chains
  - low mwt
  - no charge

tend to get the lowest canonical ordering. That would mean

  better: Oc1ccccc1
  worse:  c1(O)ccccc1

  better: Oc1ccccc1[O-]
  worse:  [O-]c1ccccc1O

  better: OCCC(O)C
  worse:  O(CCCO)C

After looking through smilesformat.cpp again, I'm pretty sure that
all canon.cpp needs to do is ensure that the lowest-numbered canonical
atom is a good one to start on. After that, the smilesformat.cpp code
takes care of choosing a good path. For example, when it hits a ring, it
will select a single bond for the ring closure over a double or triple,
even if the double- or triple-bonded atom has a lower canonical number.
That way, it favors C1=CCCCC1 over C=1CCCCC1.

So smilesformat.cpp does most of the "beautification." The main trick in
canon.cpp is to give it a good starting place.

> This
> would change the smiles again but perhaps we should just declare
> canonical smiles stable when we release 2.3?

Yes, that's a good plan.

> I didn't do very much the last 3 days since I was sick but I should be
> able to get some work done again.

I hope you feel better soon.

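A rough sketch of the start-atom idea above, only to make the rule
weighting concrete. It is not the canon.cpp code. AtomInfo, StartKey,
PickStartAtom and CatecholateExample are hypothetical names, and the rule
order (terminal first, then uncharged, then lighter atom) is an
assumption. "Low mwt" is read here as atomic mass, which is also an
assumption, and the "long chains" rule is left out because it needs a
graph traversal.

```cpp
#include <algorithm>
#include <tuple>
#include <vector>

// Hypothetical per-atom summary; not an Open Babel type.
struct AtomInfo {
  int index;         // position of the atom in the molecule
  int heavyDegree;   // number of bonded non-hydrogen neighbors
  int formalCharge;  // 0 for neutral atoms
  double mass;       // atomic mass
};

// Lexicographic key; the smallest key marks the preferred start atom.
// Assumed rule order: terminal, then uncharged, then lighter, then index.
static std::tuple<int, int, double, int> StartKey(const AtomInfo &a) {
  return std::make_tuple(a.heavyDegree == 1 ? 0 : 1,
                         a.formalCharge != 0 ? 1 : 0,
                         a.mass,
                         a.index);
}

// Returns the index of the preferred start atom, or -1 for an empty list.
int PickStartAtom(const std::vector<AtomInfo> &atoms) {
  if (atoms.empty()) return -1;
  auto best = std::min_element(
      atoms.begin(), atoms.end(),
      [](const AtomInfo &a, const AtomInfo &b) {
        return StartKey(a) < StartKey(b);
      });
  return best->index;
}

// Heavy atoms of Oc1ccccc1[O-], listed in the order they are written.
std::vector<AtomInfo> CatecholateExample() {
  return {
      {0, 1, 0, 15.999},   // neutral hydroxyl oxygen, terminal
      {1, 3, 0, 12.011},   // ring carbon bearing the OH
      {2, 2, 0, 12.011},
      {3, 2, 0, 12.011},
      {4, 2, 0, 12.011},
      {5, 2, 0, 12.011},
      {6, 3, 0, 12.011},   // ring carbon bearing the O-
      {7, 1, -1, 15.999},  // charged oxygen, terminal
  };
}
```

With these assumptions, PickStartAtom(CatecholateExample()) selects atom 0,
the neutral oxygen, which matches the "better" form Oc1ccccc1[O-] above.
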
Craig