On Tue, Oct 5, 2010 at 2:21 AM, Craig A. James <cja...@emolecules.com> wrote:
> On 10/4/10 2:10 PM, Noel O'Boyle wrote:
>> Hello all,
>>
>> Back on the 19/03/2009 I emailed to this list with the subject
>> "Canonical SMILES performance" about a test set of around 18000
>> PubChem 3D structures. I did the following analysis:
>> (1) sdf -> can
>> (2) sdf -> smi -> can
>> (3) diff of (1) and (2)
>>
>> At that time, we had 1424 failures (8%), which wasn't great. According
>> to a later email, the 22x branch finished with 190 failures.
>>
>> I've just redone the analysis - the download from PubChem has changed,
>> but still has 18000 or so molecules
>> (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/SDF/Conformers_00000001_00025000.sdf.gz)
>>
>> Now we have only 5 failures. Pretty good by any measure.
>>
>> (There were two canonicalisation timeouts...I think we should add an
>> option either to obabel, or to the canonical format, to set the
>> timeout.)
>>
>> obabel failures.sdf -ocan -O sdf_to_can.txt
>> obabel failures.sdf -osmi -O sdf_to_smi.txt
>> obabel -ismi sdf_to_can.txt -ocan smi_to_can.txt
>> diff sdf_to_can.txt sdf_to_smi.txt
>>
>> < c12=NCCN=c1ncnc2 167
>> < N12CC[C@@H](CC1)CC2 7527
>> < c12c3c(cc4c1c1c(nn2)c2c(cc1cc4)cccc2)cccc3 9107
>> < c12c(c(c[nH]1)C[C@@h]1n3c...@h](C1)CC3)cccc2 21918
>> < c\1(=c/2\[n+](=O)cccc2)/n(cccc1)[O-] 23699
>> ---
>> > C12=NCCN=C1NCNC2 167
>> > n12c...@h](CC1)CC2 7527
>> > c12c3c(cc4c1c1c([nH][nH]2)c2c(cc1cc4)cccc2)cccc3 9107
>> > c12c(c(c[nH]1)C[C@@H]1N3CC[C@@H](C1)CC3)cccc2 21918
>> > C1(C2[N+](=O)CCCC2)N(CCCC1)[O-] 23699
>>
>> I make it two kekulization problems and two canonicalisation problems
>> (both the same substructure). The fifth structure (23699) is a tough
>> one.
>
> I'm running about 1.2 million structures through the canonicalizer (it's
> going to take a while even on 6 CPUs!). After about 50,000 structures, I
> found just one error, which is quite remarkable.
>
> Here is the SMILES -- both of these are correct, and both are the same
> molecule:
>
> C(C1C=CCCC1)(C(=O)N/N=C/c1ccc(cc1)C#N)C(=O)N/N=C\c1ccc(cc1)C#N
> C(C1C=CCCC1)(C(=O)N/N=C\c1ccc(cc1)C#N)C(=O)N/N=C/c1ccc(cc1)C#N
>
> http://www.emolecules.com/image?db=549&id=1127873&width=500&height=500
>
> This is an interesting case because it's new -- OB now supports cis/trans N
> correctly, and the two halves of the molecule are identical except for the
> cis/trans difference.

It was a simple bug in a sorting step somewhere. Fixed in svn r4152.

> Craig
>
> _______________________________________________
> OpenBabel-Devel mailing list
> OpenBabel-Devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-devel
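
For anyone wanting to repeat the round-trip check quoted above, here is a minimal Python sketch of steps (1) to (3). It rests on two assumptions that are not in the original message: that the third obabel command was meant to read sdf_to_smi.txt (and write with -O), and that the final comparison is between sdf_to_can.txt and smi_to_can.txt. The helper names (read_smiles_by_title, roundtrip_failures, run_roundtrip) are made up for illustration. Records are matched by title rather than by line position, so a molecule dropped by one conversion does not shift every later comparison.

```python
import subprocess
from pathlib import Path


def read_smiles_by_title(path):
    """Map each record's title to its SMILES string.

    Expects one 'SMILES<whitespace>title' record per line, as written by
    the smi/can output formats. Lines without a title are skipped.
    """
    records = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            smiles, title = parts
            records[title.strip()] = smiles
    return records


def roundtrip_failures(direct, roundtrip):
    """Return {title: (direct_smiles, roundtrip_smiles)} where the two differ.

    A title missing from the round-trip output is reported with None.
    """
    return {
        title: (smiles, roundtrip.get(title))
        for title, smiles in direct.items()
        if roundtrip.get(title) != smiles
    }


def run_roundtrip(sdf_path):
    """Run sdf -> can and sdf -> smi -> can, then compare the results.

    Assumes the obabel executable is on PATH. The intermediate file names
    follow the quoted commands; the input of the third conversion is an
    assumption (see the note above the code).
    """
    subprocess.run(["obabel", sdf_path, "-ocan", "-O", "sdf_to_can.txt"], check=True)
    subprocess.run(["obabel", sdf_path, "-osmi", "-O", "sdf_to_smi.txt"], check=True)
    subprocess.run(
        ["obabel", "-ismi", "sdf_to_smi.txt", "-ocan", "-O", "smi_to_can.txt"],
        check=True,
    )
    return roundtrip_failures(
        read_smiles_by_title("sdf_to_can.txt"),
        read_smiles_by_title("smi_to_can.txt"),
    )
```

Calling run_roundtrip("failures.sdf") needs a working obabel install; the two helper functions work on any pair of "SMILES title" files and can be used on their own.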