Re: [OpenBabel-Devel] Canonical SMILES performance...revisited

Tim Vandermeersch Thu, 07 Oct 2010 18:50:15 -0700

On Tue, Oct 5, 2010 at 2:21 AM, Craig A. James <cja...@emolecules.com> wrote:
> On 10/4/10 2:10 PM, Noel O'Boyle wrote:
>> Hello all,
>>
>> Back on the 19/03/2009 I emailed to this list with the subject
>> "Canonical SMILES performance" about a test set of around 18000
>> PubChem 3D structures. I did the following analysis:
>> (1) sdf ->  can
>> (2) sdf ->  smi ->  can
>> (3) diff of (1) and (2)
>>
>> At that time, we had 1424 failures (8%), which wasn't great. According
>> to a later email, the 22x branch finished with 190 failures.
>>
>> I've just redone the analysis - the download from PubChem has changed,
>> but still has 18000 or so molecules
>> (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound_3D/SDF/Conformers_00000001_00025000.sdf.gz)
>>
>> Now we have only 5 failures. Pretty good by any measure.
>>
>> (There were two canonicalisation timeouts...I think we should add an
>> option either to obabel, or to the canonical format, to set the
>> timeout.)
>>
>> obabel failures.sdf -ocan -O sdf_to_can.txt
>> obabel failures.sdf -osmi -O sdf_to_smi.txt
>> obabel -ismi sdf_to_smi.txt -ocan -O smi_to_can.txt
>> diff sdf_to_can.txt smi_to_can.txt
>>
>> <  c12=NCCN=c1ncnc2      167
>> <  N12CC[C@@H](CC1)CC2   7527
>> <  c12c3c(cc4c1c1c(nn2)c2c(cc1cc4)cccc2)cccc3    9107
>> <  c12c(c(c[nH]1)C[C@@h]1n3c...@h](C1)CC3)cccc2  21918
>> <  c\1(=c/2\[n+](=O)cccc2)/n(cccc1)[O-]  23699
>> ---
>>> C12=NCCN=C1NCNC2      167
>>> n12c...@h](CC1)CC2    7527
>>> c12c3c(cc4c1c1c([nH][nH]2)c2c(cc1cc4)cccc2)cccc3      9107
>>> c12c(c(c[nH]1)C[C@@H]1N3CC[C@@H](C1)CC3)cccc2 21918
>>> C1(C2[N+](=O)CCCC2)N(CCCC1)[O-]       23699
>>
>> I make it two kekulization problems and two canonicalisation problems
>> (both the same substructure). The fifth structure (23699) is a tough
>> one.
>
> I'm running about 1.2 million structures through the canonicalizer (it's 
> going to take a while even on 6 CPUs!).  After about 50,000 structures, I 
> found just one error, which is quite remarkable.
>
> Here is the SMILES -- both of these are correct, and both are the same 
> molecule:
>
> C(C1C=CCCC1)(C(=O)N/N=C/c1ccc(cc1)C#N)C(=O)N/N=C\c1ccc(cc1)C#N
> C(C1C=CCCC1)(C(=O)N/N=C\c1ccc(cc1)C#N)C(=O)N/N=C/c1ccc(cc1)C#N
>
> http://www.emolecules.com/image?db=549&id=1127873&width=500&height=500
>
> This is an interesting case because it's new -- OB now supports cis/trans N 
> correctly, and the two halves of the molecule are identical except for the 
> cis/trans difference.


It was a simple bug in a sorting step. Fixed in svn r4152.
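
For anyone repeating the round-trip check quoted above (sdf -> can versus
sdf -> smi -> can), a plain diff depends on both files having the same line
order. Below is a minimal sketch of the comparison step keyed on the molecule
title instead. It assumes each line has the form "SMILES<whitespace>title", as
in the diff output above; the function names parse_can and roundtrip_failures
are made up for illustration and are not part of Open Babel.

```python
def parse_can(lines):
    """Map title -> SMILES for lines of the form 'SMILES<whitespace>title'."""
    table = {}
    for line in lines:
        parts = line.split(None, 1)  # SMILES contains no whitespace
        if len(parts) == 2:          # skip blank lines and untitled entries
            table[parts[1].strip()] = parts[0]
    return table


def roundtrip_failures(direct_lines, roundtrip_lines):
    """Return (title, direct, roundtrip) for every title whose two
    canonical SMILES strings differ. Titles missing from either input
    are ignored here (hypothetical choice; a timeout could drop one)."""
    direct = parse_can(direct_lines)
    roundtrip = parse_can(roundtrip_lines)
    return [
        (title, smiles, roundtrip[title])
        for title, smiles in direct.items()
        if title in roundtrip and roundtrip[title] != smiles
    ]


if __name__ == "__main__":
    # File names follow the commands quoted above.
    with open("sdf_to_can.txt") as f1, open("smi_to_can.txt") as f2:
        for title, a, b in roundtrip_failures(f1, f2):
            print(title, a, b, sep="\t")
```

This only compares strings; it says nothing about which of the two SMILES is
right, so each reported title still needs to be inspected by hand.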

> Craig
>
> ------------------------------------------------------------------------------
> Beautiful is writing same markup. Internet Explorer 9 supports
> standards for HTML5, CSS3, SVG 1.1,  ECMAScript5, and DOM L2 & L3.
> Spend less time writing and  rewriting code and more time creating great
> experiences on the web. Be a part of the beta today.
> http://p.sf.net/sfu/beautyoftheweb
> _______________________________________________
> OpenBabel-Devel mailing list
> OpenBabel-Devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/openbabel-devel
>

