Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread George Oakman

Hi,

 

Thank you all very much for the detailed information; the link to the 
Dr. Dobb's article may prove very useful.

 

Does someone know if I can assume that the canonical SMILES of RDKit are the 
same as the Open Babel ones?

 

Am I doing something wrong in responding to the mailing list? It looks like all 
my answers are logged as separate messages as opposed to being logged in the 
same thread. Please let me know, I don't want to make it all untidy!

 

Thanks.

 
 From: da...@dalkescientific.com
 Date: Fri, 13 Feb 2009 23:21:01 +0100
 To: rdkit-discuss@lists.sourceforge.net
 Subject: Re: [Rdkit-discuss] Canonical SMILES
 
 On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
  Yes, InChI is unique across different packages. This is because
  there is one definitive source for the code and algorithm. This was
  a design goal of InChI.
 
 
 Or to twist TJ's words around .. it's exactly the same as with 
 canonical SMILES - every implementation of InChI does it a different 
 way. It's just that there's only one InChI implementation.
 
  The book I was referring to is An Introduction to 
  Chemoinformatics from A.R. Leach and V.J. Gillet. Yes, they refer 
  to the CANGEN algorithm and to the Weininger paper you mentioned.
  It doesn't matter, as long as I'm aware of the scope of 
  'uniqueness'.
 
 Then it's an eerie coincidence that Schneider and Baringhaus use 
 exactly the same example, with exactly the same SMILES. ;)
 
 http://books.google.com/books?id=feNn-JcC1KgC&pg=PA25&lpg=PA25&dq=canonical+SMILES&source=web&ots=CeTadvKPxA&sig=46za2byYVjkOtYM1cs5-xs6Bch0&hl=en&ei=ia2VSbf1FMyL-gbbguWQCQ&sa=X&oi=book_result&resnum=6&ct=result
 
 
  in this case probably to do with which branch to deal with first)
 
 
 As I recall when trying to implement the algorithm, the ambiguity is 
 in dealing with ties. The algorithm assigns a unique ordering to the 
 atoms, up to symmetry, but it's defined at the atom level. Given an 
 atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be 
 in the same symmetry class, but with different bond types going to B1 
 and B2.
 
 I asked Weininger about it and he said choose the highest order bond 
 first, which mostly works but I think can be ambiguous for a few 
 rare cases.
 
 There may be other under-specified aspects. I haven't looked at the 
 paper in 10 years.
 
 Brian Kelley wrote an article about canonicalization, with code, for 
 Dr. Dobb's magazine. It's online at
 http://www.ddj.com/architect/184405341
 
 The algorithm isn't that hard to implement, and it can be useful (at 
 very rare times) for doing things like canonicalizing SMARTS.
 
 
 Andrew
 da...@dalkescientific.com
 
 
 
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Andrew Dalke

On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
Does someone know if I can assume that the canonical SMILES of  
RDKit are the same as the Open Babel ones?


I wouldn't assume that without a lot of testing. My assumption
is that canonical SMILES generation is so implementation
sensitive that it's very unlikely two systems would do it the
same way unless that was a deliberate goal.

Which I know wasn't the case with those two implementations.

I think also that RDKit pays more attention to handling
stereochemistry than OpenBabel.

Am I doing something wrong in responding to the mailing list? It
looks like all my answers are logged as separate messages as
opposed to being logged in the same thread. Please let me know, I
don't want to make it all untidy!


I don't use a threaded mail reader so I can't tell.

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Noel O'Boyle
2009/2/17 Andrew Dalke da...@dalkescientific.com:
 On Feb 17, 2009, at 9:18 AM, George Oakman wrote:
 Does someone know if I can assume that the canonical SMILES of
 RDKit are the same as the Open Babel ones?

You can assume they are not the same. No attempt has been made to make
them consistent.

 I wouldn't assume that without a lot of testing. My assumption
 is that canonical SMILES generation is so implementation
 sensitive that it's very unlikely two systems would do it the
 same way unless that was a deliberate goal.

 Which I know wasn't the case with those two implementations.

 I think also that RDKit pays more attention to handling
 stereochemistry than OpenBabel.

 Am I doing something wrong in responding to the mailing list? It
 looks like all my answers are logged as separate messages as
 opposed to being logged in the same thread. Please let me know, I
 don't want to make it all untidy!

 I don't use a threaded mail reader so I can't tell.

I use Gmail and everything is nicely threaded.

Andrew
da...@dalkescientific.com







Re: [Rdkit-discuss] Canonical SMILES

2009-02-17 Thread Greg Landrum
On Fri, Feb 13, 2009 at 11:21 PM, Andrew Dalke
da...@dalkescientific.com wrote:
 On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:
 Yes, InChI is unique across different packages.  This is because
 there is one definitive source for the code and algorithm.  This was
 a design goal of InChI.


 Or to twist TJ's words around .. it's exactly the same as with
 canonical SMILES - every implementation of InChI does it a different
 way. It's just that there's only one InChI implementation.

And since IUPAC has not only done an open implementation with a
reasonable license, but also trademarked the name and placed the
restriction on its use that you can't call it InChI unless you pass
their validation suite, InChI will hopefully remain a portable
canonical identifier.

 in this case probably to do with which branch to deal with first)


 As I recall when trying to implement the algorithm, the ambiguity is
 in dealing with ties. The algorithm assigns a unique ordering to the
 atoms, up to symmetry, but it's defined at the atom level. Given an
 atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be
 in the same symmetry class, but with different bond types going to B1
 and B2.

 I asked Weininger about it and he said choose the highest order bond
 first, which mostly works but I think can be ambiguous for a few
 rare cases.

I don't recall any. The decision about which bond to follow first at a
branch is really the big one.
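
The tie described above can be reproduced with a toy example. What
follows is a minimal sketch, not the published CANGEN algorithm: it
assumes a simplified refinement that uses only atom-level invariants
(element symbol and degree), and it deliberately skips aromaticity
perception. The function names and the input layout (a list of symbols
plus a dict of bonds) are made up for illustration.

```python
def symmetry_classes(symbols, bonds):
    """Return one class label per atom using atom-level invariants only.

    symbols: list of element symbols, indexed by atom number
    bonds:   dict mapping (i, j) atom index pairs to a bond order
    Bond orders are intentionally ignored here, mirroring a ranking that
    is "defined at the atom level".
    """
    n = len(symbols)
    neighbors = {i: [] for i in range(n)}
    for (i, j) in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    def to_ranks(keys):
        # Equal keys get equal ranks; ranks follow the sorted key order.
        order = {key: rank for rank, key in enumerate(sorted(set(keys)))}
        return [order[key] for key in keys]

    ranks = to_ranks([(symbols[i], len(neighbors[i])) for i in range(n)])
    while True:
        # Refine each atom's key with the ranks of its neighbors.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
                for i in range(n)]
        new_ranks = to_ranks(keys)
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks  # no class was split, so the partition is stable
        ranks = new_ranks


def ambiguous_branches(symbols, bonds):
    """List (A, B1, B2) where B1 and B2 are tied but bonded to A differently."""
    classes = symmetry_classes(symbols, bonds)
    order = {}
    for (i, j), bond_order in bonds.items():
        order[(i, j)] = order[(j, i)] = bond_order
    hits = []
    for (a, b1) in sorted(order):
        for (a2, b2) in sorted(order):
            if (a == a2 and b1 < b2 and classes[b1] == classes[b2]
                    and order[(a, b1)] != order[(a, b2)]):
                hits.append((a, b1, b2))
    return hits


# Six-membered carbon ring written in Kekule form (alternating 2, 1, 2, ...).
ring_symbols = ["C"] * 6
ring_bonds = {(0, 1): 2, (1, 2): 1, (2, 3): 2, (3, 4): 1, (4, 5): 2, (5, 0): 1}
print(symmetry_classes(ring_symbols, ring_bonds))
print(ambiguous_branches(ring_symbols, ring_bonds))
```

Every ring atom lands in the same class, yet each one has a single and
a double bond leading to its two tied neighbours, so an extra rule
(such as "highest order bond first") is needed to pick a branch.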

 There may be other under-specified aspects. I haven't looked at the
 paper in 10 years.

Stereochemistry is one that immediately comes to mind.

-greg



Re: [Rdkit-discuss] Optimizing SSS in the RDKit

2009-02-17 Thread Andrew Dalke

On Feb 17, 2009, at 12:40 PM, Greg Landrum wrote:
Well, now I'm incredibly behind in all this. I will try to slowly  
catch up.


That'll teach you not to take a vacation.  ;)

Seriously though, I was writing as I worked, which means there's
a lot of verbiage and places where I wasn't clear on things.  The
last email puts everything together.


I've generated a new, larger, testing dataset using the PubChem HTS
compounds. I will also post the details on those (hopefully this
morning).


Cool. I've asked a few people/lists for data sets but no response
yet. There's a few I'll try.


I don't know Judy trees. Do you have a reference/pointer?



Oops, Judy array:
  http://judy.sourceforge.net/
  http://en.wikipedia.org/wiki/Judy_array
and I did a (buggy as it turns out) wrapper at
  http://www.dalkescientific.com/Python/PyJudy.html
when I last looked into substructure fp filters.

My idea then and now was to store a mapping from:
   unique path identifier -> sorted list of matching compounds

Substructure filtering is then the same as generating all paths
in the query and finding the intersection of their sorted lists.

I think this is called an inverted index. Most paths are
rare, so storing all those paths doesn't take much space.

I was thinking that a sorted list works better than a
hash or normal trie because I could do an N-way merge
to find the intersection, rather than a lot of membership
tests. But on reflection, the latter may be faster.
Looks like more testing will occur.
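
To make the two options concrete, here is a minimal sketch of the
inverted index with both intersection strategies. It uses plain Python
dicts and lists rather than Judy arrays, and the function names and the
string path identifiers are placeholders invented for the example. The
result is only a screen: surviving compounds would still need a real
substructure match.

```python
from collections import defaultdict


def build_index(compound_paths):
    """Map each path identifier to a sorted list of compound ids.

    compound_paths: dict of compound id -> iterable of path identifiers.
    """
    index = defaultdict(list)
    for compound_id in sorted(compound_paths):  # ascending ids keep lists sorted
        for path in set(compound_paths[compound_id]):
            index[path].append(compound_id)
    return dict(index)


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def candidates_by_merge(index, query_paths):
    """Compounds containing every query path, found by merging sorted lists."""
    paths = set(query_paths)
    if not paths or any(p not in index for p in paths):
        return []  # simplification: an empty query returns nothing
    lists = sorted((index[p] for p in paths), key=len)  # rarest path first
    result = lists[0]
    for other in lists[1:]:
        result = intersect_sorted(result, other)
        if not result:
            break
    return result


def candidates_by_membership(index, query_paths):
    """Same answer, using membership tests against hashed sets instead."""
    paths = set(query_paths)
    if not paths or any(p not in index for p in paths):
        return []
    lists = sorted((index[p] for p in paths), key=len)
    others = [set(lst) for lst in lists[1:]]
    return [cid for cid in lists[0] if all(cid in s for s in others)]


# Toy data: compound id -> made-up path identifiers.
toy = {1: ["CC", "CO", "CCO"], 2: ["CC", "CN"], 3: ["CC", "CO"]}
toy_index = build_index(toy)
print(candidates_by_merge(toy_index, ["CC", "CO"]))
print(candidates_by_membership(toy_index, ["CC", "CO"]))
```

Starting from the rarest path keeps the working list short in both
versions, which is where the "most paths are rare" observation pays off.
Which variant is faster on real data is exactly what would need timing.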


They aren't by any chance connected to the thing presented in Andrew
Smellie's recent paper (haven't read it yet)?
http://pubs.acs.org/doi/abs/10.1021/ci800325v



Not at all. I really need to visit the library soon.
Or pay $30 for 24 hour access to ACS, plus unknown
price for access to Ullmann's paper.


I think it's worth looking into branched paths as well for real
substructure searches. People don't query with linear fragments all
that often, so it seems like it would be a win.



While people don't query with linear fragments, more complex
structures contain linear subparts, including crossing paths.

My thought was that linear paths are easy to generate and
canonicalize, and would give a baseline limit to more
sophisticated schemes.
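
As a rough illustration of why linear paths are easy to generate and
canonicalize, here is a minimal sketch. It assumes a toy molecule
representation (a list of atom symbols plus a dict of bonds), treats a
path as a sequence of atom and bond tokens, and takes the smaller of
the forward and reversed sequence as the canonical form. The names and
the token scheme are invented for the example and are not RDKit's.

```python
def canonical_path(tokens):
    """Pick the lexicographically smaller of a token sequence and its reverse."""
    forward = tuple(tokens)
    backward = forward[::-1]
    return min(forward, backward)


def linear_paths(symbols, bonds, max_atoms=4):
    """Enumerate canonical linear paths containing up to max_atoms atoms.

    symbols: list of atom symbols, indexed by atom number
    bonds:   dict mapping (i, j) atom index pairs to a bond symbol ("-", "=")
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for (i, j), bond in bonds.items():
        neighbors[i].append((j, bond))
        neighbors[j].append((i, bond))

    found = set()

    def extend(path_atoms, tokens):
        found.add(canonical_path(tokens))
        if len(path_atoms) == max_atoms:
            return
        for nbr, bond in neighbors[path_atoms[-1]]:
            if nbr not in path_atoms:  # simple paths only, no atom revisited
                extend(path_atoms + [nbr], tokens + [bond, symbols[nbr]])

    for start in range(len(symbols)):
        extend([start], [symbols[start]])
    return found


# Ethanol heavy atoms: C-C-O
ethanol_paths = linear_paths(["C", "C", "O"], {(0, 1): "-", (1, 2): "-"},
                             max_atoms=3)
for path in sorted(ethanol_paths):
    print("".join(path))
```

Reversing whole tokens rather than characters keeps two-letter symbols
such as Cl intact. Each resulting tuple could serve as the "unique path
identifier" key of an inverted index.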




Andrew
da...@dalkescientific.com