Re: [Rdkit-discuss] Canonical SMILES
Hi,

Thank you all very much for all the detailed information; the link to the Dr. Dobb's article might become very useful.

Does someone know if I can assume that the canonical SMILES of RDKit are the same as the Open Babel ones?

Am I doing something wrong in responding to the mailing list? It looks like all my answers are logged as separate messages, as opposed to being logged in the same thread. Please let me know, I don't want to make it all untidy!

Thanks.

From: da...@dalkescientific.com
Date: Fri, 13 Feb 2009 23:21:01 +0100
To: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Canonical SMILES

On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:

Yes, InChI is unique across different packages. This is because there is one definitive source for the code and algorithm. This was a design goal of InChI.

Or to twist TJ's words around .. it's exactly the same as with canonical SMILES - every implementation of InChI does it a different way. It's just that there's only one InChI implementation.

The book I was referring to is An Introduction to Chemoinformatics from A.R. Leach and V.J. Gillet. Yes, they refer to the CANGEN algorithm and to the Weininger paper you mentioned. It doesn't matter, as long as I'm aware of the scope of 'uniqueness'.

Then it's an eerie coincidence that Schneider and Baringhaus use exactly the same example, with exactly the same SMILES. ;)

http://books.google.com/books?id=feNn-JcC1KgC&pg=PA25&lpg=PA25&dq=canonical+SMILES&source=web&ots=CeTadvKPxA&sig=46za2byYVjkOtYM1cs5-xs6Bch0&hl=en&ei=ia2VSbf1FMyL-gbbguWQCQ&sa=X&oi=book_result&resnum=6&ct=result

in this case probably to do with which branch to deal with first)

As I recall when trying to implement the algorithm, the ambiguity is in dealing with ties. The algorithm assigns a unique ordering to the atoms, up to symmetry, but it's defined at the atom level. Given an atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be in the same symmetry class, but with different bond types going to B1 and B2. I asked Weininger about it and he said to choose the highest-order bond first, which mostly works but I think can be ambiguous in a few rare cases.

There may be other under-specified aspects. I haven't looked at the paper in 10 years.

Brian Kelley wrote an article about canonicalization, with code, for Dr. Dobb's magazine. It's online at http://www.ddj.com/architect/184405341

The algorithm isn't that hard to implement, and it can be useful (at very rare times) for doing things like canonicalizing SMARTS.

Andrew
da...@dalkescientific.com

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
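
As a rough illustration of the tie problem described above, the following Python sketch refines atom ranks from neighbour counts alone and shows that all six atoms of Kekule benzene end up in one symmetry class, even though every atom has one single and one double bond to its ring neighbours. This is a simplified refinement loop written for this example; it is not the CANGEN algorithm from the Weininger paper and not what RDKit or Open Babel do. The function names and the (i, j, order) bond tuples are assumptions made for the sketch.

```python
def symmetry_classes(n_atoms, bonds):
    """Return one rank per atom; atoms sharing a rank are tied.

    bonds is a list of (i, j, order) tuples. Bond order is deliberately
    ignored so that the ranking stays purely atom-level.
    """
    neighbors = {i: [] for i in range(n_atoms)}
    for i, j, _order in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Initial invariant: the number of neighbours of each atom.
    ranks = [len(neighbors[i]) for i in range(n_atoms)]
    while True:
        # Refine: combine an atom's own rank with its neighbours' ranks.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
                for i in range(n_atoms)]
        order = {key: r for r, key in enumerate(sorted(set(keys)))}
        new_ranks = [order[key] for key in keys]
        # Each round can only split classes, so an unchanged class count
        # means the partition is stable.
        if len(set(new_ranks)) == len(set(ranks)):
            return new_ranks
        ranks = new_ranks


def bond_orders_from(atom, bonds):
    """Map each neighbour of `atom` to the order of the connecting bond."""
    return {(j if i == atom else i): order
            for i, j, order in bonds if atom in (i, j)}


# Hypothetical input: benzene written in Kekule form, atoms 0..5 in a ring.
KEKULE_BENZENE = [(0, 1, 2), (1, 2, 1), (2, 3, 2),
                  (3, 4, 1), (4, 5, 2), (5, 0, 1)]

CLASSES = symmetry_classes(6, KEKULE_BENZENE)
ORDERS_AT_0 = bond_orders_from(0, KEKULE_BENZENE)
```

Here atoms 1 and 5 are in the same class as seen from atom 0, but one is reached through a double bond and the other through a single bond, so an atom-level ordering alone does not say which to visit first. A rule such as "highest bond order first" is one way to break that tie.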
Re: [Rdkit-discuss] Canonical SMILES
On Feb 17, 2009, at 9:18 AM, George Oakman wrote:

Does someone know if I can assume that the canonical SMILES of RDKit are the same as the Open Babel ones?

I wouldn't assume that without a lot of testing. My assumption is that canonical SMILES generation is so implementation-sensitive that it's very unlikely two systems would do it the same way unless that was a deliberate goal, which I know wasn't the case with those two implementations. I think also that RDKit pays more attention to handling stereochemistry than Open Babel.

Am I doing something wrong in responding to the mailing list? It looks like all my answers are logged as separate messages, as opposed to being logged in the same thread. Please let me know, I don't want to make it all untidy!

I don't use a threaded mail reader so I can't tell.

Andrew
da...@dalkescientific.com
Re: [Rdkit-discuss] Canonical SMILES
2009/2/17 Andrew Dalke da...@dalkescientific.com:

On Feb 17, 2009, at 9:18 AM, George Oakman wrote:

Does someone know if I can assume that the canonical SMILES of RDKit are the same as the Open Babel ones?

You can assume they are not the same. No attempt has been made to make them consistent.

I wouldn't assume that without a lot of testing. My assumption is that canonical SMILES generation is so implementation-sensitive that it's very unlikely two systems would do it the same way unless that was a deliberate goal, which I know wasn't the case with those two implementations. I think also that RDKit pays more attention to handling stereochemistry than Open Babel.

Am I doing something wrong in responding to the mailing list? It looks like all my answers are logged as separate messages, as opposed to being logged in the same thread. Please let me know, I don't want to make it all untidy!

I don't use a threaded mail reader so I can't tell.

I use Gmail and everything is nicely threaded.

Andrew
da...@dalkescientific.com
Re: [Rdkit-discuss] Canonical SMILES
On Fri, Feb 13, 2009 at 11:21 PM, Andrew Dalke da...@dalkescientific.com wrote:

On Feb 13, 2009, at 9:14 PM, TJ O'Donnell wrote:

Yes, InChI is unique across different packages. This is because there is one definitive source for the code and algorithm. This was a design goal of InChI.

Or to twist TJ's words around .. it's exactly the same as with canonical SMILES - every implementation of InChI does it a different way. It's just that there's only one InChI implementation.

And since IUPAC has not only done an open implementation with a reasonable license, but also trademarked the name and restricted its use so that you can't call something InChI unless it passes their validation suite, InChI will hopefully remain a portable canonical identifier.

in this case probably to do with which branch to deal with first)

As I recall when trying to implement the algorithm, the ambiguity is in dealing with ties. The algorithm assigns a unique ordering to the atoms, up to symmetry, but it's defined at the atom level. Given an atom A bonded to atoms B1 and B2, it's possible for B1 and B2 to be in the same symmetry class, but with different bond types going to B1 and B2. I asked Weininger about it and he said to choose the highest-order bond first, which mostly works but I think can be ambiguous in a few rare cases.

I don't recall any. The decision about which bond to follow first at a branch is really the big one.

There may be other under-specified aspects. I haven't looked at the paper in 10 years.

Stereochemistry is one that immediately comes to mind.

-greg
Re: [Rdkit-discuss] Optimizing SSS in the RDKit
On Feb 17, 2009, at 12:40 PM, Greg Landrum wrote:

Well, now I'm incredibly behind in all this. I will try to slowly catch up.

That'll teach you not to take a vacation. ;)

Seriously though, I was writing as I worked, which means there's a lot of verbiage and places where I wasn't clear on things. The last email puts everything together.

I've generated a new, larger, testing dataset using the PubChem HTS compounds. I will also post the details on those (hopefully this morning).

Cool. I've asked a few people/lists for data sets but no response yet. There are a few I'll try.

I don't know Judy trees. Do you have a reference/pointer?

Oops, Judy array:

http://judy.sourceforge.net/
http://en.wikipedia.org/wiki/Judy_array

and I did a (buggy as it turns out) wrapper at http://www.dalkescientific.com/Python/PyJudy.html when I last looked into substructure fp filters.

My idea then and now was to store a mapping from:

unique path identifier -> sorted list of matching compounds

Substructure filtering is the same as generating all paths in the query and finding the intersection of their sorted lists. I think this is called an inverted index. Most paths are rare, so storing all those paths doesn't take much space.

I was thinking that a sorted list works better than a hash or normal trie because I could do an N-way merge to find the intersection, rather than a lot of membership tests. But on reflection, the latter may be faster. Looks like more testing will occur.

They aren't by any chance connected to the thing presented in Andrew Smellie's recent paper (haven't read it yet)? http://pubs.acs.org/doi/abs/10.1021/ci800325v

Not at all. I really need to visit the library soon. Or pay $30 for 24 hour access to ACS, plus unknown price for access to Ullmann's paper.

I think it's worth looking into branched paths as well for real substructure searches. People don't query with linear fragments all that often, so it seems like it would be a win.

While people don't query with linear fragments, more complex structures contain linear subparts, including crossing paths. My thought was that linear paths are easy to generate and canonicalize, and would give a baseline limit to more sophisticated schemes.

Andrew
da...@dalkescientific.com
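
The path-based inverted index described in this message can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the PyJudy-based code or anything in the RDKit: molecules are hypothetical (atom symbols, bond pairs) tuples, bond orders are ignored, paths are capped at three atoms, and a path is canonicalized by keeping the smaller of its forward and reversed spellings. All function names are made up for the example.

```python
from collections import defaultdict


def linear_paths(atoms, bonds, max_atoms=3):
    """Return the canonical linear paths containing up to max_atoms atoms."""
    neighbors = defaultdict(list)
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    found = set()

    def walk(path):
        symbols = tuple(atoms[k] for k in path)
        # Canonical form: the smaller of the two reading directions.
        found.add("-".join(min(symbols, symbols[::-1])))
        if len(path) < max_atoms:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:  # simple paths only, no revisits
                    walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found


def build_index(library):
    """Map each canonical path to a sorted list of compound ids."""
    index = defaultdict(list)
    for cid in sorted(library):  # ascending ids keep every list sorted
        atoms, bonds = library[cid]
        for path in linear_paths(atoms, bonds):
            index[path].append(cid)
    return index


def intersect_sorted(a, b):
    """Intersect two sorted lists with a two-pointer merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def candidates(index, query_atoms, query_bonds):
    """Return compounds containing every query path (a screen, not a match)."""
    postings = [index.get(p, [])
                for p in linear_paths(query_atoms, query_bonds)]
    postings.sort(key=len)  # start from the rarest path
    result = postings[0]
    for plist in postings[1:]:
        result = intersect_sorted(result, plist)
    return result


# Hypothetical three-compound library (heavy-atom skeletons only).
LIBRARY = {
    1: (["C", "C", "O"], [(0, 1), (1, 2)]),
    2: (["C", "C", "N"], [(0, 1), (1, 2)]),
    3: (["C", "C", "O", "C"], [(0, 1), (1, 2), (2, 3)]),
}
INDEX = build_index(LIBRARY)
HITS = candidates(INDEX, ["C", "O"], [(0, 1)])
```

With this toy library a C-O query keeps compounds 1 and 3 and drops compound 2. The result is only a candidate list: a real substructure search would still need to run an atom-by-atom match on the survivors. Swapping intersect_sorted for set membership tests would be the alternative mentioned above, and which is faster would have to be measured.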