On Mon, Oct 11, 2010 at 7:20 PM, Craig James <craig_ja...@emolecules.com> wrote:
> This is a cross-post from the OpenBabel-devel mailing list.
>
> On 10/11/10 8:52 AM, Tim Vandermeersch wrote:
>> I still need to figure out how to deal with metallocene compounds
>> where there are 8 or more neighbors with the same symmetry class. I
>> already have a hack to handle ferrocene but we might want to extend
>> this. IIRC, this might also help kekulization?
>>
>> Metallocene: metal atom sandwiched between rings (4 or more atoms per ring)
>> Normalization: Remove bonds connecting metal to ring atoms without
>> increasing the number of disconnected fragments. Bonds will have to be
>> sorted using symmetry classes to always remove the same bonds.
>>
>> This reduces the number of states for canonicalization dramatically.
>> This also makes the smiles nicer since all the closure digits can be
>> omitted.
>>
>> C12C3=C4[Fe]5678923(C1=C45)C1C6=C8C9=C71 --> C1=CC(C=C1)[Fe]C1C=CC=C1
>>
>> Does this sound like a reasonable solution?
>
> I think I'd vote for this, but there are some "philosophical" issues that it
> raises regarding normalizations.
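
For readers who want to see the proposed normalization concretely, here is a
minimal sketch in Python (for brevity, not Open Babel's C++). Everything in
it is an assumption for illustration: the function names, the atom numbering,
treating every bond to the metal as a removal candidate, and above all the
tie-break on atom index. All ten ring atoms of ferrocene share one symmetry
class, so a real implementation would need a canonical tie-break instead.

```python
def count_fragments(num_atoms, bonds):
    """Count connected components of the molecular graph with union-find."""
    parent = list(range(num_atoms))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(num_atoms)})


def strip_metal_bonds(num_atoms, bonds, metal, sym_class):
    """Greedily drop bonds to `metal` while the fragment count is unchanged.

    Candidates are visited in (symmetry class, atom index) order; the atom
    index tie-break is a placeholder assumption, not a canonical ordering.
    """
    kept = [tuple(sorted(b)) for b in bonds]
    baseline = count_fragments(num_atoms, kept)

    def neighbor(bond):
        return bond[0] if bond[1] == metal else bond[1]

    candidates = sorted(
        (b for b in kept if metal in b),
        key=lambda b: (sym_class[neighbor(b)], neighbor(b)),
    )
    for bond in candidates:
        trial = [b for b in kept if b != bond]
        # Only accept the removal if no new fragment is created.
        if count_fragments(num_atoms, trial) == baseline:
            kept = trial
    return kept


# Ferrocene drawn with ten Fe-C bonds: atoms 0-4 and 5-9 are the two rings,
# atom 10 is Fe. Bond orders are ignored; only connectivity matters here.
ring_a = [(i, (i + 1) % 5) for i in range(5)]
ring_b = [(5 + i, 5 + (i + 1) % 5) for i in range(5)]
metal_bonds = [(i, 10) for i in range(10)]
sym_class = [1] * 10 + [2]

normalized = strip_metal_bonds(11, ring_a + ring_b + metal_bonds, 10, sym_class)
remaining = sorted(b for b in normalized if 10 in b)
```

With this input, `remaining` holds exactly one Fe-C bond per ring, which
matches the shape of the shorter SMILES quoted above.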
>
> On a practical side, I think this is an excellent idea. These metallocenes
> can be an algorithmic quagmire that sucks good programmers into the mud. But
> if we go down this path, we have to ask much harder questions.

I can probably find another way to deal with these structures, but we
want to get OB released at some point :-)

> The point of canonicalization is to generate a single SMILES for each
> molecule for database purposes. But when are two molecules the same and when
> are they different? That's a very hard question.

Yes, for example: do I have this in stock, ...

There are many ways to draw ferrocene: 2 single bonds to ring atoms, 2
single bonds to ring centroids (not sure how we handle this
currently), 10 bonds, ... All of these would need to be normalized.

> If we start normalizing metallocenes, why not normalize nitro, phosphate and
> sulfonates? (Sorry, I'm not a chemist, I hope I got these names right.) What
> about tautomers?

All of these should be done if the user requests them. However, for the
OpenBabel 2.3 release, we canonicalize the structure without
additional normalization. For future versions, I think we should have
normalization plugins that can be enabled as needed. I would like to add
good support for tautomers, but this alone is already a reasonably big
task.

> The Weiningers (Dave and Art, and father Joseph contributed too) decided to
> put aromaticity in as part of the definition of canonical SMILES because a
> kekule representation was worthless for database use. But his original
> database was only 25,000 compounds and he was only concerned about cLogP
> calculations, so he left out these other cases. They just didn't matter.
>
> But in a modern cheminformatics system, they are equally problematic. The
> Weiningers solved the aromaticity problem, but left all the others "as an
> exercise for the reader" (that would be us).
>
> The InChI team decided to handle more problems.
> But they were guided by their own internal requirements: to produce a
> consistent nomenclature for IUPAC. They were NOT trying to provide a useful
> general-purpose solution for cheminformatics.

Yes, InChI has extensive normalization. This would be a good starting point.

> So now the OpenBabel project and OpenSMILES definition are facing a problem:
> How much normalization are we going to do? Are we going to go just one step
> further and decide that metallocenes should be normalized, but not nitro
> groups or tautomers? Or are we going to go all the way and define clear
> standards for normalizing all of these problem cases?
>
> At Daylight, we came up with three levels of normalization:
>
> Absolute SMILES: Includes stereochemistry and isotopic markings
> Unique SMILES: Excludes stereochemistry and isotopes
> Graph SMILES: All atoms are C, all bonds are single

This is similar to the InChI layers, and our canonical coding also does
this, although it's not an option yet. See below.

> My colleague Rashmi Mistry (modgraph.co.uk) wrote the chemical registration
> systems for GSK and several other large pharma companies. He came up with a
> whole set of rules for normalizations that includes all of these problem
> cases, plus another layer of normalization:
>
> Parent SMILES: Remove salts and solvates

If we have normalization plugins, it would be easy to do all this.

> I would argue that if we're going to start doing more normalizations for
> SMILES, we should be formal about it and establish three or four formal
> levels of canonicalization, much like Daylight's.

The canonical code we produce is a list of numbers, which is just a set
of smaller lists joined together. Bonds are encoded by a FROM list and a
CLOSURE list. Together these give the topology of the molecule as a graph.
This depends on symmetry classes and is not the same as the all-carbon,
all-single-bond graph, but that should be an option in OBGraphSym.
Atom and bond types are the next layers (ATOM-TYPES & BOND-TYPES)
CHARGES layer if needed
The next layers could be made optional: ISOTOPES, STEREO

I'll add a function parameter so that these layers can be selected with
bitwise-ORed flags, and I'll take care of the dependencies between them.
This can be done for 2.3.

To conclude, I totally agree with all of this, but it is beyond the scope
of OB 2.3.

Tim

> Craig
>
> P.S. I think I'll cross-post this to the Blue Obelisk mailing list.
>
> ------------------------------------------------------------------------------
> Beautiful is writing same markup. Internet Explorer 9 supports
> standards for HTML5, CSS3, SVG 1.1, ECMAScript5, and DOM L2 & L3.
> Spend less time writing and rewriting code and more time creating great
> experiences on the web. Be a part of the beta today.
> http://p.sf.net/sfu/beautyoftheweb
> _______________________________________________
> Blueobelisk-SMILES mailing list
> blueobelisk-smi...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/blueobelisk-smiles
>

_______________________________________________
OpenBabel-Devel mailing list
OpenBabel-Devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openbabel-devel
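
As a rough illustration of the layer flags mentioned in the reply above, the
following Python sketch shows one way bitwise-ORed flags with dependency
resolution could look. It is not Open Babel code: the `Layer` enum, the
`REQUIRES` table, and `resolve` are hypothetical names, and the specific
dependencies (for example, STEREO needing atom and bond types) are assumed
purely for the example, since the message does not spell them out.

```python
from enum import IntFlag


class Layer(IntFlag):
    """Hypothetical flags mirroring the layers named in the message."""
    TOPOLOGY = 1      # FROM and CLOSURE lists
    ATOM_TYPES = 2
    BOND_TYPES = 4
    CHARGES = 8
    ISOTOPES = 16
    STEREO = 32


# Assumed dependency table: each layer maps to the layers it needs.
REQUIRES = {
    Layer.CHARGES: Layer.ATOM_TYPES,
    Layer.ISOTOPES: Layer.ATOM_TYPES,
    Layer.STEREO: Layer.ATOM_TYPES | Layer.BOND_TYPES,
}


def resolve(requested):
    """Return the requested layers plus everything they depend on."""
    resolved = Layer.TOPOLOGY | requested  # topology is always encoded
    changed = True
    while changed:
        changed = False
        for layer, needs in REQUIRES.items():
            if layer in resolved and (resolved & needs) != needs:
                resolved |= needs
                changed = True
    return resolved


# Caller asks only for stereo; the atom and bond type layers are pulled in.
stereo_layers = resolve(Layer.STEREO)
isotope_layers = resolve(Layer.ISOTOPES)
```

Resolving dependencies in one place keeps callers free to pass any
combination of flags without knowing which layers build on which.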