Re: [Cdk-user] Deduce Bond System Tool

kevin.lawson Thu, 15 Mar 2012 11:13:13 -0700

Hi Egon

-----Original Message-----
From: Egon Willighagen [mailto:[email protected]]
Sent: 15 March 2012 15:27
To: Lawson Kevin GBJH
Cc: [email protected]
Subject: Re: [Cdk-user] Deduce Bond System Tool

> Hi Kevin,

On Wed, Mar 14, 2012 at 6:18 PM,  <[email protected]> wrote:
>> I have been having problems with the Deduce Bond System Tool.  When I use
>> the routine FixAromaticBondOrders I find that it quite often times out (100
>> seconds)

> That is because of the AllRingsFinder being used, which is an
> exhaustive ring searchers and indeed time consuming algorithm.

I have tried this out in debug mode and the AllRingsFinder actually works fine 
with the examples I have used.  The problem arises because the algorithm needs 
to potentially run through 18 combinations of double bonds for each 6-membered 
ring and then expensively test each molecule produced.  That's 18^5 expensive 
tests for a molecule with 5 rings in it.  Usually this doesn't matter, because 
it only needs to test a few combinations to find a suitable one - the problem 
arises when all tests fail for a particular ring (like the pyranone below) - it 
then times out before it can try out all the necessary combinations.

> However, it has already been said to not search for rings larger than
> 7 atoms... so, that combo might indicate a bug.

>> if the molecule contains 5 or more (possibly unconnected) rings and
>> one of them has complexities such that a simple solution cannot be found.
>> On inspecting the source code (version 1.4.2), it appears that all rings in
>> the molecule being tested are treated combinatorially

> Which is in fact needed, because you cannot know on beforehand which
> bonds lack the required 'double' order flag, and which are OK... that
> is, on the old CDK world... and the Tool has been written in that old
> world.

As long as a ring is isolated from other rings then it's surely OK to find a 
'local' solution for it?

>In fact, this was one of the reasons why I set out to rewrite atom
>detection in the CDK.

>> - all combinations of
>> possible solutions for each ring being tested.  It occurred to me that it
>> would potentially be faster (for complex molecules) to divide the rings up
>> into groups of interconnected rings and treat these separately.

> I use this trick somewhere else too, in the aromaticity detector, if
> not mistaken...

>> The attached code (which I have been trying out in the LICSS system) does 
>> just
>> this - the core algorithm is unchanged, but is applied sequentially to each
>> group of interconnected rings rather than to the whole molecule. In my
>> hands, this makes things much faster.  For example, Smiles such as
>> Oc1c(C2CC(Cc3ccccc23)c4ccc(cc4)c5ccccc5)c(=O)oc6ccccc16 (Difenacoum) is now
>> configured very quickly with four kekulised aromatic rings, the remaining
>> pyranone ring (which isn't recognised as aromatic) is returned unchanged.
>> With the previous algorithm, this molecule timed out.

> I was discussing the class with Ola (Bioclipse) this week, and since
> so many people use input formats where double bond localization is
> missing (SMILES as above, but also MDL molfiles with query bond type
> 4), we just have to have a good algorithm.

> I think your approach is a good one, and wonder why it times out in
> the first place, because the AllRingsFinder is supposed to stop after
> finding 7 atoms anyway... that should be enough a termination
> condition for it the be fast anyway... I think we might have a
> regression...

See earlier comment - there is no issue here with the AllRingsFinder

> The other thought is that both input types have more information:
> which bonds do not have a well-defined bond order. For SMILES this is
> each bond between two lower case organic subset atoms. For MDL
> molfiles it is the bonds of type 4.

> Thus, I think we should not search in just small ring systems (as they
> might easily get large enough for the AllRingsFinger to explode on
> anyway), but restrict the searching on just the unknown bits.

Yes definitely.

> This idea I was planning to discuss next week Thursday with Ola and
> Klas (cc-ed), looking into implementing that approach. That should
> even be faster and I expect more accurate, as we might not fall into
> the trap changing bond orders that were explicitly single in the
> SMILES or molfile in the first place...

> (That may sound obscure, but was in fact one of the issue with the
> double bond recovering tool we used before the we got this code
> donated by the EPA...)

> That set aside... we always have room in the CDK for alternative
> algorithms, and the thing I am thinking of requires more prior
> information that your suggestion does...

>> When I originally tried this approach to grouping the interconnected rings,
>> I was expecting to have to rewrite the IsStructureOK routine which checks
>> each structure for being 'properly' aromatic.

> The current 1.4.x DeduceBondSystemTool does not know about the atom
> types in the structure... how does your code figure out what is
> aromatic?

My code just uses the same IsStructureOK routine that the original CDK 1.4.2 
DeduceBondSystemTool does.  But while this is fine at rejecting atoms with more 
bonds than valency available it doesn't detect atoms with fewer bonds than 
available valency.  Indeed, as stated it 'passes' all singly bonded 6 membered 
rings (with atoms marked sp2).  Now the CDK core Loop algorithm first tries 
combinations with the maximum number of double bonds.  However with particular 
orders of ring combinations it can come unstuck.  Thus for acridine if it first 
"happens" to try the outer rings and generate the two possible combinations of 
outer ring with no double bond shared with the central ring then the resulting 
molecule with only 6 double bonds (instead of the necessary 7) is passed as OK 
by IsStructureOK and is the final molecule returned.  I say "happens" because, 
somewhat surprisingly, the combination of Smiles parser-FixAromaticBondOrders 
and Smiles generator does not always return the same kekulised Smiles for the 
same input Smiles!  This is because the order of rings generated by the 
AllRingsFinder is not always the same; I think that is actually because the 
Smiles parser doesn't always return an IMolecule with the same atom/bond 
numbering but I've not looked at this in detail.

I did have a brief look at rewriting IsStructureOK to ignore Aromaticity and 
just see if the double bond layouts were correct for each atom.  To do that 
requires knowledge of atom types and how the rings are bonded to the rest of 
the molecule (eg whether an SP2 nitrogen bears a substituent or not).  I came 
to the conclusion that if you were going to use this sort of info you would 
more sensibly use it to work out correct double bond layouts in the first place 
rather than use the combinatorial guess approach (see suggested approach at the 
bottom of the email)

>> In fact, no modification is
>> necessary in that it quite happily accepts rings with no double bonds as
>> aromatic (if the atoms are sp2 hybridised).  This can lead to problems with
>> both the existing algorithm (and my variant).  For example, anthracene or
>> acridine are sometimes kekulised with the central ring containing no double
>> bonds (depending on the exact ordering of the rings detected by the ring
>> finder - itself dependent, I believe, on how the Smiles parser happens to
>> set up the IMolecule). If you do decide to look at my suggested approach,
>> and also fix the aromaticity checking, it will be necessary to make sure
>> that it is applied only to those rings in each interconnected group.

> I very much like to work things out... one aspect is, is to isolate a
> set of good structures that (alternative) implementations should be
> able to solve in reasonable time...

I've been using the Wellcome Anti-malarials set and also the Pesticide Manual 
(which is where the Difenacoum example came from)...

> The second thing is how we want to do things. The DeduceBondSystemTool
> effectively uses it's own atom type code; I personally like to reuse
> the central atom typer, which we can independently test. Same for
> aromaticity detection... but, aromaticity itself is not important for
> deducing double bond locations... it's just the delocalized nature we
> want to get rid of, and should not even care about what is aromatic,
> anti-aromatic, or non-aromatic...

What you say sounds sensible.  I am also not at all sure what part, if any, 
aromaticity detection should be involved.  In the Smiles case, can't one just 
try and layout double bonds for rings marked entirely with lower case atoms and 
see if a properly alternating solution drops out?  BTW I had a brief go at 
entirely recoding the FixAromaticBondOrders algorithm before I realised my 
knowledge of CDK and Java just wasn't up to it (this side of Christmas).  Just 
in case it's any use here's what I was trying to do:
 1. Find all the rings marked as aromatic (c1ccn.. etc) and divide them up into 
groups of interconnected rings (as in the code I attached earlier)
 2. Set up an ordered graph of each ring group including all necessary atom 
type info (+ 'free' valencies etc) for the connecting atoms
 3. Find a suitable starting atom preferably with only one possibility (eg N on 
a ring junction must have all single bonds to it or any O or S atom)
 4. Traverse the graph laying out the double bonds recording any positions 
where a choice must be made and backtracking if necessary (either ring 
junctions or SP2 Nitrogen atoms which could potentially have two single bonds 
[+ implicit H] or a double + single bond).  In the end this is just a colouring 
problem.

I imagine that this would be way faster than the present approach which as I 
understand it just tests each possible combination of double bond layouts for 
each ring.  Firstly, I think that with all the Atom typing info available you 
would normally only need a very small amount of backtracking (and thus wouldn't 
need to try out many combinations). Crucially, you wouldn't need an expensive 
test for each molecule produced - when you reach the end of the graph it has 
either worked or it hasn't!

Conclusions:

1. I think the modified algorithm I originally attached will generate the same 
results as the original CDK 1.4.2 algorithm but much faster for molecules 
containing multiple, but mostly unconnected rings (unless I have mucked up 
somewhere)
2. The present algorithm's aromaticity test is not sufficient to demonstrate a 
molecule has all the double bonds layed out correctly
3. There's almost certainly a better, faster, more complete way of doing this 
using all the extra atom type and bonding info available..
4. ..but the idea of splitting up into groups of interconnected rings may be a 
useful first step anyway

Best wishes

Kevin

> Thoughts?

> Egon

--
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers

Syngenta Limited, Registered in England No 2710846
Registered Office : Syngenta Limited, European Regional Centre, Priestley Road, 
Surrey Research Park, Guildford, Surrey, GU2 7YH, United Kingdom

This message may contain confidential information. If you are not the 
designated recipient, please notify the sender immediately, and delete the 
original and any copies. Any use of the message by you is prohibited. 

------------------------------------------------------------------------------
This SF email is sponsosred by:
Try Windows Azure free for 90 days Click Here 
http://p.sf.net/sfu/sfd2d-msazure
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user

Re: [Cdk-user] Deduce Bond System Tool

Reply via email to