Re: [Cdk-user] Cdk-user Digest, Vol 46, Issue 3

Syed Asad Rahman Tue, 09 Mar 2010 00:01:54 -0800

Hello Everyone,

Its a very interesting topic at least from my point of view :-)


Let me clear two points.

a) Subgraph doesn't mean Max Common Subgraph (MCS)
b) In CDK UIT  (as well as with SMSD) you can obtain possible subgraph(s) of 
certain resolution/size. You just need a small tweak in the code which returns 
MCS.

Let me know if this help and I will be glad to help.

Cheers


Asad



************************************************
Syed Asad Rahman (PhD, PG, B.Engg)
Research Scientist

EMBL-EBI     
WTGC,  
Hinxton CB10 1SD                                           
Cambridge, UK   

Phone: +44- (0)1223 492537      
Fax:   +44- (0)1223 494486        
[email protected]     
www.ebi.ac.uk /~asad        
                
*************************************************

On 8 Mar 2010, at 21:33, [email protected] wrote:

> Send Cdk-user mailing list submissions to
>       [email protected]
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>       https://lists.sourceforge.net/lists/listinfo/cdk-user
> or, via email, send a message with subject or body 'help' to
>       [email protected]
> 
> You can reach the person managing the list at
>       [email protected]
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Cdk-user digest..."
> 
> 
> Today's Topics:
> 
>   1. Re: Subgraph Isomorphisms (gilleain torrance)
>   2. Re: Subgraph Isomorphisms (Leonid Chepelev)
>   3. Re: Subgraph Isomorphisms (Leonid Chepelev)
>   4. Re: Subgraph Isomorphisms (gilleain torrance)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 8 Mar 2010 15:59:29 +0000
> From: gilleain torrance <[email protected]>
> Subject: Re: [Cdk-user] Subgraph Isomorphisms
> To: [email protected]
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hello,
> 
> The term 'subgraph isomorphism' is (I think) usually used to mean
> finding a subgraph of a graph G that is isomorphic to another graph H.
> What you describe sounds like the more difficult problem of finding
> all common subgraphs of G and H.
> 
> So, I am not so sure that the javadocs are unclear in this case. There
> is the possibility that H may be found multiple times in G (a polymer,
> for example, will have multiple copies of the monomer in it), which is
> presumably why the return value of these methods are lists.
> 
> As for how this could be done - well the simplest (and most expensive)
> way would be to enumerate every (connected?) subgraph of H, testing
> each one to see if it is also a subgraph of G.
> 
> Gilleain Torrance
> 
> On Mon, Mar 8, 2010 at 11:10 AM, Leonid Chepelev
> <[email protected]> wrote:
>> Hello All,
>> 
>> When trying to find all the subgraph isomorphisms, one would anticipate from
>> the javadocs?that the use of the
>> UniversalIsomorphismTester.getSubgraphAtomsMaps of CDK 1.2.5?would
>> return?ALL the subgraph isomorphisms in a molecule of interest. However,
>> that is not the case. I am posting the code below to provide an example of
>> this behaviour: for the two molecules in question, we can see that there is
>> only one MCS from the MCSS code?(as one would expect), but also only two
>> lists of RMaps from getSubgraphAtomsMaps, indicating only two subgraphs
>> identified.
>> 
>> Do you believe it would be possible to get ALL the subgraphs, no matter how
>> small, into the getSubgraphAtomsMaps, as one would anticipate from the
>> description, or is there another way to do this within (or outside CDK)?
>> 
>> Am?I doing things correctly?
>> 
>> If the behaviour of getSubgraphAtomsMaps is intentionally such that it only
>> returns?n biggest subgraph matches, may I kindly?suggest that the javadoc
>> description be changed?
>> 
>> Regards,
>> 
>> Leonid Chepelev
>> 
>> ***************************************************************
>> import java.util.Iterator;
>> import java.util.List;
>> import org.openscience.cdk.io.SMILESWriter;
>> import java.io.StringWriter;
>> import org.openscience.cdk.DefaultChemObjectBuilder;
>> import org.openscience.cdk.interfaces.IAtomContainer;
>> import org.openscience.cdk.isomorphism.mcss.RMap;
>> import org.openscience.cdk.Molecule;
>> import org.openscience.cdk.smiles.SmilesParser;
>> import org.openscience.cdk.isomorphism.UniversalIsomorphismTester;
>> public class Main {
>> ??? public static void main(String[] args) throws Exception {
>> ??????? SmilesParser smilesParser = new
>> SmilesParser(DefaultChemObjectBuilder.getInstance());
>> ??????? String referenceMolecule = "C1CCCCC1CC(=O)OCCCC";
>> ??????? String testmolecule = "C1CCCCC1CC(=O)O";
>> ??????? IAtomContainer molecule =
>> smilesParser.parseSmiles(referenceMolecule);
>> ??????? IAtomContainer lookup = smilesParser.parseSmiles(testmolecule);
>> ??????? List allthefragments =
>> UniversalIsomorphismTester.getOverlaps(molecule, lookup);
>> ??????? for (Iterator<IAtomContainer> i = allthefragments.iterator();
>> i.hasNext( ); ) {
>> ???????? IAtomContainer s = i.next( );
>> ???????? try {
>> ???????????? StringWriter funna = new StringWriter();
>> ???????????? SMILESWriter writemysmiles = new SMILESWriter();
>> ???????????? writemysmiles.setWriter(funna);
>> ???????????? Molecule themoleculetowrite = new Molecule(s);
>> ???????????? writemysmiles.write(themoleculetowrite);
>> ???????????? writemysmiles.close();
>> ???????????? System.out.println(funna.toString());
>> ???????????? funna.close();
>> ???????????? } catch (Exception e) {
>> ??????????????? System.out.println(e.toString());
>> ???????????? }
>> ??????? }
>> ??????? List<List<RMap>> atommappingsad =
>> UniversalIsomorphismTester.getSubgraphAtomsMaps(molecule, lookup);
>> ???????? for (List<RMap> j:atommappingsad) {
>> ??????????? for (RMap estro:j){
>> ??????????????? try {
>> ???????????????????? int firsta = estro.getId1();
>> ???????????????????? int seconda = estro.getId2();
>> ???????????????????? System.out.println("Atom " + firsta + " in G1?maps to
>> atom " + seconda + " in G2.");
>> ??????????????? } catch (Exception e) {
>> ???????????????????? System.out.println(e.toString());
>> ??????????????? }
>> ??????????? }
>> ???????? System.out.println("End of mapped fragment...");
>> ???????? }
>> ?? }
>> }
>> ------------------------------------------------------------------------------
>> Download Intel&#174; Parallel Studio Eval
>> Try the new software tools for yourself. Speed compiling, find bugs
>> proactively, and fine-tune applications for parallel performance.
>> See why Intel Parallel Studio got high marks during beta.
>> http://p.sf.net/sfu/intel-sw-dev
>> _______________________________________________
>> Cdk-user mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/cdk-user
>> 
>> 
> 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 8 Mar 2010 13:29:56 -0500
> From: Leonid Chepelev <[email protected]>
> Subject: Re: [Cdk-user] Subgraph Isomorphisms
> To: [email protected]
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Thank you for answering, Gilleain.
> 
> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean
>> finding a subgraph of a graph G that is isomorphic to another graph H. What
>> you describe sounds like the more difficult problem of finding all common
>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in
>> this case.
> 
> 
> Hm, I couldn't really find an authoritative definition - and I am certain a
> number of people could argue about this, but from the literature I've seen,
> I always thought that two graphs may have a large number of isomorphisms,
> and it is the largest isomorphism that is selected by procedures like
> maximal common substructure searching, e.g. via the Ullman algorithm. When
> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively
> extend the initial matched fragment, it would seem a trivial task to output
> at every step of the iterative extension the subgraph, or a part of it, that
> does not yet violate edge/node matching criteria set forth in the subgraph
> isomorphism searching algorithm. As far as I understood, these subgraphs are
> collected anyway for the purposes of comparison, unless superseded by an
> extension, and only the result of the longest extension-matching procedure
> is selected as the MCS. So, I erred and believed that the
> getSubgraphAtomsMaps would give me precisely that, a collection of all the
> fragments that match and their variations.
> 
> There is the possibility that H may be found multiple times in G (a polymer,
>> for example, will have multiple copies of the monomer in it), which is
>> sumably why the return value of these methods are lists.
> 
> 
> Again, I thought that was for the MCS, but please, feel free to let me know
> if I was wrong. I just wanted to make certain that CDK does not currently do
> this and that I am not re-inventing the wheel here. And again, I suppose I
> just misinterpreted the javadoc, sorry.
> 
> As for how this could be done - well the simplest (and most expensive) way
>> would be to enumerate every (connected?) subgraph of H, testing each one to
>> see if it is also a subgraph of G. Gilleain Torrance
> 
> 
> That's one method that I have tried, however, in many cases, the cost of
> enumerating all the possible connected subgraphs of H is astronomical. For
> instance, if we cut five bonds of a molecule that contains 40 bonds, we will
> have about 7.5e10 fragments to enumerate. If we engage in growing fragments
> with all possible branches, that also often leads to combinatorial
> explosions. I have created an 'approximate solution', which is more
> efficient than what I am outlining here, but it's still an approximation.
> So, for all practical purposes, that is not a feasible approach. Well, it
> would have been easier to do it as I mentioned, that all fragments beyond a
> certain size would be reported in the generic collection at every iteration
> of the subgraph isomorphism detection tool. This has already been done and
> published numerous times, without the release of the source code. Oh well,
> if no one currently does this, I guess I'll go ahead and reinvent the wheel
> yet again.
> 
> Any other opinions, anyone? I know you are there!
> 
> Regards,
> 
> Leonid Chepelev
> -------------- next part --------------
> An HTML attachment was scrubbed...
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 8 Mar 2010 13:32:46 -0500
> From: Leonid Chepelev <[email protected]>
> Subject: Re: [Cdk-user] Subgraph Isomorphisms
> To: [email protected]
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Sorry, in the previous message, 5 bonds out of 40 should be 10 bonds out of
> 60. The point is still valid, however.
> 
> Regards,
> 
> Leonid Chepelev
> 
> On Mon, Mar 8, 2010 at 1:29 PM, Leonid Chepelev
> <[email protected]>wrote:
> 
>> Thank you for answering, Gilleain.
>> 
>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean
>>> finding a subgraph of a graph G that is isomorphic to another graph H. What
>>> you describe sounds like the more difficult problem of finding all common
>>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in
>>> this case.
>> 
>> 
>> Hm, I couldn't really find an authoritative definition - and I am certain a
>> number of people could argue about this, but from the literature I've seen,
>> I always thought that two graphs may have a large number of isomorphisms,
>> and it is the largest isomorphism that is selected by procedures like
>> maximal common substructure searching, e.g. via the Ullman algorithm. When
>> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively
>> extend the initial matched fragment, it would seem a trivial task to output
>> at every step of the iterative extension the subgraph, or a part of it, that
>> does not yet violate edge/node matching criteria set forth in the subgraph
>> isomorphism searching algorithm. As far as I understood, these subgraphs are
>> collected anyway for the purposes of comparison, unless superseded by an
>> extension, and only the result of the longest extension-matching procedure
>> is selected as the MCS. So, I erred and believed that the
>> getSubgraphAtomsMaps would give me precisely that, a collection of all the
>> fragments that match and their variations.
>> 
>> There is the possibility that H may be found multiple times in G (a
>>> polymer, for example, will have multiple copies of the monomer in it), which
>>> is sumably why the return value of these methods are lists.
>> 
>> 
>> Again, I thought that was for the MCS, but please, feel free to let me know
>> if I was wrong. I just wanted to make certain that CDK does not currently do
>> this and that I am not re-inventing the wheel here. And again, I suppose I
>> just misinterpreted the javadoc, sorry.
>> 
>> As for how this could be done - well the simplest (and most expensive) way
>>> would be to enumerate every (connected?) subgraph of H, testing each one to
>>> see if it is also a subgraph of G. Gilleain Torrance
>> 
>> 
>> That's one method that I have tried, however, in many cases, the cost of
>> enumerating all the possible connected subgraphs of H is astronomical. For
>> instance, if we cut five bonds of a molecule that contains 40 bonds, we will
>> have about 7.5e10 fragments to enumerate. If we engage in growing fragments
>> with all possible branches, that also often leads to combinatorial
>> explosions. I have created an 'approximate solution', which is more
>> efficient than what I am outlining here, but it's still an approximation.
>> So, for all practical purposes, that is not a feasible approach. Well, it
>> would have been easier to do it as I mentioned, that all fragments beyond a
>> certain size would be reported in the generic collection at every iteration
>> of the subgraph isomorphism detection tool. This has already been done and
>> published numerous times, without the release of the source code. Oh well,
>> if no one currently does this, I guess I'll go ahead and reinvent the wheel
>> yet again.
>> 
>> Any other opinions, anyone? I know you are there!
>> 
>> Regards,
>> 
>> Leonid Chepelev
>> 
>> 
> -------------- next part --------------
> An HTML attachment was scrubbed...
> 
> ------------------------------
> 
> Message: 4
> Date: Mon, 8 Mar 2010 21:33:11 +0000
> From: gilleain torrance <[email protected]>
> Subject: Re: [Cdk-user] Subgraph Isomorphisms
> To: [email protected]
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi,
> 
> No problem. I should point out first that I am not a mathematician,
> but I have done some small work on implementing maximal common
> subgraph and subgraph isomorphism. Admittedly not in chemicals, but in
> protein folds. Terminology can differ, so I agree that the javadocs
> should be as clear as possible. Perhaps a sentence or two on what the
> code means when it says 'subgraph isomorphism' wouldn't go amiss.
> 
> I don't know much about the workings of the UIT, but it does sound
> like the scheme you outline would work. Assuming, that is, that that
> every possible starting point is taken. In other words, for each atom
> of the smaller graph the substructure is extended as much as possible.
> I guess that must happen, otherwise it wouldn't work.
> 
> Anyway, looking at the code, it seems that there is an object called
> an 'RGraph' or 'Resolution graph' that I would probably call an edge
> product graph. The maximal clique in an edge product graph is the
> maximum common substructure between the two original graphs. I haven't
> read enough of the code yet to know if this is what it is doing. The
> RGraph class is quite interesting though.
> 
> You are quite right when you say that trying every possible subgraph
> would be very computationally expensive - it wasn't a very helpful
> suggestion on my part! What might also be helpful would be if I
> finished off the signature stuff, because then this task could be
> achieved by generating canonical signatures at heights up to the
> diameter of the smaller graph, and finding matches. Slightly more
> tricky would be to find the atom-atom mappings.
> 
> Anyway, hope that the CDK can help with whatever this project is :)
> 
> Gilleain
> 
> On Mon, Mar 8, 2010 at 6:29 PM, Leonid Chepelev
> <[email protected]> wrote:
>> Thank you for answering, Gilleain.
>>> 
>>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean
>>> finding a subgraph of a graph G that is isomorphic to another graph H.?What
>>> you describe sounds like the more difficult problem of finding all common
>>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in
>>> this case.
>> 
>> Hm, I couldn't really find an authoritative definition - and I am certain a
>> number of people could argue about this, but from the literature I've seen,
>> I always thought that two graphs may have a large number of isomorphisms,
>> and it is the largest isomorphism that is selected by procedures like
>> maximal common substructure searching, e.g. via the Ullman algorithm. When
>> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively
>> extend the initial matched fragment, it would seem a trivial task to output
>> at every step of the iterative extension the subgraph, or a part of it, that
>> does not yet violate edge/node matching criteria set forth in the subgraph
>> isomorphism searching algorithm. As far as I understood, these subgraphs are
>> collected anyway for the purposes of comparison, unless superseded by an
>> extension, and only the result of the longest extension-matching procedure
>> is selected as the MCS. So, I erred and believed that the
>> getSubgraphAtomsMaps would give me precisely that, a collection of all the
>> fragments that match and their variations.
>>> 
>>> There is the possibility that H may be found multiple times in G (a
>>> polymer, for example, will have multiple copies of the monomer in it), which
>>> is sumably why the return value of these methods are lists.
>> 
>> Again, I thought that was for the MCS, but please, feel free to let me know
>> if I was wrong. I just wanted to make certain that CDK does not currently do
>> this and that I am not re-inventing the wheel here. And again, I suppose I
>> just misinterpreted the javadoc, sorry.
>>> 
>>> ?As for how this could be done - well the simplest (and most expensive)
>>> way would be to enumerate every (connected?) subgraph of H, testing each one
>>> to see if it is also a subgraph of G. Gilleain Torrance
>> 
>> That's one method that I have tried, however, in many cases, the cost of
>> enumerating all the possible connected subgraphs of H is astronomical. For
>> instance, if we cut five bonds of a molecule that contains 40 bonds, we will
>> have about 7.5e10 fragments to enumerate. If we engage in growing fragments
>> with all possible branches, that also often leads to combinatorial
>> explosions. I have created an 'approximate solution', which is more
>> efficient than what I am outlining here, but it's still an approximation.
>> So, for all practical purposes, that is not a feasible approach. Well, it
>> would have been easier to do it as I mentioned, that all fragments beyond a
>> certain size would be reported in the generic collection at every iteration
>> of the subgraph isomorphism detection tool. This has already been done and
>> published numerous times, without the release of the source code. Oh well,
>> if no one currently does this, I guess I'll go ahead and reinvent the wheel
>> yet again.
>> Any other opinions, anyone? I know you are there!
>> Regards,
>> Leonid Chepelev
>> 
>> ------------------------------------------------------------------------------
>> Download Intel&#174; Parallel Studio Eval
>> Try the new software tools for yourself. Speed compiling, find bugs
>> proactively, and fine-tune applications for parallel performance.
>> See why Intel Parallel Studio got high marks during beta.
>> http://p.sf.net/sfu/intel-sw-dev
>> _______________________________________________
>> Cdk-user mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/cdk-user
>> 
>> 
> 
> 
> 
> ------------------------------
> 
> ------------------------------------------------------------------------------
> Download Intel&#174; Parallel Studio Eval
> Try the new software tools for yourself. Speed compiling, find bugs
> proactively, and fine-tune applications for parallel performance.
> See why Intel Parallel Studio got high marks during beta.
> http://p.sf.net/sfu/intel-sw-dev
> 
> ------------------------------
> 
> _______________________________________________
> Cdk-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/cdk-user
> 
> 
> End of Cdk-user Digest, Vol 46, Issue 3
> ***************************************


------------------------------------------------------------------------------
Download Intel&#174; Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user

Re: [Cdk-user] Cdk-user Digest, Vol 46, Issue 3

Reply via email to