Hello Everyone, Its a very interesting topic at least from my point of view :-)
Let me clear two points. a) Subgraph doesn't mean Max Common Subgraph (MCS) b) In CDK UIT (as well as with SMSD) you can obtain possible subgraph(s) of certain resolution/size. You just need a small tweak in the code which returns MCS. Let me know if this help and I will be glad to help. Cheers Asad ************************************************ Syed Asad Rahman (PhD, PG, B.Engg) Research Scientist EMBL-EBI WTGC, Hinxton CB10 1SD Cambridge, UK Phone: +44- (0)1223 492537 Fax: +44- (0)1223 494486 [email protected] www.ebi.ac.uk /~asad ************************************************* On 8 Mar 2010, at 21:33, [email protected] wrote: > Send Cdk-user mailing list submissions to > [email protected] > > To subscribe or unsubscribe via the World Wide Web, visit > https://lists.sourceforge.net/lists/listinfo/cdk-user > or, via email, send a message with subject or body 'help' to > [email protected] > > You can reach the person managing the list at > [email protected] > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Cdk-user digest..." > > > Today's Topics: > > 1. Re: Subgraph Isomorphisms (gilleain torrance) > 2. Re: Subgraph Isomorphisms (Leonid Chepelev) > 3. Re: Subgraph Isomorphisms (Leonid Chepelev) > 4. Re: Subgraph Isomorphisms (gilleain torrance) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 8 Mar 2010 15:59:29 +0000 > From: gilleain torrance <[email protected]> > Subject: Re: [Cdk-user] Subgraph Isomorphisms > To: [email protected] > Message-ID: > <[email protected]> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > The term 'subgraph isomorphism' is (I think) usually used to mean > finding a subgraph of a graph G that is isomorphic to another graph H. > What you describe sounds like the more difficult problem of finding > all common subgraphs of G and H. > > So, I am not so sure that the javadocs are unclear in this case. There > is the possibility that H may be found multiple times in G (a polymer, > for example, will have multiple copies of the monomer in it), which is > presumably why the return value of these methods are lists. > > As for how this could be done - well the simplest (and most expensive) > way would be to enumerate every (connected?) subgraph of H, testing > each one to see if it is also a subgraph of G. > > Gilleain Torrance > > On Mon, Mar 8, 2010 at 11:10 AM, Leonid Chepelev > <[email protected]> wrote: >> Hello All, >> >> When trying to find all the subgraph isomorphisms, one would anticipate from >> the javadocs?that the use of the >> UniversalIsomorphismTester.getSubgraphAtomsMaps of CDK 1.2.5?would >> return?ALL the subgraph isomorphisms in a molecule of interest. However, >> that is not the case. I am posting the code below to provide an example of >> this behaviour: for the two molecules in question, we can see that there is >> only one MCS from the MCSS code?(as one would expect), but also only two >> lists of RMaps from getSubgraphAtomsMaps, indicating only two subgraphs >> identified. >> >> Do you believe it would be possible to get ALL the subgraphs, no matter how >> small, into the getSubgraphAtomsMaps, as one would anticipate from the >> description, or is there another way to do this within (or outside CDK)? >> >> Am?I doing things correctly? >> >> If the behaviour of getSubgraphAtomsMaps is intentionally such that it only >> returns?n biggest subgraph matches, may I kindly?suggest that the javadoc >> description be changed? >> >> Regards, >> >> Leonid Chepelev >> >> *************************************************************** >> import java.util.Iterator; >> import java.util.List; >> import org.openscience.cdk.io.SMILESWriter; >> import java.io.StringWriter; >> import org.openscience.cdk.DefaultChemObjectBuilder; >> import org.openscience.cdk.interfaces.IAtomContainer; >> import org.openscience.cdk.isomorphism.mcss.RMap; >> import org.openscience.cdk.Molecule; >> import org.openscience.cdk.smiles.SmilesParser; >> import org.openscience.cdk.isomorphism.UniversalIsomorphismTester; >> public class Main { >> ??? public static void main(String[] args) throws Exception { >> ??????? SmilesParser smilesParser = new >> SmilesParser(DefaultChemObjectBuilder.getInstance()); >> ??????? String referenceMolecule = "C1CCCCC1CC(=O)OCCCC"; >> ??????? String testmolecule = "C1CCCCC1CC(=O)O"; >> ??????? IAtomContainer molecule = >> smilesParser.parseSmiles(referenceMolecule); >> ??????? IAtomContainer lookup = smilesParser.parseSmiles(testmolecule); >> ??????? List allthefragments = >> UniversalIsomorphismTester.getOverlaps(molecule, lookup); >> ??????? for (Iterator<IAtomContainer> i = allthefragments.iterator(); >> i.hasNext( ); ) { >> ???????? IAtomContainer s = i.next( ); >> ???????? try { >> ???????????? StringWriter funna = new StringWriter(); >> ???????????? SMILESWriter writemysmiles = new SMILESWriter(); >> ???????????? writemysmiles.setWriter(funna); >> ???????????? Molecule themoleculetowrite = new Molecule(s); >> ???????????? writemysmiles.write(themoleculetowrite); >> ???????????? writemysmiles.close(); >> ???????????? System.out.println(funna.toString()); >> ???????????? funna.close(); >> ???????????? } catch (Exception e) { >> ??????????????? System.out.println(e.toString()); >> ???????????? } >> ??????? } >> ??????? List<List<RMap>> atommappingsad = >> UniversalIsomorphismTester.getSubgraphAtomsMaps(molecule, lookup); >> ???????? for (List<RMap> j:atommappingsad) { >> ??????????? for (RMap estro:j){ >> ??????????????? try { >> ???????????????????? int firsta = estro.getId1(); >> ???????????????????? int seconda = estro.getId2(); >> ???????????????????? System.out.println("Atom " + firsta + " in G1?maps to >> atom " + seconda + " in G2."); >> ??????????????? } catch (Exception e) { >> ???????????????????? System.out.println(e.toString()); >> ??????????????? } >> ??????????? } >> ???????? System.out.println("End of mapped fragment..."); >> ???????? } >> ?? } >> } >> ------------------------------------------------------------------------------ >> Download Intel® Parallel Studio Eval >> Try the new software tools for yourself. Speed compiling, find bugs >> proactively, and fine-tune applications for parallel performance. >> See why Intel Parallel Studio got high marks during beta. >> http://p.sf.net/sfu/intel-sw-dev >> _______________________________________________ >> Cdk-user mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/cdk-user >> >> > > > > ------------------------------ > > Message: 2 > Date: Mon, 8 Mar 2010 13:29:56 -0500 > From: Leonid Chepelev <[email protected]> > Subject: Re: [Cdk-user] Subgraph Isomorphisms > To: [email protected] > Message-ID: > <[email protected]> > Content-Type: text/plain; charset="iso-8859-1" > > Thank you for answering, Gilleain. > > Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >> finding a subgraph of a graph G that is isomorphic to another graph H. What >> you describe sounds like the more difficult problem of finding all common >> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >> this case. > > > Hm, I couldn't really find an authoritative definition - and I am certain a > number of people could argue about this, but from the literature I've seen, > I always thought that two graphs may have a large number of isomorphisms, > and it is the largest isomorphism that is selected by procedures like > maximal common substructure searching, e.g. via the Ullman algorithm. When > one starts with an atom/bond pair in G1 and G2 and attempts to iteratively > extend the initial matched fragment, it would seem a trivial task to output > at every step of the iterative extension the subgraph, or a part of it, that > does not yet violate edge/node matching criteria set forth in the subgraph > isomorphism searching algorithm. As far as I understood, these subgraphs are > collected anyway for the purposes of comparison, unless superseded by an > extension, and only the result of the longest extension-matching procedure > is selected as the MCS. So, I erred and believed that the > getSubgraphAtomsMaps would give me precisely that, a collection of all the > fragments that match and their variations. > > There is the possibility that H may be found multiple times in G (a polymer, >> for example, will have multiple copies of the monomer in it), which is >> sumably why the return value of these methods are lists. > > > Again, I thought that was for the MCS, but please, feel free to let me know > if I was wrong. I just wanted to make certain that CDK does not currently do > this and that I am not re-inventing the wheel here. And again, I suppose I > just misinterpreted the javadoc, sorry. > > As for how this could be done - well the simplest (and most expensive) way >> would be to enumerate every (connected?) subgraph of H, testing each one to >> see if it is also a subgraph of G. Gilleain Torrance > > > That's one method that I have tried, however, in many cases, the cost of > enumerating all the possible connected subgraphs of H is astronomical. For > instance, if we cut five bonds of a molecule that contains 40 bonds, we will > have about 7.5e10 fragments to enumerate. If we engage in growing fragments > with all possible branches, that also often leads to combinatorial > explosions. I have created an 'approximate solution', which is more > efficient than what I am outlining here, but it's still an approximation. > So, for all practical purposes, that is not a feasible approach. Well, it > would have been easier to do it as I mentioned, that all fragments beyond a > certain size would be reported in the generic collection at every iteration > of the subgraph isomorphism detection tool. This has already been done and > published numerous times, without the release of the source code. Oh well, > if no one currently does this, I guess I'll go ahead and reinvent the wheel > yet again. > > Any other opinions, anyone? I know you are there! > > Regards, > > Leonid Chepelev > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > Message: 3 > Date: Mon, 8 Mar 2010 13:32:46 -0500 > From: Leonid Chepelev <[email protected]> > Subject: Re: [Cdk-user] Subgraph Isomorphisms > To: [email protected] > Message-ID: > <[email protected]> > Content-Type: text/plain; charset="iso-8859-1" > > Sorry, in the previous message, 5 bonds out of 40 should be 10 bonds out of > 60. The point is still valid, however. > > Regards, > > Leonid Chepelev > > On Mon, Mar 8, 2010 at 1:29 PM, Leonid Chepelev > <[email protected]>wrote: > >> Thank you for answering, Gilleain. >> >> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >>> finding a subgraph of a graph G that is isomorphic to another graph H. What >>> you describe sounds like the more difficult problem of finding all common >>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >>> this case. >> >> >> Hm, I couldn't really find an authoritative definition - and I am certain a >> number of people could argue about this, but from the literature I've seen, >> I always thought that two graphs may have a large number of isomorphisms, >> and it is the largest isomorphism that is selected by procedures like >> maximal common substructure searching, e.g. via the Ullman algorithm. When >> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively >> extend the initial matched fragment, it would seem a trivial task to output >> at every step of the iterative extension the subgraph, or a part of it, that >> does not yet violate edge/node matching criteria set forth in the subgraph >> isomorphism searching algorithm. As far as I understood, these subgraphs are >> collected anyway for the purposes of comparison, unless superseded by an >> extension, and only the result of the longest extension-matching procedure >> is selected as the MCS. So, I erred and believed that the >> getSubgraphAtomsMaps would give me precisely that, a collection of all the >> fragments that match and their variations. >> >> There is the possibility that H may be found multiple times in G (a >>> polymer, for example, will have multiple copies of the monomer in it), which >>> is sumably why the return value of these methods are lists. >> >> >> Again, I thought that was for the MCS, but please, feel free to let me know >> if I was wrong. I just wanted to make certain that CDK does not currently do >> this and that I am not re-inventing the wheel here. And again, I suppose I >> just misinterpreted the javadoc, sorry. >> >> As for how this could be done - well the simplest (and most expensive) way >>> would be to enumerate every (connected?) subgraph of H, testing each one to >>> see if it is also a subgraph of G. Gilleain Torrance >> >> >> That's one method that I have tried, however, in many cases, the cost of >> enumerating all the possible connected subgraphs of H is astronomical. For >> instance, if we cut five bonds of a molecule that contains 40 bonds, we will >> have about 7.5e10 fragments to enumerate. If we engage in growing fragments >> with all possible branches, that also often leads to combinatorial >> explosions. I have created an 'approximate solution', which is more >> efficient than what I am outlining here, but it's still an approximation. >> So, for all practical purposes, that is not a feasible approach. Well, it >> would have been easier to do it as I mentioned, that all fragments beyond a >> certain size would be reported in the generic collection at every iteration >> of the subgraph isomorphism detection tool. This has already been done and >> published numerous times, without the release of the source code. Oh well, >> if no one currently does this, I guess I'll go ahead and reinvent the wheel >> yet again. >> >> Any other opinions, anyone? I know you are there! >> >> Regards, >> >> Leonid Chepelev >> >> > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > Message: 4 > Date: Mon, 8 Mar 2010 21:33:11 +0000 > From: gilleain torrance <[email protected]> > Subject: Re: [Cdk-user] Subgraph Isomorphisms > To: [email protected] > Message-ID: > <[email protected]> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi, > > No problem. I should point out first that I am not a mathematician, > but I have done some small work on implementing maximal common > subgraph and subgraph isomorphism. Admittedly not in chemicals, but in > protein folds. Terminology can differ, so I agree that the javadocs > should be as clear as possible. Perhaps a sentence or two on what the > code means when it says 'subgraph isomorphism' wouldn't go amiss. > > I don't know much about the workings of the UIT, but it does sound > like the scheme you outline would work. Assuming, that is, that that > every possible starting point is taken. In other words, for each atom > of the smaller graph the substructure is extended as much as possible. > I guess that must happen, otherwise it wouldn't work. > > Anyway, looking at the code, it seems that there is an object called > an 'RGraph' or 'Resolution graph' that I would probably call an edge > product graph. The maximal clique in an edge product graph is the > maximum common substructure between the two original graphs. I haven't > read enough of the code yet to know if this is what it is doing. The > RGraph class is quite interesting though. > > You are quite right when you say that trying every possible subgraph > would be very computationally expensive - it wasn't a very helpful > suggestion on my part! What might also be helpful would be if I > finished off the signature stuff, because then this task could be > achieved by generating canonical signatures at heights up to the > diameter of the smaller graph, and finding matches. Slightly more > tricky would be to find the atom-atom mappings. > > Anyway, hope that the CDK can help with whatever this project is :) > > Gilleain > > On Mon, Mar 8, 2010 at 6:29 PM, Leonid Chepelev > <[email protected]> wrote: >> Thank you for answering, Gilleain. >>> >>> Hello, The term 'subgraph isomorphism' is (I think) usually used to mean >>> finding a subgraph of a graph G that is isomorphic to another graph H.?What >>> you describe sounds like the more difficult problem of finding all common >>> subgraphs of G and H. So, I am not so sure that the javadocs are unclear in >>> this case. >> >> Hm, I couldn't really find an authoritative definition - and I am certain a >> number of people could argue about this, but from the literature I've seen, >> I always thought that two graphs may have a large number of isomorphisms, >> and it is the largest isomorphism that is selected by procedures like >> maximal common substructure searching, e.g. via the Ullman algorithm. When >> one starts with an atom/bond pair in G1 and G2 and attempts to iteratively >> extend the initial matched fragment, it would seem a trivial task to output >> at every step of the iterative extension the subgraph, or a part of it, that >> does not yet violate edge/node matching criteria set forth in the subgraph >> isomorphism searching algorithm. As far as I understood, these subgraphs are >> collected anyway for the purposes of comparison, unless superseded by an >> extension, and only the result of the longest extension-matching procedure >> is selected as the MCS. So, I erred and believed that the >> getSubgraphAtomsMaps would give me precisely that, a collection of all the >> fragments that match and their variations. >>> >>> There is the possibility that H may be found multiple times in G (a >>> polymer, for example, will have multiple copies of the monomer in it), which >>> is sumably why the return value of these methods are lists. >> >> Again, I thought that was for the MCS, but please, feel free to let me know >> if I was wrong. I just wanted to make certain that CDK does not currently do >> this and that I am not re-inventing the wheel here. And again, I suppose I >> just misinterpreted the javadoc, sorry. >>> >>> ?As for how this could be done - well the simplest (and most expensive) >>> way would be to enumerate every (connected?) subgraph of H, testing each one >>> to see if it is also a subgraph of G. Gilleain Torrance >> >> That's one method that I have tried, however, in many cases, the cost of >> enumerating all the possible connected subgraphs of H is astronomical. For >> instance, if we cut five bonds of a molecule that contains 40 bonds, we will >> have about 7.5e10 fragments to enumerate. If we engage in growing fragments >> with all possible branches, that also often leads to combinatorial >> explosions. I have created an 'approximate solution', which is more >> efficient than what I am outlining here, but it's still an approximation. >> So, for all practical purposes, that is not a feasible approach. Well, it >> would have been easier to do it as I mentioned, that all fragments beyond a >> certain size would be reported in the generic collection at every iteration >> of the subgraph isomorphism detection tool. This has already been done and >> published numerous times, without the release of the source code. Oh well, >> if no one currently does this, I guess I'll go ahead and reinvent the wheel >> yet again. >> Any other opinions, anyone? I know you are there! >> Regards, >> Leonid Chepelev >> >> ------------------------------------------------------------------------------ >> Download Intel® Parallel Studio Eval >> Try the new software tools for yourself. Speed compiling, find bugs >> proactively, and fine-tune applications for parallel performance. >> See why Intel Parallel Studio got high marks during beta. >> http://p.sf.net/sfu/intel-sw-dev >> _______________________________________________ >> Cdk-user mailing list >> [email protected] >> https://lists.sourceforge.net/lists/listinfo/cdk-user >> >> > > > > ------------------------------ > > ------------------------------------------------------------------------------ > Download Intel® Parallel Studio Eval > Try the new software tools for yourself. Speed compiling, find bugs > proactively, and fine-tune applications for parallel performance. > See why Intel Parallel Studio got high marks during beta. > http://p.sf.net/sfu/intel-sw-dev > > ------------------------------ > > _______________________________________________ > Cdk-user mailing list > [email protected] > https://lists.sourceforge.net/lists/listinfo/cdk-user > > > End of Cdk-user Digest, Vol 46, Issue 3 > *************************************** ------------------------------------------------------------------------------ Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev _______________________________________________ Cdk-user mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/cdk-user

