I wonder why I haven't asked this before. How is this done on OrChem?
"These columns provide a quick way to materialize a basic CDK molecule to be
passed into the VF2 algorithm. The data structures used are quite
straightforward, for instance with data in column atom "C O" interpreted as:
"atom 0 is Carbon, atom 1 is Oxygen" and bond
column "0 1 D Y" then implying "there is a bond between C (atom 0) and O
(atom 1) that is double (D) and aromatic is true (Y)". In this way, CDK
molecules can be generated very fast without the need for calculating
any properties during the search."
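To make sure I read the quoted column format correctly, here is a minimal sketch of how I understand it. It does not use OrChem or CDK at all: the class name CompactMolecule and its fields are hypothetical, and I am assuming that several bonds are simply stored as consecutive groups of four tokens (the quoted text only shows a single bond) and that any flag other than "Y" means "not aromatic".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a lightweight molecule graph; not OrChem's or CDK's code. */
final class CompactMolecule {

    /** One bond: two atom indices, an order code and an aromaticity flag. */
    static final class Bond {
        final int from;
        final int to;
        final char order;       // 'D' = double per the quoted text; other codes are assumed
        final boolean aromatic; // true only when the flag token is "Y"

        Bond(int from, int to, char order, boolean aromatic) {
            this.from = from;
            this.to = to;
            this.order = order;
            this.aromatic = aromatic;
        }
    }

    final List<String> symbols;
    final List<Bond> bonds;

    private CompactMolecule(List<String> symbols, List<Bond> bonds) {
        this.symbols = Collections.unmodifiableList(symbols);
        this.bonds = Collections.unmodifiableList(bonds);
    }

    /** Parses an atom column such as "C O" and a bond column such as "0 1 D Y". */
    static CompactMolecule parse(String atomColumn, String bondColumn) {
        List<String> symbols = new ArrayList<>();
        for (String token : atomColumn.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                symbols.add(token);
            }
        }

        List<Bond> bonds = new ArrayList<>();
        String trimmed = bondColumn.trim();
        String[] t = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        if (t.length % 4 != 0) {
            throw new IllegalArgumentException("bond column must hold groups of 4 tokens");
        }
        for (int i = 0; i < t.length; i += 4) {
            int from = Integer.parseInt(t[i]);
            int to = Integer.parseInt(t[i + 1]);
            if (from < 0 || from >= symbols.size() || to < 0 || to >= symbols.size()) {
                throw new IllegalArgumentException("bond refers to an unknown atom index");
            }
            bonds.add(new Bond(from, to, t[i + 2].charAt(0), "Y".equals(t[i + 3])));
        }
        return new CompactMolecule(symbols, bonds);
    }

    public static void main(String[] args) {
        CompactMolecule m = parse("C O", "0 1 D Y");
        Bond b = m.bonds.get(0);
        System.out.println(m.symbols.get(b.from) + " " + b.order + " "
                + m.symbols.get(b.to) + " aromatic=" + b.aromatic);
    }
}
```

Nothing here needs atom typing or hydrogen counts, which is why I am asking below what the matcher really requires.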
Is this VF2 the turbo-substructure algorithm, or a custom one?
Do you create real CDK molecules, or just some kind of graph representation that you
put into your custom VF2?
Which properties do you need for VF2? Implicit hydrogens? Or is it enough to
assign each atom its symbol (e.g. "C") and each bond an order?
I'm kind of confused about the Isomorphism/ExtAtomContainerManipulator classes.
In the init method, if I choose not to remove hydrogens, the search takes a lot
longer. But I have no explicit hydrogens! The ExtAtomContainerManipulator only
seems to replace explicit hydrogens with an implicit hydrogen count, yet that
speeds up the search even though I have no explicit hydrogens?
I get the same number of hits in both cases.
I create molecules like this:
    private IMolecule createMolecule(Integer molId, MDLV2000Reader molReader)
            throws CDKException {
        // Read the next mol file record into a pre-sized Molecule
        Molecule mol = (Molecule) molReader.read(
                (ChemObject) new Molecule(20, 20, 0, 0));
        mol.setID(molId.toString());
        // Perceive CDK atom types and fill in unset atom properties
        AtomContainerManipulator.percieveAtomTypesAndConfigureUnsetProperties(mol);
        // Detect aromaticity and flag the molecule accordingly
        boolean isAromatic =
                CDKHueckelAromaticityDetector.detectAromaticity(mol);
        mol.setFlag(CDKConstants.ISAROMATIC, isAromatic);
        // Add implicit hydrogen counts only; no explicit H atoms are created
        CDKHydrogenAdder hydrogenAdder =
                CDKHydrogenAdder.getInstance(mol.getBuilder());
        hydrogenAdder.addImplicitHydrogens(mol);
        return mol;
    }
So I should not need to do any configuration before subgraph searching, but
apparently I do. That explains why UIT is faster, because the
removeHydrogensExceptSingleAndPreserveAtomID method does a lot of work (it copies
all atoms and bonds).
Not adding hydrogens in the code above (commenting that step out) has no effect.
comparison.init(queryStructure, mol, true, false); is a lot faster than
comparison.init(queryStructure, mol, false, false);
Maybe MDLV2000Reader does something wrong while creating a molecule that is
fixed in removeHydrogensExceptSingleAndPreserveAtomID?
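To put a number on "a lot faster", something like the following could be used. It is only a sketch in plain Java: the class InitTiming and the method medianMillis are names I made up, and it measures any Runnable, so the two init calls would have to be wrapped by hand.

```java
import java.util.Arrays;

/** Hypothetical timing helper; not part of CDK or of the Isomorphism class. */
final class InitTiming {

    /**
     * Runs the task `warmups` times untimed (to let the JIT settle), then
     * `runs` times timed, and returns the median wall-clock time in milliseconds.
     */
    static double medianMillis(Runnable task, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        // For an even number of runs this picks the upper of the two middle values.
        return nanos[runs / 2] / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Dummy workload standing in for one comparison.init(...) call.
        Runnable dummy = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        System.out.println("median ms: " + medianMillis(dummy, 5, 50));
    }
}
```

Each variant, comparison.init(queryStructure, mol, true, false) and comparison.init(queryStructure, mol, false, false), would go into its own Runnable (catching any checked exception init may declare), with both run over the same molecules so the medians are comparable.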
Regards,
Thomas
mol file example:
ZINC21972410
  CDK     0105111003

 17 17  0  0  0  0  0  0  0  0999 V2000
    2.5359    5.6483   -0.1176 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5559    4.1229   -0.0002 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2578    3.7205    1.2984 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1205    3.5933    0.0107 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1410    2.0865    0.0024 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1994    1.4944   -0.0109 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0169    1.3968    0.0097 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0021   -0.0041    0.0020 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1558   -0.6938    0.0094 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2169   -0.1003    0.0227 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1358   -2.1698    0.0013 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3346   -2.8981    0.0147 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2643   -4.2759    0.0123 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0172   -4.8934   -0.0033 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0967   -4.1779   -0.0212 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0758   -2.8614   -0.0198 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9420   -6.2462   -0.0053 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  1  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  5  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  0  0  0  0
  9 10  2  0  0  0  0
  9 11  1  0  0  0  0
 11 16  2  0  0  0  0
 11 12  1  0  0  0  0
 12 13  2  0  0  0  0
 13 14  1  0  0  0  0
 14 17  2  0  0  0  0
 14 15  1  0  0  0  0
 15 16  1  0  0  0  0
M  END
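As a sanity check on "I have no explicit hydrogens", the atom block of the mol file above can be tallied without CDK. The sketch below is an illustration only: MolfileElementCounter is a hypothetical name, and it splits on whitespace rather than using the fixed-width V2000 columns, so it assumes the counts line has a space between the atom and bond counts (true for the example, not for files with 100 or more atoms).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper that tallies element symbols in a V2000 atom block. */
final class MolfileElementCounter {

    /** Returns element symbol -> number of explicit atoms with that symbol. */
    static Map<String, Integer> countElements(String molfile) {
        String[] lines = molfile.split("\\r?\\n");

        // Find the counts line by its version tag instead of a fixed line
        // number, so a lost blank header line does not break the lookup.
        int countsLine = -1;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].trim().endsWith("V2000")) {
                countsLine = i;
                break;
            }
        }
        if (countsLine < 0) {
            throw new IllegalArgumentException("no V2000 counts line found");
        }

        int atomCount = Integer.parseInt(lines[countsLine].trim().split("\\s+")[0]);
        if (countsLine + atomCount >= lines.length) {
            throw new IllegalArgumentException("atom block is truncated");
        }

        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i <= atomCount; i++) {
            // Atom line fields: x, y, z, symbol, then further properties.
            String[] fields = lines[countsLine + i].trim().split("\\s+");
            if (fields.length < 4) {
                throw new IllegalArgumentException("malformed atom line " + i);
            }
            counts.merge(fields[3], 1, Integer::sum);
        }
        return counts;
    }

    /** Number of explicit hydrogen atoms listed in the atom block. */
    static int explicitHydrogens(String molfile) {
        return countElements(molfile).getOrDefault("H", 0);
    }

    public static void main(String[] args) {
        String tiny = "demo\n  CDK\n\n  2  1  0  0  0  0  0  0  0  0999 V2000\n"
                + "    0.0000    0.0000    0.0000 C   0  0\n"
                + "    1.2000    0.0000    0.0000 O   0  0\n"
                + "  1  2  2  0\nM  END\n";
        System.out.println(countElements(tiny) + ", explicit H: " + explicitHydrogens(tiny));
    }
}
```

For the example above this should give 11 C, 3 N and 3 O and no H entry, which matches the atom lines as posted.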
> From: [email protected]
> Date: Thu, 24 Feb 2011 13:18:50 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7
> Isomorphism Class
>
> Hej Thomas,
>
> On Thu, Feb 24, 2011 at 11:30 AM, Thomas Strunz <[email protected]> wrote:
> > Problem:
> > Can't filter based on H Atoms because not all P and S can be "typed"
> > correctly and hence the CDKHydrogenAdder fails and H-Count for the total
> > molecule is wrong
>
> Good. That's something we can work on :)
>
> See my post of this morning:
>
> http://chem-bla-ics.blogspot.com/2011/02/adding-cdk-atom-type.html
>
> We only need to figure out the six properties for the missing atom
> type, and the unit test needs an example structure. Preferably from
> PubChem, as I can convert that programmatically into CDK code, see:
>
> http://chem-bla-ics.blogspot.com/2008/05/wicked-chemistry-and-unit-testing.html
>
> Grtz,
>
> Egon
>
> --
> Dr E.L. Willighagen
> Postdoctoral Researcher
> Institutet för miljömedicin
> Karolinska Institutet
> Homepage: http://egonw.github.com/
> LinkedIn: http://se.linkedin.com/in/egonw
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
_______________________________________________
Cdk-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/cdk-user