Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 Isomorphism Class

Thomas Strunz Fri, 25 Feb 2011 00:33:22 -0800

I wonder why I haven't asked this before. How is this done on OrChem? 

"These columns provide a quick way to materialize a basic CDK molecule to be 
passed into the VF2 algorithm.  The data structures used are quite 
straightforward, for instance with data in column atom "C O" interpreted as: 
"atom 0 is Carbon, atom 1 is Oxygen" and bond
 column "0 1 D Y" then implying "there is a bond between C (atom 0) and O
 (atom 1) that is double (D) and aromatic is true (Y)". In this way, CDK
 molecules can be generated very fast without the need for calculating 
any properties during the search."
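
To make sure I read that description correctly, here is a minimal sketch of how such columns could be turned into a plain graph. It is not OrChem code and does not use CDK. The class name CompactMolecule, the method parse, and the assumption that the bond column is a flat run of four-token groups (from, to, order, aromatic) are my own guesses for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, CDK-free illustration of the compact column encoding quoted above. */
final class CompactMolecule {

    /** One bond: two indices into the atom list, an order code and an aromatic flag. */
    static final class Bond {
        final int from;
        final int to;
        final char order;       // 'D' = double is the only code given in the quote
        final boolean aromatic; // 'Y' in the column means aromatic

        Bond(int from, int to, char order, boolean aromatic) {
            this.from = from;
            this.to = to;
            this.order = order;
            this.aromatic = aromatic;
        }
    }

    final List<String> symbols = new ArrayList<>();
    final List<Bond> bonds = new ArrayList<>();

    /** Parses e.g. atomColumn "C O" and bondColumn "0 1 D Y". */
    static CompactMolecule parse(String atomColumn, String bondColumn) {
        CompactMolecule m = new CompactMolecule();
        for (String s : atomColumn.trim().split("\\s+")) {
            if (!s.isEmpty()) {
                m.symbols.add(s); // list position is the atom index
            }
        }
        String[] t = bondColumn.trim().split("\\s+");
        // Assumption: bonds are stored as repeating groups of four tokens.
        for (int i = 0; i + 3 < t.length; i += 4) {
            m.bonds.add(new Bond(
                    Integer.parseInt(t[i]),
                    Integer.parseInt(t[i + 1]),
                    t[i + 2].charAt(0),
                    t[i + 3].equals("Y")));
        }
        return m;
    }
}
```

With the quoted example, parse("C O", "0 1 D Y") gives two atoms and one bond between atom 0 and atom 1, with no atom typing or hydrogen handling involved.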


Is this VF2 the turbo-substructure algorithm? Or a custom one?
Do you create real CDK molecules, or just some kind of graph representation that 
you put into your custom VF2?

Which properties do you need for VF2? Implicit hydrogens? Or is it enough to 
assign each atom its symbol (e.g. "C") and each bond an order?

I'm kind of confused about the Isomorphism/ExtAtomContainerManipulator classes. 
In the init method, if I choose not to remove hydrogens, the search takes a lot 
longer. But I have no explicit hydrogens! The ExtAtomContainerManipulator only 
seems to replace explicit hydrogens with an implicit hydrogen count, yet that 
speeds up the search even though I have no explicit hydrogens?
I get the same number of hits in both cases.

I create molecules like this:

private IMolecule createMolecule(Integer molId, MDLV2000Reader molReader)
        throws CDKException {

    Molecule mol = (Molecule) molReader.read((ChemObject) new Molecule(20, 20, 0, 0));
    mol.setID(molId.toString());

    AtomContainerManipulator.percieveAtomTypesAndConfigureUnsetProperties(mol);
    boolean isAromatic = CDKHueckelAromaticityDetector.detectAromaticity(mol);
    mol.setFlag(CDKConstants.ISAROMATIC, isAromatic);
    CDKHydrogenAdder hydrogenAdder = CDKHydrogenAdder.getInstance(mol.getBuilder());
    hydrogenAdder.addImplicitHydrogens(mol);

    return mol;
}

So I should not need to do any configuration before subgraph searching, but 
apparently I do. That explains why UIT is faster, because the 
removeHydrogensExceptSingleAndPreserveAtomID method does a lot of work (it 
copies all atoms and bonds).
Not adding hydrogens in the code above (commenting that step out) has no effect.
comparison.init(queryStructure, mol, true, false); is a lot faster than 
comparison.init(queryStructure, mol, false, false);
Maybe MDLV2000Reader does something wrong while creating a molecule that is 
fixed in removeHydrogensExceptSingleAndPreserveAtomID?

Regards,

Thomas


mol file example:

ZINC21972410    
  CDK     0105111003

 17 17  0  0  0  0  0  0  0  0999 V2000
    2.5359    5.6483   -0.1176 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5559    4.1229   -0.0002 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.2578    3.7205    1.2984 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1205    3.5933    0.0107 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1410    2.0865    0.0024 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1994    1.4944   -0.0109 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0169    1.3968    0.0097 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0021   -0.0041    0.0020 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1558   -0.6938    0.0094 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2169   -0.1003    0.0227 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.1358   -2.1698    0.0013 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3346   -2.8981    0.0147 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.2643   -4.2759    0.0123 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0172   -4.8934   -0.0033 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0967   -4.1779   -0.0212 N   0  0  0  0  0  0  0  0  0  0  0  0
    0.0758   -2.8614   -0.0198 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9420   -6.2462   -0.0053 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0 
  2  3  1  0  0  0  0 
  2  4  1  0  0  0  0 
  4  5  1  0  0  0  0 
  5  6  2  0  0  0  0 
  5  7  1  0  0  0  0 
  7  8  1  0  0  0  0 
  8  9  1  0  0  0  0 
  9 10  2  0  0  0  0 
  9 11  1  0  0  0  0 
 11 16  2  0  0  0  0 
 11 12  1  0  0  0  0 
 12 13  2  0  0  0  0 
 13 14  1  0  0  0  0 
 14 17  2  0  0  0  0 
 14 15  1  0  0  0  0 
 15 16  1  0  0  0  0 
M  END
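
To back up the claim that the file has no explicit hydrogens, here is a rough, CDK-free sketch that scans the atom block of a V2000 molfile for "H" symbols. The class name MolfileHydrogenCheck and the method explicitHydrogens are made up for illustration. The sketch assumes a well-formed file with three header lines, a counts line whose first three characters hold the atom count, and the element symbol as the fourth whitespace-separated token of each atom line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists explicit hydrogen atoms in a V2000 molfile. */
final class MolfileHydrogenCheck {

    /** Returns the 0-based atom indices whose element symbol is "H". */
    static List<Integer> explicitHydrogens(List<String> molfileLines) {
        // Line 4 (index 3) is the counts line; columns 1-3 hold the atom count.
        String counts = molfileLines.get(3);
        int atomCount = Integer.parseInt(counts.substring(0, 3).trim());

        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            // Atom line layout: x y z symbol ..., so the symbol is token 3.
            String[] tok = molfileLines.get(4 + i).trim().split("\\s+");
            if (tok.length > 3 && tok[3].equals("H")) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

For the file above, every atom line carries C, N or O as its symbol, so this check should come back with an empty list.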


> From: [email protected]
> Date: Thu, 24 Feb 2011 13:18:50 +0100
> Subject: Re: [Cdk-user] Substructure Searching, Fingerprints and cdk-1.3.7 
> Isomorphism Class
> To: [email protected]
> CC: [email protected]; [email protected]; 
> [email protected]
> 
> Hej Thomas,
> 
> On Thu, Feb 24, 2011 at 11:30 AM, Thomas Strunz <[email protected]> wrote:
> > Problem:
> > Can't filter based on H Atoms because not all P and S can be "typed"
> > correctly and hence the CDKHydrogenAdder fails and H-Count for the total
> > molecule is wrong
> 
> Good. That's something we can work on :)
> 
> See my post of this morning:
> 
> http://chem-bla-ics.blogspot.com/2011/02/adding-cdk-atom-type.html
> 
> We only need to figure out the six properties for the missing atom
> type, and the unit test needs an example structure. Preferably from
> PubChem, as I can convert that programmatically into CDK code, see:
> 
> http://chem-bla-ics.blogspot.com/2008/05/wicked-chemistry-and-unit-testing.html
> 
> Grtz,
> 
> Egon
> 
> -- 
> Dr E.L. Willighagen
> Postdoctoral Researcher
> Institutet för miljömedicin
> Karolinska Institutet
> Homepage: http://egonw.github.com/
> LinkedIn: http://se.linkedin.com/in/egonw
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
