Hi, I have encountered similar problems in the past.

a) The present CDK hashed fingerprint does not discriminate between open and 
closed ring systems, but it is very fast.
b) The extended fingerprint does the job, but the code needs some tweaking.

Apart from this, all fingerprint codes suffer from bit clashes (a universal 
truth...), hence it is hard to get a one-to-one mapping between graph 
isomorphism and fingerprints.
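
To make the bit-clash point concrete, here is a toy sketch in plain Java. It is 
not the CDK implementation: the class BitClashDemo, the methods bitFor and 
fingerprint, and the feature strings "Aa" and "BB" are all made up for 
illustration (the two strings are chosen only because they are known to share 
the same String.hashCode() value). It shows two different features landing on 
the same bit of a fixed-size fingerprint.

```java
import java.util.BitSet;

public class BitClashDemo {

    /** Maps a feature string to a bit position in a fingerprint of the given size (toy hashing). */
    static int bitFor(String feature, int size) {
        return Math.floorMod(feature.hashCode(), size);
    }

    /** Builds a toy hashed fingerprint by setting one bit per feature string. */
    static BitSet fingerprint(int size, String... features) {
        BitSet bits = new BitSet(size);
        for (String feature : features) {
            bits.set(bitFor(feature, size));
        }
        return bits;
    }
}
```

Since fingerprint(1024, "Aa") and fingerprint(1024, "BB") are equal BitSets, 
the fingerprint alone cannot tell the two inputs apart, which is why a 
fingerprint match is only a hint and never a proof of a structural match.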

The best strategy for screening is to use fingerprints (generic in nature, not 
pharmacophore-based) to generate an ensemble of potential hits, and then run 
graph isomorphism on this ensemble to eliminate false positives.
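
The screen-then-verify strategy can be sketched roughly as follows. This is a 
minimal, library-free illustration under stated assumptions, not CDK code: the 
class ScreenThenVerify and its methods are hypothetical, toyFingerprint is a 
stand-in for a real fingerprinter, and the exact substructure test is passed 
in as a predicate (with CDK one would plug in a fingerprinter and an 
isomorphism tester here).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.BiPredicate;

public class ScreenThenVerify {

    /** Screen: true if every bit set in the query is also set in the candidate. */
    static boolean passesScreen(BitSet queryFp, BitSet candidateFp) {
        BitSet missing = (BitSet) queryFp.clone();
        missing.andNot(candidateFp); // bits the query needs but the candidate lacks
        return missing.isEmpty();
    }

    /** Toy stand-in for a fingerprinter: one bit per character (assumption, not chemistry). */
    static BitSet toyFingerprint(String s) {
        BitSet bits = new BitSet(64);
        for (char c : s.toCharArray()) {
            bits.set(c % 64);
        }
        return bits;
    }

    /**
     * Runs the cheap fingerprint screen first, then the expensive exact test
     * (candidate, query) only on the candidates that survive the screen.
     */
    static <M> List<M> search(M query, BitSet queryFp, List<M> molecules,
                              List<BitSet> fingerprints, BiPredicate<M, M> containsQuery) {
        List<M> hits = new ArrayList<>();
        for (int i = 0; i < molecules.size(); i++) {
            if (!passesScreen(queryFp, fingerprints.get(i))) {
                continue; // cannot contain the query, skip the exact test
            }
            if (containsQuery.test(molecules.get(i), query)) {
                hits.add(molecules.get(i)); // confirmed hit, false positives are dropped
            }
        }
        return hits;
    }
}
```

With plain strings standing in for molecules and String.contains standing in 
for the isomorphism test, a candidate such as "OCC" passes the screen for the 
query "CO" (it has the same bits) but is rejected by the exact test, which is 
the false-positive elimination described above. The screen must never reject 
a true hit; it may only let false positives through.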

I have optimised the CDK hashed fingerprint: it is fast, minimises bit 
clashes and discriminates between ring systems if asked to do so.

Here is the code:

https://github.com/asad/CDKHashFingerPrint/blob/master/src/fingerprints/HashedFingerprinter.java
 


All you need are the following steps:

a) global

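// Declared static because the static method in b) uses it.
static fingerprints.interfaces.IFingerprinter fingerprint1 =
        new fingerprints.HashedFingerprinter(1024);
// Call this once during initialisation (e.g. in a static initializer block).
fingerprint1.setRespectRingMatches(true);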

b) function

private static BitSet getHashedFingerprint(IAtomContainer ac)
        throws CDKException {
    return fingerprint1.getFingerprint(ac);
}

If you are interested in the benchmark code, you can find it here:

https://github.com/asad/CDKHashFingerPrint

Hope this helps.


Asad

On 19 Dec 2011, at 10:40, [email protected] wrote:

> 
> Today's Topics:
> 
>   1. SDF Problem (lochana menikarachchi)
>   2. Re: Correctness of Fingerprinters and
>      UniversalisomorphismTester (Joos Kiener)
>   3. Re: Correctness of Fingerprinters and
>      UniversalisomorphismTester (Egon Willighagen)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Sun, 18 Dec 2011 09:17:35 -0800 (PST)
> From: lochana menikarachchi <[email protected]>
> Subject: [Cdk-user] SDF Problem
> To: "[email protected]" <[email protected]>
> Message-ID:
>       <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
> 
>> Can you create a MDL V2000 molfile with FC(Cl)BrI as compound, make
>> sure it has the stereo field, *and* let me know if that file
>> represents the R or S species? Then I will write a patch to add this
>> functionality.
> 
> The attached compound CID_79058.sdf is the S isomer of FC(Cl)BrH, downloaded 
> from PubChem. However, it is not just the stereo information that is missing. 
> Look at the second compound (73393) and see how CDK writes that compound: both 
> columns 6 (charge) and 7 (stereo parity) get lost. Some programs rely on the 
> information in these columns, and it would be nice to have SDF output in the 
> same format as OEChem and Marvin. Also, CDK writes CHG cards differently 
> (check 73393).
> 
> Thanks.
> 
> Lochana
> -------------- next part --------------
> An HTML attachment was scrubbed...
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: CID_73393.sdf
> Type: application/octet-stream
> Size: 6227 bytes
> Desc: not available
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: CID_79058.sdf
> Type: application/octet-stream
> Size: 2017 bytes
> Desc: not available
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 19 Dec 2011 11:10:20 +0100
> From: Joos Kiener <[email protected]>
> Subject: Re: [Cdk-user] Correctness of Fingerprinters and
>       UniversalisomorphismTester
> To: [email protected]
> Message-ID:
>       <cahjbz71odgxa3cwydkymsshr-dv_9qfwg2seq2b7mt19ge5...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> Hi all,
> 
> any comments on this?
> 
> 2011/12/8 Joos Kiener <[email protected]>
> 
>> Hi all,
>> 
>> I have a question regarding how it is verified that the Fingerprinters
>> actually work correctly, as well as the UniversalIsomorphismTester?
>> 
>> The question is related to the CDK-based project I'm working on, which I
>> will "officially release" once I believe it is usable enough.
>> 
>> I use UIT for subgraph matching and the ExtendedFingerprinter. I had the
>> feeling that the fingerprint wasn't especially great, at least for the
>> dataset used (part of Subset 13 of ZINC), and hence I wanted to try out the
>> PubchemFingerprinter, which I did, but now I am getting a different number
>> of search hits than before. See the tables below. I'm now wondering whether
>> it is a bug on my part or in the fingerprints and/or UIT. How can I
>> determine the actually correct result? Especially since the reference also
>> disagrees with UIT.
>> 
>> PubchemFingerprinter:
>> 
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                     8599     344
>> O(C)C(C)C(C)C(C)C                    938      28
>> CCCCCC(C)CC                         9227    1547
>> N(C)(C)CC(C)C                      15861    8893
>> O(CC)C(N(C)C)C                      1365      83
>> CC(C)C(C)C(C(C)C)C(C)C              8599       0
>> 
>> ExtendedFingerprinter:
>> 
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                    22488     429
>> O(C)C(C)C(C)C(C)C                   9398      77
>> CCCCCC(C)CC                         3955    1603
>> N(C)(C)CC(C)C                      88301   10917
>> O(CC)C(N(C)C)C                      1588      74
>> CC(C)C(C)C(C(C)C)C(C)C             22488       0
>> 
>> No Screening, just UIT:
>> 
>> SMILES                        Hits
>> CCC(C)C(C)C(C)C                436
>> O(C)C(C)C(C)C(C)C               77
>> CCCCCC(C)CC                   2171
>> N(C)(C)CC(C)C                11412
>> O(CC)C(N(C)C)C                 139
>> CC(C)C(C)C(C(C)C)C(C)C           0
>> 
>> As a Reference the same Searches were done in ChemFinder over the same
>> Data Set
>> 
>> SMILES                    Hits Found in ChemFinder
>> CCC(C)C(C)C(C)C                                427
>> O(C)C(C)C(C)C(C)C                               77
>> CCCCCC(C)CC                                   1825
>> N(C)(C)CC(C)C                                11412
>> O(CC)C(N(C)C)C                                 109
>> CC(C)C(C)C(C(C)C)C(C)C                           0
>> 
>> Best Regards,
>> 
>> Joos
>> 
> -------------- next part --------------
> An HTML attachment was scrubbed...
> 
> ------------------------------
> 
> Message: 3
> Date: Mon, 19 Dec 2011 11:40:10 +0100
> From: Egon Willighagen <[email protected]>
> Subject: Re: [Cdk-user] Correctness of Fingerprinters and
>       UniversalisomorphismTester
> To: Joos Kiener <[email protected]>
> Cc: [email protected]
> Message-ID:
>       <campqvy-pzwou1porycdpxppmh0q2gkzfa4xu0m2mbymsuaq...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi Joos,
> 
> a short, quick reply... I will not have time to look in detail into
> the issue in the next two weeks...
> 
> On Thu, Dec 8, 2011 at 12:47 PM, Joos Kiener <[email protected]> wrote:
>> The Question is related to the cdk based project I'm working on which I will
>> "officially release" once I believe it is usable enough.
> 
> That would be the 1.4 series.
> 
>> I use UIT for subgraph matching and the ExtendedFingerprinter. I had the
>> feeling that the fingerprint wasn't especially great, at least for the
>> dataset used (part of Subset 13 of ZINC), and hence I wanted to try out the
>> PubchemFingerprinter, which I did, but now I am getting a different number
>> of search hits than before. See the tables below. I'm now wondering whether
>> it is a bug on my part or in the fingerprints and/or UIT. How can I
>> determine the actually correct result? Especially since the reference also
>> disagrees with UIT.
>> 
>> PubchemFingerprinter:
>> 
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                     8599     344
>> 
>> ExtendedFingerprinter
>> 
>> SMILES                    Screening Hits    Hits
>> CCC(C)C(C)C(C)C                    22488     429
>> 
>> No Screening, just UIT:
>> 
>> SMILES                        Hits
>> CCC(C)C(C)C(C)C                436
>> 
>> As a Reference the same Searches were done in ChemFinder over the same Data
>> Set
>> 
>> SMILES                    Hits Found in ChemFinder
>> CCC(C)C(C)C(C)C                                427
> 
> So, one would expect to find 436 with the CDK for each of the three
> approaches. The difference with 427 in ChemFinder can have many
> reasons (preprocessing, their substructure matching, ...) and I am not
> eager to hypothesize on why that is different.
> 
> It is indeed worrying to see that apparently the PubchemFingerprinter
> and ExtendedFingerprinter miss out on true positives. Can you
> identify those structures? Maybe start with the seven that the
> ExtendedFingerprinter doesn't find. Then we can start debugging why
> those are not found...
> 
> Egon
> 
> -- 
> Dr E.L. Willighagen
> Postdoctoral Researcher
> Institutet för miljömedicin
> Karolinska Institutet (http://ki.se/imm)
> Homepage: http://egonw.github.com/
> LinkedIn: http://se.linkedin.com/in/egonw
> Blog: http://chem-bla-ics.blogspot.com/
> PubList: http://www.citeulike.org/user/egonw/tag/papers
> 
> 
> 
> ------------------------------
> 
> ------------------------------------------------------------------------------
> Learn Windows Azure Live!  Tuesday, Dec 13, 2011
> Microsoft is holding a special Learn Windows Azure training event for 
> developers. It will provide a great way to learn Windows Azure and what it 
> provides. You can attend the event by watching it streamed LIVE online.  
> Learn more at http://p.sf.net/sfu/ms-windowsazure
> 
> ------------------------------
> 
> _______________________________________________
> Cdk-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/cdk-user
> 
> 
> End of Cdk-user Digest, Vol 67, Issue 8
> ***************************************
