I trained my algorithm using the PubChem pieces as queries against ZINC, and 
got the following bits:

0 2 times O largest 55458
1 2 times Ccc largest 29602
2 1 times CCN largest 16829
3 1 times cnc largest 11439
4 1 times cN largest 8998
5 1 times C=O largest 7358
6 1 times CCC largest 6250
7 1 times S largest 4760
8 1 times c1ccccc1 largest 4524
9 2 times N largest 2854
10 1 times C=C largest 2162
11 1 times nn largest 1840
12 2 times CO largest 1248
13 1 times Ccn largest 964
14 1 times CCCCC largest 857
15 1 times cc(c)c largest 653
16 3 times O largest 653
17 1 times O largest 466
18 2 times CNC largest 464
19 1 times s largest 457
20 1 times CC(C)C largest 335
21 1 times o largest 334
22 1 times cncnc largest 334
23 1 times C=N largest 321
24 2 times CC=O largest 238
25 4 times Ccc largest 238
26 1 times Cl largest 230
27 4 times O largest 149
28 2 times ccncc largest 149
29 6 times CCCCCC largest 76
30 2 times c1ccccc1 largest 76
31 1 times F largest 75
32 3 times CCOC largest 44
33 3 times N largest 44
34 1 times c(cn)n largest 44
35 1 times N largest 41
36 9 times C largest 41
37 1 times CC=C(C)C largest 33
38 1 times c1ccncc1 largest 26
39 1 times CC(C)N largest 26
40 1 times CC largest 26
41 4 times CCC(C)O largest 25
42 2 times ccc(cc)n largest 21
43 6 times C largest 21
44 1 times C1CCCC1 largest 18
45 1 times C largest 18
46 5 times O largest 18
47 2 times Ccn largest 14
48 1 times CNCN largest 13
49 3 times cncn largest 13
50 1 times CSC largest 13
51 3 times CC=O largest 11
52 1 times CCNCCCN largest 11
53 1 times CccC largest 11
54 3 times ccccc(c)c largest 10


[20:41:25] INFO: FINISHED 50001 (41150823 total, 2001442 searched, 893822 found) in 90.68
[20:41:25] INFO:   screenout: 0.05, accuracy: 0.45
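
For reference, the two summary ratios in the log above appear to be consistent 
with screenout = searched / total and accuracy = found / searched. That reading 
is inferred from the reported numbers, not taken from the program that wrote 
the log, and the variable names below are only illustrative:

```python
# Counts copied from the 55-bit log line above.
total = 41150823     # assumed meaning: query/target pairs considered
searched = 2001442   # assumed meaning: pairs that passed the fingerprint screen
found = 893822       # pairs reported as real substructure hits

# Assumed definitions, inferred from the reported values.
screenout = searched / total
accuracy = found / searched

print(f"screenout: {screenout:.2f}, accuracy: {accuracy:.2f}")
```

Under those assumptions, a lower screenout means fewer pairs reach the 
substructure test, and a higher accuracy means more of the tested pairs are 
real hits.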

Since I was curious, I did the evaluation at different bit sizes:

1 bit : screenout: 0.77, accuracy: 0.03  in 363.03 seconds (988264 found)

8 bits: screenout: 0.21, accuracy: 0.11  in 166.48 seconds (947411 found)
9 bits: screenout: 0.18, accuracy: 0.13  in 151.55 seconds (947411 found)

16 bits: screenout: 0.11, accuracy: 0.21  in 123.12 seconds (941417 found)
17 bits: screenout: 0.11, accuracy: 0.21  in 122.03 seconds (941417 found)
18 bits: screenout: 0.10, accuracy: 0.22  in 119.35 seconds (939757 found) [& one O match]
19 bits: screenout: 0.10, accuracy: 0.23  in 118.23 seconds (938363 found) [& 2 unique CNC matches]
20 bits: screenout: 0.08, accuracy: 0.30  in 108.38 seconds (933944 found) [& one s match]

32 bits: screenout: 0.06, accuracy: 0.37  in  97.38 seconds (933098 found)
40 bits: screenout: 0.05, accuracy: 0.42  in  97.26 seconds (915296 found)
48 bits: screenout: 0.05, accuracy: 0.44  in  91.79 seconds (896178 found)
55 bits: screenout: 0.05, accuracy: 0.45  in  90.68 seconds (893822 found)


I'm very concerned about the differences in the number found at different bit 
sizes. Earlier when I saw a difference between what your machine reported and 
what mine reported, I thought it was a difference between our versions of 
RDKit, but I see that as I add bit patterns, I find fewer hits. That means that 
the fingerprint screen isn't working as I thought it would.

I don't see anything wrong in my pattern definitions. They should work 
perfectly as substructure filters, so the search should always report the same 
number of hits no matter how many bits are used.
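
To spell out the behavior I expect, here is a minimal sketch in plain Python. 
It is not my actual code and it does not use RDKit: molecules are reduced to 
feature counts, the substructure test is replaced by multiset containment, and 
the bit list is made up in the style of "2 times O". Under those assumptions, 
adding bits can only reduce the number searched, never the number found:

```python
from collections import Counter

# Hypothetical screen bits: (feature, minimum count), like "2 times O".
BITS = [("O", 1), ("O", 2), ("N", 1), ("S", 1), ("N", 2)]

def fingerprint(features, bits):
    """Set bit i when the feature occurs at least min_count times."""
    fp = 0
    for i, (feature, min_count) in enumerate(bits):
        if features[feature] >= min_count:
            fp |= 1 << i
    return fp

def is_substructure(query, target):
    """Toy stand-in for a real substructure test: multiset containment."""
    return all(target[f] >= n for f, n in query.items())

def screened_search(queries, targets, bits):
    """Return (searched, found) over all query/target pairs."""
    target_fps = [fingerprint(t, bits) for t in targets]
    searched = found = 0
    for q in queries:
        q_fp = fingerprint(q, bits)
        for t, t_fp in zip(targets, target_fps):
            if q_fp & ~t_fp:   # the query sets a bit the target lacks
                continue       # screened out, no substructure test needed
            searched += 1
            found += is_substructure(q, t)
    return searched, found

# Made-up molecules, written as strings of one-letter features.
queries = [Counter("OO"), Counter("ON"), Counter("S"), Counter("NN")]
targets = [Counter("OON"), Counter("ONS"), Counter("NNO"), Counter("CC")]

# Repeat the search using only the first n bits, for every n.
results = [screened_search(queries, targets, BITS[:n])
           for n in range(len(BITS) + 1)]
for n, (searched, found) in enumerate(results):
    print(f"{n} bits: searched {searched}, found {found}")
```

The invariant holds here because containment guarantees that every bit set 
for the query is also set for any target it matches. If the found count drops 
as bits are added, then for some query/target pair that guarantee is being 
violated.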

I did more tests around 16-20 bits to isolate a bit which triggers the problem. 
You can see that the bits added in that range are tests for things like "has an 
aliphatic oxygen" and "has an aromatic sulfur", which shouldn't cause any 
problems.

Greg, can you enlighten me as to why the number found changes as I add more 
bits?


For comparison, I also generated the 871 substructure keys I've developed, 
which are cross-toolkit SMARTS patterns derived closely from CACTVS/PubChem 
substructure keys. After 10 minutes and 37 seconds of SMARTS matching to 
generate the structure fingerprints, it started searching.

PubChem/CACTVS fingerprint screen results:

[22:16:21] INFO: FINISHED 50001 (41150823 total, 573179 searched, 327042 found) in 98.06
[22:16:21] INFO:   screenout: 0.01, accuracy: 0.57


Hmmmm. Only 327042 found?


Leaving all that aside, it's pretty clear that my method doesn't do well with 
negatives, that is, with queries containing substructures that aren't in the 
targets. I threw away all of the patterns which I knew were in the training set 
and not in the targets, when I should be using them somehow.

Perhaps my method is useful to enrich an existing fingerprint?



                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
