I trained my algorithm on the pubchem pieces as queries against ZINC, and got the following bits:
0 2 times O largest 55458 1 2 times Ccc largest 29602 2 1 times CCN largest 16829 3 1 times cnc largest 11439 4 1 times cN largest 8998 5 1 times C=O largest 7358 6 1 times CCC largest 6250 7 1 times S largest 4760 8 1 times c1ccccc1 largest 4524 9 2 times N largest 2854 10 1 times C=C largest 2162 11 1 times nn largest 1840 12 2 times CO largest 1248 13 1 times Ccn largest 964 14 1 times CCCCC largest 857 15 1 times cc(c)c largest 653 16 3 times O largest 653 17 1 times O largest 466 18 2 times CNC largest 464 19 1 times s largest 457 20 1 times CC(C)C largest 335 21 1 times o largest 334 22 1 times cncnc largest 334 23 1 times C=N largest 321 24 2 times CC=O largest 238 25 4 times Ccc largest 238 26 1 times Cl largest 230 27 4 times O largest 149 28 2 times ccncc largest 149 29 6 times CCCCCC largest 76 30 2 times c1ccccc1 largest 76 31 1 times F largest 75 32 3 times CCOC largest 44 33 3 times N largest 44 34 1 times c(cn)n largest 44 35 1 times N largest 41 36 9 times C largest 41 37 1 times CC=C(C)C largest 33 38 1 times c1ccncc1 largest 26 39 1 times CC(C)N largest 26 40 1 times CC largest 26 41 4 times CCC(C)O largest 25 42 2 times ccc(cc)n largest 21 43 6 times C largest 21 44 1 times C1CCCC1 largest 18 45 1 times C largest 18 46 5 times O largest 18 47 2 times Ccn largest 14 48 1 times CNCN largest 13 49 3 times cncn largest 13 50 1 times CSC largest 13 51 3 times CC=O largest 11 52 1 times CCNCCCN largest 11 53 1 times CccC largest 11 54 3 times ccccc(c)c largest 10 [20:41:25] INFO: FINISHED 50001 (41150823 total, 2001442 searched, 893822 found) in 90.68 [20:41:25] INFO: screenout: 0.05, accuracy: 0.45 Since I was curious, I did the evaluation at different bit sizes: 1 bit : screenout: 0.77, accuracy: 0.03 in 363.03 seconds (988264 found) 8 bits: screenout: 0.21, accuracy: 0.11 in 166.48 seconds (947411 found) 9 bits: screenout: 0.18, accuracy: 0.13 in 151.55 seconds (947411 found) 16 bits: screenout: 0.11, accuracy: 0.21 in 123.12 seconds (941417 found) 17 bits: screenout: 0.11, accuracy: 0.21 in 122.03 seconds (941417 found) 18 bits: screenout: 0.10, accuracy: 0.22 in 119.35 seconds (939757 found) [& one O match] 19 bits: screenout: 0.10, accuracy: 0.23 in 118.23 seconds (938363 found) [& 2 unique CNC matches] 20 bits: screenout: 0.08, accuracy: 0.30 in 108.38 seconds (933944 found) [& one s match] 32 bits: screenout: 0.06, accuracy: 0.37 in 97.38 seconds (933098 found) 40 bits: screenout: 0.05, accuracy: 0.42 in 97.26 seconds (915296 found) 48 bits: screenout: 0.05, accuracy: 0.44 in 91.79 seconds (896178 found) 55 bits: screenout: 0.05, accuracy: 0.45 in 90.68 seconds (893822 found) I'm very concerned about the differences in the number found at different bit sizes. Earlier when I saw a difference between what your machine reported and what mine reported, I thought it was a difference between our versions of RDKit, but I see that as I add bit patterns, I find fewer hits. That means that the fingerprint screen isn't working as I thought it would. I don't see anything wrong in my pattern definitions. They should work perfectly as substructure filters and it should always report the same number of hits found. I did more tests around 16-20 bits to isolate a bit which triggers the problem. You can see that they are tests for things like "has an aliphatic oxygen" and "has an aromatic sulfur", which shouldn't cause any problems. Greg, can you enlighten me as to why the number found changes as I add more bits? For comparison, I also generated the 871 substructure keys I've developed, which are cross-toolkit SMARTS patterns derived closely from CACTVS/PubChem substructure keys. After 10 minutes and 37 seconds of SMARTS matching to generate the structure fingerprints, it started searching. PubChem/CACTVS fingerprint screen results: [22:16:21] INFO: FINISHED 50001 (41150823 total, 573179 searched, 327042 found) in 98.06 [22:16:21] INFO: screenout: 0.01, accuracy: 0.57 Hmmmm. Only 327042 found? Leaving all that aside, it's pretty clear that my method doesn't do well with negatives, that is, with queries containing substructures that aren't in the targets. I threw away all of the patterns which I knew were in the training set and not in the targets, when I should be using it somehow. Perhaps my method is useful to enrich an existing fingerprint? Andrew da...@dalkescientific.com ------------------------------------------------------------------------------ Learn Windows Azure Live! Tuesday, Dec 13, 2011 Microsoft is holding a special Learn Windows Azure training event for developers. It will provide a great way to learn Windows Azure and what it provides. You can attend the event by watching it streamed LIVE online. Learn more at http://p.sf.net/sfu/ms-windowsazure _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss