Hi Greg,

  The testing approach looks good. I think it's better to use
fragments against a data set, as you do, than to use whole
molecules taken from the data set itself, as I've seen some
people do. I'm one of them, since I don't have a good set of
queries.

  Could you package your queries up so I can use them?

  I think it would be better to have a set of actual queries,
instead of synthetic fragments. When I last looked (two
years ago) I couldn't find any. Last year at Goslar I heard
about a substructure query data set used in someone's Master's
thesis, but I haven't tracked it down. In spring I talked with
Mike Gilson at BindingDB, and he might be able to provide actual
data from that database.

  I've been working on other things since then, but I'll see
if I can dig something up.

  About a year ago I started experimenting with an approach based
on subgraph enumeration. I wondered if some sort of brute-force
greedy algorithm would find good substructure filter patterns.
The preliminary results were surprising: my test code
needed very few bits to reduce the search space
drastically. However, that assumes the queries are drawn from
the targets themselves.
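
To make the idea concrete, here is a minimal sketch of a greedy
selection of this kind. It is not the code that produced the output
after my signature. It assumes that each step picks the
"pattern occurs at least n times" test which leaves the smallest
"largest group" of compounds sharing the same bits so far, which is
one reading of the "largest" column in that output. The function
names, the candidate set, and the toy match counts are all
hypothetical.

```python
from collections import Counter


def largest_group(prefixes):
    """Size of the largest set of compounds that share one bit prefix."""
    return max(Counter(prefixes).values())


def greedy_select(counts, num_bits):
    """Pick (pattern, threshold) bits one at a time.

    counts: one dict per compound, mapping a pattern to its match count.
    Returns a list of (pattern, threshold, largest_group_size) tuples.
    """
    # Hypothetical candidate set: every "pattern occurs at least n times"
    # test that is true for at least one compound.
    candidates = sorted({(pattern, n)
                         for compound in counts
                         for pattern, count in compound.items()
                         for n in range(1, count + 1)})
    prefixes = [()] * len(counts)  # bits assigned to each compound so far
    chosen, used = [], set()
    for _ in range(num_bits):
        best = None
        for pattern, n in candidates:
            if (pattern, n) in used:
                continue
            trial = [prefix + (compound.get(pattern, 0) >= n,)
                     for prefix, compound in zip(prefixes, counts)]
            size = largest_group(trial)
            # Strict "<" keeps the first candidate on ties.
            if best is None or size < best[0]:
                best = (size, pattern, n, trial)
        if best is None:
            break
        size, pattern, n, prefixes = best
        used.add((pattern, n))
        chosen.append((pattern, n, size))
    return chosen


# Toy match counts for four made-up compounds (not real data).
toy = [{"C": 6, "O": 2}, {"C": 6, "N": 1}, {"C": 2, "O": 1}, {"C": 1}]
print(greedy_select(toy, 2))
```

Every round rescans every candidate against every compound, so a
brute-force search like this gets slow and memory hungry quickly as
the compound and pattern counts grow.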


  Your email the other day got me excited about working on the
topic again, so I reassembled my code, finished a few incomplete
parts, and fixed some bugs. Tomorrow I plan to do validation
and timings.


  My greedy algorithm is rather slow and a memory hog so I'm
only able to analyze the compounds in PubChem records up to
000175000. (That's the first 7 Compound files.) The first
set of output is after my signature. I'm letting a more
sensitive, and slower, version run overnight.

  Care to try out the patterns?


                                Andrew
                                da...@dalkescientific.com

Here are the SMARTS-based substructure filtering patterns.

Column 1 is the bit number.
Column 2 is the number of times the pattern exists uniquely
  (obviously the unique counts of C == the non-unique counts of C).
Column 4 is the SMARTS pattern.
Column 6 is the size of the largest group of compounds which all have the
  same fingerprint prefix up to this point.
(Columns 3 and 5 are the literal words "times" and "largest".)


I started off with 151,114 compounds and 19,650 patterns, so the output for bit 
0 shows that 76,320 of those compounds have 6 carbons and 74,794 do not (or 
vice versa).

0 6 times C largest 76320
1 1 times Ccc largest 42679
2 2 times O largest 26880
3 1 times N largest 20104
4 6 times c largest 14837
5 1 times CO largest 11893
6 1 times CN largest 9855
7 1 times cn largest 9108
8 1 times CCC(C)C largest 7736
9 1 times CCC largest 5799
10 1 times O largest 4603
11 1 times C=CC largest 4337
12 1 times ccO largest 4051
13 2 times C largest 3097
14 9 times CC largest 2561
15 4 times O largest 2561
16 1 times Cl largest 1895
17 1 times CC=O largest 1895
18 1 times ccN largest 1708
19 1 times C largest 1295
20 2 times CCOC largest 1295
21 4 times C largest 1295
22 2 times cccc(c)c largest 1178
23 1 times CCCCC largest 1178
24 1 times F largest 848
25 1 times C1CCCCC1 largest 819
26 1 times cc largest 665
27 8 times C largest 665
28 9 times CCCCCC largest 665
29 1 times CC largest 665
30 3 times ccnc largest 608
31 2 times c1ccccc1 largest 608
32 1 times S largest 583
33 1 times CC1CCCC1 largest 583
34 1 times Br largest 583
35 3 times O largest 405
36 7 times O largest 405
37 1 times CCO largest 405
38 3 times Cc(c)cc largest 405
39 1 times cccco largest 323
40 1 times CS largest 323
41 1 times cccncc largest 323
42 3 times CCC largest 323
43 3 times Cl largest 323
44 1 times OP largest 323
45 1 times I largest 320
46 5 times c1ccccc1 largest 295
47 3 times F largest 277
48 1 times O=SO largest 260
49 3 times C largest 256
50 5 times CCCCCCC largest 256
51 5 times CCCC(C)C largest 256
52 1 times cc(c(cCl)Cl)Cl largest 256
53 2 times Cl largest 256
54 1 times NO largest 248
55 1 times O[Si] largest 247
56 1 times [Na] largest 247
57 1 times [Si] largest 244
58 9 times CCCCCCC largest 244
59 1 times CP largest 238
60 1 times [K] largest 238
61 1 times C[Si] largest 233
62 1 times C[Sn] largest 228
63 1 times BO largest 228
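
For anyone who wants to load the listing above into a program, here
is a small parsing sketch. It assumes each line has exactly the six
whitespace-separated fields shown above. The `FilterBit` and
`parse_line` names are made up for this example and are not part of
any toolkit.

```python
from typing import NamedTuple


class FilterBit(NamedTuple):
    bit: int      # column 1: bit number
    count: int    # column 2: number of unique matches
    smarts: str   # column 4: SMARTS pattern
    largest: int  # column 6: size of the largest group so far


def parse_line(line):
    """Parse one line such as '0 6 times C largest 76320'."""
    fields = line.split()
    if len(fields) != 6 or fields[2] != "times" or fields[4] != "largest":
        raise ValueError("unexpected line format: %r" % (line,))
    return FilterBit(int(fields[0]), int(fields[1]), fields[3], int(fields[5]))


sample = """\
0 6 times C largest 76320
1 1 times Ccc largest 42679
52 1 times cc(c(cCl)Cl)Cl largest 256
"""
bits = [parse_line(line) for line in sample.splitlines()]
for b in bits:
    print(b.bit, b.count, b.smarts, b.largest)
```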

Were I to make a real screening fingerprint out of this, I would include all of 
the rejected single-atom patterns (the '[U]' and '[Dy]' and such) in one final 
pattern. Otherwise the idea that a '[U]' query would trigger an entire database 
search would bother me.
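
One way to build such a combined pattern is to join the single-atom
patterns into one SMARTS atom list, where a comma means "or". The
sketch below is only an illustration: the helper name is made up,
and it assumes each input is a plain bracketed element symbol such as
'[U]'. The full list of rejected elements is not given here, so only
the two named above are used.

```python
def combine_single_atom_patterns(patterns):
    """Merge patterns like '[U]' and '[Dy]' into one atom list '[U,Dy]'."""
    symbols = []
    for pattern in patterns:
        if not (pattern.startswith("[") and pattern.endswith("]")):
            raise ValueError("not a bracketed single-atom pattern: %r" % (pattern,))
        symbols.append(pattern[1:-1])
    if not symbols:
        raise ValueError("need at least one pattern")
    # Inside a SMARTS bracket atom, ',' is the low-precedence "or" operator.
    return "[" + ",".join(symbols) + "]"


print(combine_single_atom_patterns(["[U]", "[Dy]"]))
```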



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss