Hi Andrew,
On Sun, Dec 11, 2011 at 3:40 AM, Andrew Dalke <da...@dalkescientific.com> wrote:
>
> The testing approach looks good. I think it's better to use
> fragments against a data set, as you do, rather than how some
> I've seen who use whole molecules taken from the data set. I'm
> one of them, since I don't have a good set of queries.
>
> Could you package your queries up so I can use them?
The "pubchem pieces" queries are in svn:
http://rdkit.svn.sourceforge.net/viewvc/rdkit/trunk/Regress/Data/queries.txt.gz?revision=1137
I'm sure I can dig out the script I used to produce that set if you
want to make more.
> About a year ago I started experimenting with work based
> on subgraph enumeration. I wondered if some sort of brute-force
> greedy algorithm would find good substructure filter patterns.
> The preliminary results were very surprising - my test code
> needed surprisingly few bits to reduce the search space
> drastically. However, that's assuming the queries are also
> the targets.
>
>
> Your email the other day got me excited on working on the
> topic again, so I reassembled my code, finished a few incomplete
> parts, and fixed some bugs. Tomorrow I plan to do validation
> and timings.
>
>
> My greedy algorithm is rather slow and a memory hog so I'm
> only able to analyze the compounds in PubChem records up to
> 000175000. (That's the first 7 Compound files.) The first
> set of output is after my signature. I'm letting a more
> sensitive, and slower, version run overnight.
>
> Care to try out the patterns?
Assuming I implemented your FPs correctly (code attached), this is what I get:
#--------------------
zinc frags:
[05:26:04] INFO: FINISHED 50001 (25000500 total, 839780 searched, 7480
found) in 42.94
[05:26:04] INFO: screenout: 0.03, accuracy: 0.01
zinc leads:
[05:27:56] INFO: FINISHED 50001 (25000500 total, 280606 searched, 602
found) in 35.98
[05:27:56] INFO: screenout: 0.01, accuracy: 0.00
pubchem pieces:
[05:30:38] INFO: FINISHED 50001 (41150823 total, 2948920 searched,
964883 found) in 80.68
[05:30:38] INFO: screenout: 0.07, accuracy: 0.33
#--------------------
What would be interesting, but I haven't done yet, would be to look at
screenout accuracy as a function of size for the pubchem pieces.
-greg
"""
Results:
zinc frags:
[05:26:04] INFO: FINISHED 50001 (25000500 total, 839780 searched, 7480 found) in 42.94
[05:26:04] INFO: screenout: 0.03, accuracy: 0.01
zinc leads:
[05:27:56] INFO: FINISHED 50001 (25000500 total, 280606 searched, 602 found) in 35.98
[05:27:56] INFO: screenout: 0.01, accuracy: 0.00
pubchem pieces:
[05:30:38] INFO: FINISHED 50001 (41150823 total, 2948920 searched, 964883 found) in 80.68
[05:30:38] INFO: screenout: 0.07, accuracy: 0.33
"""
from rdkit import Chem,DataStructs
import cPickle,time,gzip
from rdkit.RDLogger import logger
logger=logger()
qs="""0 6 times C largest 76320
1 1 times Ccc largest 42679
2 2 times O largest 26880
3 1 times N largest 20104
4 6 times c largest 14837
5 1 times CO largest 11893
6 1 times CN largest 9855
7 1 times cn largest 9108
8 1 times CCC(C)C largest 7736
9 1 times CCC largest 5799
10 1 times O largest 4603
11 1 times C=CC largest 4337
12 1 times ccO largest 4051
13 2 times C largest 3097
14 9 times CC largest 2561
15 4 times O largest 2561
16 1 times Cl largest 1895
17 1 times CC=O largest 1895
18 1 times ccN largest 1708
19 1 times C largest 1295
20 2 times CCOC largest 1295
21 4 times C largest 1295
22 2 times cccc(c)c largest 1178
23 1 times CCCCC largest 1178
24 1 times F largest 848
25 1 times C1CCCCC1 largest 819
26 1 times cc largest 665
27 8 times C largest 665
28 9 times CCCCCC largest 665
29 1 times CC largest 665
30 3 times ccnc largest 608
31 2 times c1ccccc1 largest 608
32 1 times S largest 583
33 1 times CC1CCCC1 largest 583
34 1 times Br largest 583
35 3 times O largest 405
36 7 times O largest 405
37 1 times CCO largest 405
38 3 times Cc(c)cc largest 405
39 1 times cccco largest 323
40 1 times CS largest 323
41 1 times cccncc largest 323
42 3 times CCC largest 323
43 3 times Cl largest 323
44 1 times OP largest 323
45 1 times I largest 320
46 5 times c1ccccc1 largest 295
47 3 times F largest 277
48 1 times O=SO largest 260
49 3 times C largest 256
50 5 times CCCCCCC largest 256
51 5 times CCCC(C)C largest 256
52 1 times cc(c(cCl)Cl)Cl largest 256
53 2 times Cl largest 256
54 1 times NO largest 248
55 1 times O[Si] largest 247
56 1 times [Na] largest 247
57 1 times [Si] largest 244
58 9 times CCCCCCC largest 244
59 1 times CP largest 238
60 1 times [K] largest 238
61 1 times C[Si] largest 233
62 1 times C[Sn] largest 228
63 1 times BO largest 228"""
ssPatts=None
def GetSubstructFP(m,verbose=False):
global ssPatts
if ssPatts is None:
ssPatts=[]
for q in qs.split('\n'):
q = q.split(' ')
count = int(q[1])
patt = Chem.MolFromSmarts(q[3])
ssPatts.append((patt,count))
sz = len(ssPatts)
res = DataStructs.ExplicitBitVect(sz)
for i,(p,count) in enumerate(ssPatts):
matches = m.GetSubstructMatches(p,uniquify=True)
if len(matches)>=count:
res.SetBit(i)
return res
if __name__ == '__main__':
import random
logger.info('reading queries')
if 1:
pattData = gzip.open('fragqueries.txt.gz','rb').readlines()
patts = [Chem.MolFromSmiles(x) for x in pattData]
else:
pattData = [x.strip().split('\t')[0] for x in file('zinc.leads.diverse.smi')]
#pattData = [x.strip().split('\t')[0] for x in file('zinc.frags.diverse.smi')]
random.seed(23)
random.shuffle(pattData)
patts = [Chem.MolFromSmiles(x) for x in pattData[:520]]
patts = [x for x in patts if x is not None][:500]
logger.info('reading mols')
mols = []
inf=file('zinc.100k.pkl','rb')
while 1:
try:
id,m = cPickle.load(inf)
except EOFError:
break
else:
mols.append(m)
if len(mols)>50000: break
logger.info('generating pattern fps:')
pattFps = [GetSubstructFP(x) for x in patts]
logger.info('generating molecules fps:')
molFps = [GetSubstructFP(x) for x in mols]
pattMisses=[0]*len(pattFps)
pattHits=[0]*len(pattFps)
logger.info('scanning')
t1 = time.time()
nPossible=nMatched=nSearched=0
t1=time.time()
for j in range(len(mols)):
mfp = molFps[j]
for k in range(len(pattFps)):
pfp = pattFps[k]
nPossible+=1
if DataStructs.AllProbeBitsMatch(pfp,mfp):
fpMatch=True
nSearched+=1
else:
fpMatch=False
if fpMatch and mols[j].HasSubstructMatch(patts[k]):
nMatched+=1
if not fpMatch:
print 'mismatch:',j,k
print ' ',Chem.MolToSmiles(mols[j])
print ' ',Chem.MolToSmiles(patts[k])
if not (j+1)%500:
logger.info(' done %d (%d total, %d searched, %d found)'%(j+1,nPossible,nSearched,nMatched))
t2 = time.time()
logger.info('FINISHED %d (%d total, %d searched, %d found) in %.2f'%(j+1,nPossible,nSearched,nMatched,t2-t1))
logger.info(' screenout: %.2f, accuracy: %.2f'%(1.*nSearched/nPossible, 1.*nMatched/nSearched))
------------------------------------------------------------------------------
Learn Windows Azure Live! Tuesday, Dec 13, 2011
Microsoft is holding a special Learn Windows Azure training event for
developers. It will provide a great way to learn Windows Azure and what it
provides. You can attend the event by watching it streamed LIVE online.
Learn more at http://p.sf.net/sfu/ms-windowsazure
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss