Re: [Rdkit-discuss] request for feedback/testing

Greg Landrum Sat, 10 Dec 2011 21:36:56 -0800

Hi Andrew,

On Sun, Dec 11, 2011 at 3:40 AM, Andrew Dalke <da...@dalkescientific.com> wrote:
>
>  The testing approach looks good. I think it's better to use
> fragments against a data set, as you do, rather than how some
> I've seen who use whole molecules taken from the data set. I'm
> one of them, since I don't have a good set of queries.
>
>  Could you package your queries up so I can use them?


The "pubchem pieces" queries are in svn:
http://rdkit.svn.sourceforge.net/viewvc/rdkit/trunk/Regress/Data/queries.txt.gz?revision=1137
I'm sure I can dig out the script I used to produce that set if you
want to make more.

>  About a year ago I started experimenting with work based
> on subgraph enumeration. I wondered if some sort of brute-force
> greedy algorithm would find good substructure filter patterns.
> The preliminary results were very surprising - my test code
> needed surprisingly few bits to reduce the search space
> drastically. However, that's assuming the queries are also
> the targets.
>
>
>  Your email the other day got me excited on working on the
> topic again, so I reassembled my code, finished a few incomplete
> parts, and fixed some bugs. Tomorrow I plan to do validation
> and timings.
>
>
>  My greedy algorithm is rather slow and a memory hog so I'm
> only able to analyze the compounds in PubChem records up to
> 000175000. (That's the first 7 Compound files.) The first
> set of output is after my signature. I'm letting a more
> sensitive, and slower, version run overnight.
>
>  Care to try out the patterns?

Assuming I implemented your FPs correctly (code attached), this is what I get:

#--------------------
zinc frags:
[05:26:04] INFO: FINISHED 50001 (25000500 total, 839780 searched, 7480
found) in 42.94
[05:26:04] INFO:   screenout: 0.03, accuracy: 0.01

zinc leads:
[05:27:56] INFO: FINISHED 50001 (25000500 total, 280606 searched, 602
found) in 35.98
[05:27:56] INFO:   screenout: 0.01, accuracy: 0.00

pubchem pieces:
[05:30:38] INFO: FINISHED 50001 (41150823 total, 2948920 searched,
964883 found) in 80.68
[05:30:38] INFO:   screenout: 0.07, accuracy: 0.33
#--------------------

What would be interesting, but I haven't done yet, would be to look at
screenout accuracy as a function of size for the pubchem pieces.

-greg

"""
Results:

zinc frags:
[05:26:04] INFO: FINISHED 50001 (25000500 total, 839780 searched, 7480 found) in 42.94
[05:26:04] INFO:   screenout: 0.03, accuracy: 0.01

zinc leads:
[05:27:56] INFO: FINISHED 50001 (25000500 total, 280606 searched, 602 found) in 35.98
[05:27:56] INFO:   screenout: 0.01, accuracy: 0.00

pubchem pieces:
[05:30:38] INFO: FINISHED 50001 (41150823 total, 2948920 searched, 964883 found) in 80.68
[05:30:38] INFO:   screenout: 0.07, accuracy: 0.33



"""
from rdkit import Chem,DataStructs
import cPickle,time,gzip 
from rdkit.RDLogger import logger
logger=logger()

qs="""0 6 times C largest 76320
1 1 times Ccc largest 42679
2 2 times O largest 26880
3 1 times N largest 20104
4 6 times c largest 14837
5 1 times CO largest 11893
6 1 times CN largest 9855
7 1 times cn largest 9108
8 1 times CCC(C)C largest 7736
9 1 times CCC largest 5799
10 1 times O largest 4603
11 1 times C=CC largest 4337
12 1 times ccO largest 4051
13 2 times C largest 3097
14 9 times CC largest 2561
15 4 times O largest 2561
16 1 times Cl largest 1895
17 1 times CC=O largest 1895
18 1 times ccN largest 1708
19 1 times C largest 1295
20 2 times CCOC largest 1295
21 4 times C largest 1295
22 2 times cccc(c)c largest 1178
23 1 times CCCCC largest 1178
24 1 times F largest 848
25 1 times C1CCCCC1 largest 819
26 1 times cc largest 665
27 8 times C largest 665
28 9 times CCCCCC largest 665
29 1 times CC largest 665
30 3 times ccnc largest 608
31 2 times c1ccccc1 largest 608
32 1 times S largest 583
33 1 times CC1CCCC1 largest 583
34 1 times Br largest 583
35 3 times O largest 405
36 7 times O largest 405
37 1 times CCO largest 405
38 3 times Cc(c)cc largest 405
39 1 times cccco largest 323
40 1 times CS largest 323
41 1 times cccncc largest 323
42 3 times CCC largest 323
43 3 times Cl largest 323
44 1 times OP largest 323
45 1 times I largest 320
46 5 times c1ccccc1 largest 295
47 3 times F largest 277
48 1 times O=SO largest 260
49 3 times C largest 256
50 5 times CCCCCCC largest 256
51 5 times CCCC(C)C largest 256
52 1 times cc(c(cCl)Cl)Cl largest 256
53 2 times Cl largest 256
54 1 times NO largest 248
55 1 times O[Si] largest 247
56 1 times [Na] largest 247
57 1 times [Si] largest 244
58 9 times CCCCCCC largest 244
59 1 times CP largest 238
60 1 times [K] largest 238
61 1 times C[Si] largest 233
62 1 times C[Sn] largest 228
63 1 times BO largest 228"""


ssPatts=None
def GetSubstructFP(m,verbose=False):
    global ssPatts
    if ssPatts is None:
        ssPatts=[]
        for q in qs.split('\n'):
            q = q.split(' ')
            count = int(q[1])
            patt = Chem.MolFromSmarts(q[3])
            ssPatts.append((patt,count))
    sz = len(ssPatts)
    res = DataStructs.ExplicitBitVect(sz)
    for i,(p,count) in enumerate(ssPatts):
        matches = m.GetSubstructMatches(p,uniquify=True)
        if len(matches)>=count:
            res.SetBit(i)
    return res
            

if __name__ == '__main__':
    import random
    logger.info('reading queries')
    if 1:
        pattData = gzip.open('fragqueries.txt.gz','rb').readlines()
        patts = [Chem.MolFromSmiles(x) for x in pattData]
    else:
        pattData = [x.strip().split('\t')[0] for x in file('zinc.leads.diverse.smi')]
        #pattData = [x.strip().split('\t')[0] for x in file('zinc.frags.diverse.smi')]
        random.seed(23)
        random.shuffle(pattData)
        patts = [Chem.MolFromSmiles(x) for x in pattData[:520]]
        patts = [x for x in patts if x is not None][:500]

    logger.info('reading mols')
    mols = []
    inf=file('zinc.100k.pkl','rb')
    while 1:
        try:
            id,m = cPickle.load(inf)
        except EOFError:
            break
        else:
            mols.append(m)
            if len(mols)>50000: break


    logger.info('generating pattern fps:')
    pattFps = [GetSubstructFP(x) for x in patts]

    logger.info('generating molecules fps:')
    molFps = [GetSubstructFP(x) for x in mols]

    pattMisses=[0]*len(pattFps)
    pattHits=[0]*len(pattFps)

    logger.info('scanning')
    t1 = time.time()
    nPossible=nMatched=nSearched=0
    t1=time.time()
    for j in range(len(mols)):
        mfp = molFps[j]
        for k in range(len(pattFps)):
            pfp = pattFps[k]
            nPossible+=1
            if DataStructs.AllProbeBitsMatch(pfp,mfp):
                fpMatch=True
                nSearched+=1
            else:
                fpMatch=False
            if fpMatch and mols[j].HasSubstructMatch(patts[k]):
                nMatched+=1
                if not fpMatch:
                    print 'mismatch:',j,k
                    print '  ',Chem.MolToSmiles(mols[j])
                    print '  ',Chem.MolToSmiles(patts[k])
        if not (j+1)%500:
            logger.info('  done %d (%d total, %d searched, %d found)'%(j+1,nPossible,nSearched,nMatched))
    t2 = time.time()
    logger.info('FINISHED %d (%d total, %d searched, %d found) in %.2f'%(j+1,nPossible,nSearched,nMatched,t2-t1))
    logger.info('  screenout: %.2f, accuracy: %.2f'%(1.*nSearched/nPossible, 1.*nMatched/nSearched))

------------------------------------------------------------------------------
Learn Windows Azure Live!  Tuesday, Dec 13, 2011
Microsoft is holding a special Learn Windows Azure training event for 
developers. It will provide a great way to learn Windows Azure and what it 
provides. You can attend the event by watching it streamed LIVE online.  
Learn more at http://p.sf.net/sfu/ms-windowsazure

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] request for feedback/testing

Reply via email to