Re: [Rdkit-discuss] request for feedback/testing

Greg Landrum Sun, 11 Dec 2011 21:19:09 -0800

On Sun, Dec 11, 2011 at 10:33 PM, Andrew Dalke
<da...@dalkescientific.com> wrote:
>
>
>
> Since I was curious, I did the evaluation at different bit sizes:
>
> 1 bit : screenout: 0.77, accuracy: 0.03  in 363.03 seconds (988264 found)
>
> 8 bits: screenout: 0.21, accuracy: 0.11  in 166.48 seconds (947411 found)
> 9 bits: screenout: 0.18, accuracy: 0.13  in 151.55 seconds (947411 found)
>
> 16 bits: screenout: 0.11, accuracy: 0.21  in 123.12 seconds (941417 found)
> 17 bits: screenout: 0.11, accuracy: 0.21  in 122.03 seconds (941417 found)
> 18 bits: screenout: 0.10, accuracy: 0.22  in 119.35 seconds (939757 found) [& 
> one O match]
> 19 bits: screenout: 0.10, accuracy: 0.23  in 118.23 seconds (938363 found) [& 
> 2 unique CNC matches]
> 20 bits: screenout: 0.08, accuracy: 0.30  in 108.38 seconds (933944 found) [& 
> one s match]
>
> 32 bits: screenout: 0.06, accuracy: 0.37  in  97.38 seconds (933098 found)
> 40 bits: screenout: 0.05, accuracy: 0.42  in  97.26 seconds (915296 found)
> 48 bits: screenout: 0.05, accuracy: 0.44  in  91.79 seconds (896178 found)
> 55 bits: screenout: 0.05, accuracy: 0.45  in  90.68 seconds (893822 found)
>
>
> I'm very concerned about the differences in the number found at different bit 
> sizes. Earlier when I saw a difference between what your machine reported and 
> what mine reported, I thought it was a difference between our versions of 
> RDKit, but I see that as I add bit patterns, I find fewer hits. That means 
> that the fingerprint screen isn't working as I thought it would.
>
> I don't see anything wrong in my pattern definitions. They should work 
> perfectly as substructure filters and it should always report the same number 
> of hits found.
>
> I did more tests around 16-20 bits to isolate a bit which triggers the 
> problem. You can see that they are tests for things like "has an aliphatic 
> oxygen" and "has an aromatic sulfur", which shouldn't cause any problems.
>
> Greg, can you enlighten me as to why the number found changes as I add more 
> bits?


Yeah, I should have noticed this when I sent you the original results.
It's easy to track down these problems using the script I sent by
changing line 159 (I think) from:
            if fpMatch and mols[j].HasSubstructMatch(patts[k]):
to
            if mols[j].HasSubstructMatch(patts[k]):
that runs the substructure search every time and reports when the
fingerprint screen prevents you from finding a match. I had it turned
off because it makes the timing information useless and takes longer
to run tests.

Here's the first failure I get:
  mismatch: 0 199
     Oc1c2nc[nH]c2c(O)nn1
     N
The RDKit builds the molecules from SMILES, so the top molecule does
have a substructure match for the lower one.

My script interprets your fingerprint bit definitions as SMARTS, so
your new bit 9:
"9 2 times N largest 2854"
is 2 times aliphatic N.

Clearly the query molecule sets that bit while the pool molecule does not.

An easy solution is to interpret your query molecules as SMILES and to
disable the RDKit sanitization. You don't have anything in the queries
that requires a SMARTS parser, so this works just fine.

Doing this with your second bit definitions I get the following for
the pubchem pieces queries:

[06:04:27] INFO: FINISHED 50001 (41150823 total, 2888986 searched,
1118055 found) in 81.77
[06:04:27] INFO:   screenout: 0.07, accuracy: 0.39


>
> For comparison, I also generated the 871 substructure keys I've developed, 
> which are cross-toolkit SMARTS patterns derived closely from CACTVS/PubChem 
> substructure keys. After 10 minutes and 37 seconds of SMARTS matching to 
> generate the structure fingerprints, it started searching.
>
> PubChem/CACTVS fingerprint screen results:
>
> [22:16:21] INFO: FINISHED 50001 (41150823 total, 573179 searched, 327042 
> found) in 98.06
> [22:16:21] INFO:   screenout: 0.01, accuracy: 0.57
>
>
> Hmmmm. Only 327042 found?

Probably the same problem? I haven't looked at your implementation there.

> Leaving all that aside, it's pretty clear that my method doesn't do well with 
> negatives, that is, with queries containing substructures that aren't in the 
> targets. I threw away all of the patterns which I knew were in the training 
> set and not in the targets, when I should be using it somehow.
>
> Perhaps my method is useful to enrich an existing fingerprint?

Including some bits associated with counts of "important" fragments, a
hybrid between the systematic search the RDKit is doing and your
definitions, is an interesting idea.

Tacking your most recent 54 bits, interpreted as SMILES, onto the end
of a shorter version of my new fingerprint gives:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched,
1118055 found) in 80.30
[06:10:40] INFO:   screenout: 0.03, accuracy: 0.84

The original performance for the shorter fingerprint was:
[05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched,
1118055 found) in 90.90
[05:04:23] INFO:   screenout: 0.04, accuracy: 0.66

so those bits definitely help with the screenout.

Putting the bits at the beginning of the fingerprint gets a bit more speed:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched,
1118055 found) in 71.09
[06:14:48] INFO:   screenout: 0.03, accuracy: 0.84

I've attached a script for doing this test.

-greg

"""
Results:

  Andrew's bits at the end of the FP:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30
[06:10:40] INFO:   screenout: 0.03, accuracy: 0.84

  Andrew's bits at the beginning of the FP:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09
[06:14:48] INFO:   screenout: 0.03, accuracy: 0.84


"""
from rdkit import Chem,DataStructs
import cPickle,time,gzip 
from rdkit.RDLogger import logger
logger=logger()

qs="""0 2 times O largest 55458
1 2 times Ccc largest 29602
2 1 times CCN largest 16829
3 1 times cnc largest 11439
4 1 times cN largest 8998
5 1 times C=O largest 7358
6 1 times CCC largest 6250
7 1 times S largest 4760
8 1 times c1ccccc1 largest 4524
9 2 times N largest 2854
10 1 times C=C largest 2162
11 1 times nn largest 1840
12 2 times CO largest 1248
13 1 times Ccn largest 964
14 1 times CCCCC largest 857
15 1 times cc(c)c largest 653
16 3 times O largest 653
17 1 times O largest 466
18 2 times CNC largest 464
19 1 times s largest 457
20 1 times CC(C)C largest 335
21 1 times o largest 334
22 1 times cncnc largest 334
23 1 times C=N largest 321
24 2 times CC=O largest 238
25 4 times Ccc largest 238
26 1 times Cl largest 230
27 4 times O largest 149
28 2 times ccncc largest 149
29 6 times CCCCCC largest 76
30 2 times c1ccccc1 largest 76
31 1 times F largest 75
32 3 times CCOC largest 44
33 3 times N largest 44
34 1 times c(cn)n largest 44
35 1 times N largest 41
36 9 times C largest 41
37 1 times CC=C(C)C largest 33
38 1 times c1ccncc1 largest 26
39 1 times CC(C)N largest 26
40 1 times CC largest 26
41 4 times CCC(C)O largest 25
42 2 times ccc(cc)n largest 21
43 6 times C largest 21
44 1 times C1CCCC1 largest 18
45 1 times C largest 18
46 5 times O largest 18
47 2 times Ccn largest 14
48 1 times CNCN largest 13
49 3 times cncn largest 13
50 1 times CSC largest 13
51 3 times CC=O largest 11
52 1 times CCNCCCN largest 11
53 1 times CccC largest 11
54 3 times ccccc(c)c largest 10"""


ssPatts=None
def GetSubstructFP(m,verbose=False):
    global ssPatts
    if ssPatts is None:
        ssPatts=[]
        for q in qs.split('\n'):
            q = q.split(' ')
            count = int(q[1])
            patt = Chem.MolFromSmiles(q[3],sanitize=False)
            ssPatts.append((patt,count))
    sz = len(ssPatts)
    lfp=Chem.LayeredFingerprint2(x,maxPath=4,fpSize=1024)
    res = DataStructs.ExplicitBitVect(1024+sz)
    for idx in lfp.GetOnBits():   # <-- not the efficient way of doing things
        res.SetBit(sz+idx)
    for i,(p,count) in enumerate(ssPatts):
        matches = m.GetSubstructMatches(p,uniquify=True)
        if len(matches)>=count:
            res.SetBit(i)

    return res
            

if __name__ == '__main__':
    import random
    logger.info('reading queries')
    if 1:
        pattData = gzip.open('fragqueries.txt.gz','rb').readlines()
        patts = [Chem.MolFromSmiles(x) for x in pattData]
    else:
        pattData = [x.strip().split('\t')[0] for x in file('zinc.leads.diverse.smi')]
        #pattData = [x.strip().split('\t')[0] for x in file('zinc.frags.diverse.smi')]
        random.seed(23)
        random.shuffle(pattData)
        patts = [Chem.MolFromSmiles(x) for x in pattData[:520]]
        patts = [x for x in patts if x is not None][:500]

    logger.info('reading mols')
    mols = []
    inf=file('zinc.100k.pkl','rb')
    while 1:
        try:
            id,m = cPickle.load(inf)
        except EOFError:
            break
        else:
            mols.append(m)
            if len(mols)>50000: break


    logger.info('generating pattern fps:')
    pattFps = [GetSubstructFP(x) for x in patts]

    logger.info('generating molecules fps:')
    molFps = [GetSubstructFP(x) for x in mols]

    pattMisses=[0]*len(pattFps)
    pattHits=[0]*len(pattFps)

    logger.info('scanning')
    t1 = time.time()
    nPossible=nMatched=nSearched=0
    t1=time.time()
    for j in range(len(mols)):
        mfp = molFps[j]
        for k in range(len(pattFps)):
            pfp = pattFps[k]
            nPossible+=1
            if DataStructs.AllProbeBitsMatch(pfp,mfp):
                fpMatch=True
                nSearched+=1
            else:
                fpMatch=False
            if fpMatch and mols[j].HasSubstructMatch(patts[k]):
                nMatched+=1
                if not fpMatch:
                    print 'mismatch:',j,k
                    print '  ',Chem.MolToSmiles(mols[j])
                    print '  ',Chem.MolToSmiles(patts[k])
                    raise ValueError
        if not (j+1)%500:
            logger.info('  done %d (%d total, %d searched, %d found)'%(j+1,nPossible,nSearched,nMatched))
    t2 = time.time()
    logger.info('FINISHED %d (%d total, %d searched, %d found) in %.2f'%(j+1,nPossible,nSearched,nMatched,t2-t1))
    logger.info('  screenout: %.2f, accuracy: %.2f'%(1.*nSearched/nPossible, 1.*nMatched/nSearched))

------------------------------------------------------------------------------
Learn Windows Azure Live!  Tuesday, Dec 13, 2011
Microsoft is holding a special Learn Windows Azure training event for 
developers. It will provide a great way to learn Windows Azure and what it 
provides. You can attend the event by watching it streamed LIVE online.  
Learn more at http://p.sf.net/sfu/ms-windowsazure

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] request for feedback/testing

Reply via email to