Hi Greg,
just a curiosity ...
765534 vs 76522
is one a subset of the other? If not - would it make sense to test on
both?
Just a thought. Apart from that I think the setup is reasonable for most
applications we will have ...
Nik
Greg Landrum greg.land...@gmail.com
10.02.2009 15:11
To
RDKit Discuss rdkit-discuss@lists.sourceforge.net
cc
Subject
[Rdkit-discuss] Optimizing SSS in the RDKit
Dear all
Andrew's question about fingerprints hit me at the right time: I had
just finished doing some optimization work on the RDKit substructure
search machinery (removing the vflib dependency). The details are
here:
http://code.google.com/p/rdkit/wiki/SubgraphIsomorphismOptimization
It would be quite interesting to use the new Ullmann code as a
framework and do an implementation of the VF or VF2 algorithms used in
vflib.
Of course there's no better way to optimize subgraph isomorphism than
to avoid it all together, which is where the fingerprints mentioned
come in. I'm spending a couple of days home from work (with a cold),
so I have some room to explore here a little bit.
I put together a sandbox using my 1000 pubchem molecules (they're from
the HTS set, so they are all either drug-like or lead-like, whatever
that means). To get a set of molecule-like queries, I fragmented
those 1000 molecules using RECAP and kept the 823 unique fragments I
got.
I've been using those 823 molecules to query the full set of 1000
molecules and looking at how many calls to the isomorphism code I can
avoid using either the RDKit (daylight-like) fingerprints or the
layered fingerprints (out to layer 0x4, beyond that these aren't
suitable for SSS).
The results look pretty encouraging: I can easily filter out more than
90% of the comparisons via fingerprints without losing anything. There
are 823000 (823x1000) possible comparisons with my dataset; using the
RDKit fingerprints as a screen I filter out 765534 of them (93%) using
the layered fingerprints I filter out 765224 (also 93%). The screening
[not even remotely optimized, I'm calculating (AB)==A instead of
doing it on the fly and short circuiting when something mismatches]
takes about 10 seconds in each case.
By default each fingerprint uses 2048 bits. I can shrink this by
folding the fingerprints (or generating them shorter in the first
place... the end result is the same). That potentially gains speed and
certainly saves storage space, but there may be a cost at how
discriminating the fingerprints are.
Experiment 1: reduce fps to 1024 bits
RDK fingerprints: filter out 717356 (87%)
Layered: filter out 752948 (91%)
No obvious speed improvement
Experiment 2: reduce fps to 512 bits
RDK fingerprints: filter out 441529 (54%)
Layered: filter out 710647 (86%)
10-15% faster
The layered fps are clearly more robust w.r.t. fingerprint size (which
makes sense: I only set one bit per path there as opposed to 4 per
path for the RDKit fp; a good experiment would be to try the RDKit fps
with one bit per path). They're also faster to generate (they no
longer require a PRNG).
I think the screening speed thing is a bit of a red herring at the
moment since I'm not doing a smart screen, but there is a real impact
on storage space.
So what does the community think? Interesting results? Arguments
about my testing/benchmarking methodology? Obvious next steps?
Suggestions for improving the layered fps so that they're even more
discriminating?
-greg
--
Create and Deploy Rich Internet Apps outside the browser with
Adobe(R)AIR(TM)
software. With Adobe AIR, Ajax developers can use existing skills and code
to
build responsive, highly engaging applications that combine the power of
local
resources and data with the reach of the web. Download the Adobe AIR SDK
and
Ajax docs to start building applications
today-http://p.sf.net/sfu/adobe-com
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
_
CONFIDENTIALITY NOTICE
The information contained in this e-mail message is intended only for the
exclusive use of the individual or entity named above and may contain
information that is privileged, confidential or exempt from disclosure
under applicable law. If the reader of this message is not the intended
recipient, or the employee or agent responsible for delivery of the
message to the intended recipient, you are hereby notified that any
dissemination, distribution or copying of this communication is strictly
prohibited. If you have received this communication in error, please
notify the sender immediately by e-mail and delete the material from any
computer. Thank you.