Dear all,

I found some time this morning to go back to the substructure search
optimization work I started back in February. This morning I finished
porting the vf2 implementation from the vflib (what the RDKit has been
using) to the boost graph library. This allows vf2 to be used from the
RDKit without having to first do the data-structure translation step
that has been required up until now.

The results are pleasing.

Using the benchmarking code in $RDBASE/Data/SmartsLib/tests I get the
following timings on my linux box:
  VFLib implementation of vf2: 49 seconds
  RDKit implementation of ullmann algorithm: 38 seconds
  RDKit implementation of vf2: 21 seconds
This test uses a set of standard functional and reactive group queries
(including some recursive SMARTS) to search a set of PubChem
compounds. There are about 450 queries and 1000 molecules.

Clearly vf2 is more efficient than ullmann here. The relatively poor
performance of the vflib implementation of vf2 is almost certainly due
to the translation step from a boost graph data structure to a vflib
data structure.

Using a test set of ~4000 "pubchem fragments" (pieces that result from
applying RECAP to pubchem molecules) to search pubchem molecules, I
get the following timings for the first million searches:
  VFLib implementation of vf2: 240 seconds
  RDKit implementation of ullmann algorithm: 288 seconds
  RDKit implementation of vf2: 108 seconds

So the search time has been more or less halved.

The new code is on a branch at the moment:
http://rdkit.svn.sourceforge.net/viewvc/rdkit/branches/vf2_9April2009/
but I will move it onto the trunk in the not-too-distant future after
I've had a chance to test it a bit more.

-greg

Reply via email to