Dear all, I found some time this morning to go back to the substructure search optimization work I started back in February. This morning I finished porting the vf2 implementation from the vflib (what the RDKit has been using) to the boost graph library. This allows vf2 to be used from the RDKit without having to first do the data-structure translation step that has been required up until now.
The results are pleasing. Using the benchmarking code in $RDBASE/Data/SmartsLib/tests I get the following timings on my linux box: VFLib implementation of vf2: 49 seconds RDKit implementation of ullmann algorithm: 38 seconds RDKit implementation of vf2: 21 seconds This test uses a set of standard functional and reactive group queries (including some recursive SMARTS) to search a set of PubChem compounds. There are about 450 queries and 1000 molecules. Clearly vf2 is more efficient than ullmann here. The relatively poor performance of the vflib implementation of vf2 is almost certainly due to the translation step from a boost graph data structure to a vflib data structure. Using a test set of ~4000 "pubchem fragments" (pieces that result from applying RECAP to pubchem molecules) to search pubchem molecules, I get the following timings for the first million searches: VFLib implementation of vf2: 240 seconds RDKit implementation of ullmann algorithm: 288 seconds RDKit implementation of vf2: 108 seconds So the search time has been more or less halved. The new code is on a branch at the moment: http://rdkit.svn.sourceforge.net/viewvc/rdkit/branches/vf2_9April2009/ but I will move it onto the trunk in the not-too-distant future after I've had a chance to test it a bit more. -greg