Dear all,

I have checked in an alternate version of the code that does topological (i.e. Daylight-like) fingerprinting. The new code uses a simpler (and a bit faster) approach to generate hashes from the subgraphs. It uses the same subgraphs as the old fingerprinter, so one would expect comparable similarities from the two approaches.

As a quick test, I generated fingerprints for 500 random molecules from the PubChem HTS set and computed their pairwise similarities using both the old fingerprints and the new ones. The scatter plot of new similarity vs. old similarity is attached. The correlation is clearly linear and the scatter isn't bad. The histogram of deltas (old - new) is also pretty symmetrical, so I'm pretty happy with the new approach.

The new fingerprinter is accessible from Python as Chem.RDKFingerprint2.

If anyone has the chance to try the new code out, please let me know what you find. I'm going to go through the code a few more times to tune it a bit and see if I can find any problems, but it would be very useful if someone else also tried it. I'm particularly interested in molecule pairs that show a very high similarity (>0.9 or 0.95) with one method and a lower similarity with the other (delta > 0.1), because these are the hard cases.

-greg
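
For anyone hunting for such pairs, the check could be sketched roughly as follows. This is only an illustration, not the code used for the test above: fingerprints are represented as plain Python sets of on-bit indices, the helper names (tanimoto, find_hard_cases) and the toy data are made up, and Tanimoto is assumed as the similarity measure.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def find_hard_cases(old_fps, new_fps, high=0.9, min_delta=0.1):
    """Return (i, j, old_sim, new_sim) for pairs that are very similar
    with one fingerprint type but differ by more than min_delta."""
    hits = []
    for i, j in combinations(range(len(old_fps)), 2):
        s_old = tanimoto(old_fps[i], old_fps[j])
        s_new = tanimoto(new_fps[i], new_fps[j])
        if max(s_old, s_new) > high and abs(s_old - s_new) > min_delta:
            hits.append((i, j, s_old, s_new))
    return hits


# Toy data (hypothetical): three "molecules", each fingerprinted two ways.
old_fps = [set(range(1, 11)), set(range(1, 11)), {1, 2, 20}]
new_fps = [set(range(1, 11)), set(range(1, 9)) | {11, 12}, {1, 2, 20}]

hits = find_hard_cases(old_fps, new_fps)
for i, j, s_old, s_new in hits:
    print(f"pair ({i}, {j}): old={s_old:.3f} new={s_new:.3f}")
```

With real molecules, the sets would presumably be built from the two fingerprinters, for example set(fp.GetOnBits()) on the results of Chem.RDKFingerprint and Chem.RDKFingerprint2, assuming both return bit vectors that support GetOnBits().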
<<attachment: delta_sim.png>>
<<attachment: rdfp_rdfp2.png>>

