Dear all,

I have checked in an alternative version of the code that does
topological (i.e. Daylight-like) fingerprinting. The new code uses a
simpler (and slightly faster) approach to generate hashes from the
subgraphs. It uses the same subgraphs as the old fingerprinter, so one
would expect the two approaches to give comparable similarities.

As a quick test I generated fingerprints for 500 random molecules from
the PubChem HTS set and computed all pairwise similarities twice, once
with the old fingerprints and once with the new ones. The scatter plot
of new similarity vs old similarity is attached (rdfp_rdfp2.png). The
correlation is clearly linear and the scatter isn't bad. The histogram
of deltas (old - new, attached as delta_sim.png) is also fairly
symmetrical, so I'm pretty happy with the new approach.
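
For anyone who wants to repeat this kind of comparison, here is a
minimal sketch of the pairwise step. It is not the script I used for
the plots. It assumes Tanimoto similarity and assumes each fingerprint
has already been reduced to a Python set of on-bit indices (for an
RDKit bit vector something like set(fp.GetOnBits()) should do). The
names tanimoto and pairwise_deltas are made up for illustration.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices.

    Assumption: two empty fingerprints are given similarity 0.0 here;
    that is a convention chosen for this sketch only.
    """
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def pairwise_deltas(old_fps, new_fps):
    """Compare two fingerprint sets molecule pair by molecule pair.

    old_fps[i] and new_fps[i] must describe the same molecule.
    Returns a list of (i, j, old_sim, new_sim, old_sim - new_sim)
    for every pair with i < j.
    """
    if len(old_fps) != len(new_fps):
        raise ValueError("fingerprint lists must have the same length")
    rows = []
    for i, j in combinations(range(len(old_fps)), 2):
        old_sim = tanimoto(old_fps[i], old_fps[j])
        new_sim = tanimoto(new_fps[i], new_fps[j])
        rows.append((i, j, old_sim, new_sim, old_sim - new_sim))
    return rows
```

Plotting the fourth column against the third gives the scatter plot,
and a histogram of the last column gives the delta distribution.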

The new fingerprinter is accessible from Python as Chem.RDKFingerprint2.

If anyone has the chance to try the new code out, please let me know
what you find. I will go through the code a few more times to tune it
a bit and look for problems, but it would be very useful if someone
else also tried it. I'm particularly interested in molecule pairs that
show a very high similarity (>0.9 or 0.95) with one method and a
noticeably lower similarity with the other (delta > 0.1), because
these are the hard cases.
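
To make that request concrete, here is a small sketch of the filter I
have in mind. It assumes you already have, for each molecule pair, the
similarity from the old and the new fingerprinter. The function name
hard_cases and the tuple layout are hypothetical; the thresholds are
the ones mentioned above.

```python
def hard_cases(pairs, high=0.9, min_delta=0.1):
    """Pick out pairs where the two fingerprinters clearly disagree.

    pairs: iterable of (id_a, id_b, old_sim, new_sim) tuples.
    A pair is flagged when either method gives a similarity above
    `high` and the two similarities differ by more than `min_delta`.
    Returns (id_a, id_b, old_sim, new_sim, delta) tuples, largest
    absolute delta first.
    """
    flagged = []
    for id_a, id_b, old_sim, new_sim in pairs:
        delta = abs(old_sim - new_sim)
        if max(old_sim, new_sim) > high and delta > min_delta:
            flagged.append((id_a, id_b, old_sim, new_sim, delta))
    flagged.sort(key=lambda row: row[4], reverse=True)
    return flagged
```

Calling hard_cases(pairs, high=0.95) applies the stricter cutoff.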

-greg

<<attachment: delta_sim.png>>

<<attachment: rdfp_rdfp2.png>>
