Hi Andrew,
There is some discussion of the parameter at
https://docs.chemaxon.com/display/docs/fingerprints_chemical-hashed-fingerprint.md#src-1806332-chemicalhashedfingerprint-references
I would also assume that it is mainly relevant for database pre-screens
in substructure searches and provides a simple way to tune the
"darkness" of the fingerprint.
Hope this helps,
Nils
On 1/27/2026 12:52 PM, Andrew Dalke wrote:
Hi all,
Does anyone here have experience in using different values for the
numBitsPerFeature parameter of the RDKit fingerprint generator, or can point me
to a publication exploring that parameter? I suspect it's not that useful, and
the default should be 1 instead of 2.
Quoting the documentation, numBitsPerFeature sets "the number of bits set per
path/subgraph found".
As I understand the history, this parameter derives from the Daylight
documentation, at
https://www.daylight.com/dayhtml/doc/theory/theory.finger.html , which says:
"Instead, each pattern serves as a seed to a pseudo-random number generator (it is
"hashed"), the output of which is a set of bits (typically 4 or 5 bits per pattern);"
I've been working on a related topic - count emulation using binary
fingerprints. For each count C and fingerprint size N I select a random number
in the range 0..N-1 (i.e., randrange(N)) and set the corresponding bit to 1,
repeating this C times.
I thought the numBitsPerFeature equivalent would be useful, that is, repeat the
sampling numBitsPerFeature*C times. I thought this would be more likely to
identify near neighbors as it would increase the number of shared bits between
two similar fingerprints.
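A minimal sketch of the sampling scheme described above, in Python. The
function name, the {feature_id: count} input, and the choice to seed the
generator with the feature id (so the same feature gives the same bit
positions in every molecule) are assumptions made for illustration, not
details taken from the preprint.

```python
import random

def count_emulated_bits(feature_counts, n_bits, bits_per_count=1):
    """Return the set of bit positions for a {feature_id: count} mapping.

    Hypothetical helper: each feature draws bits_per_count * count
    positions with randrange(n_bits).
    """
    bits = set()
    for feature_id, count in feature_counts.items():
        # Assumption: seeding by feature id makes the draws reproducible,
        # so a higher count only ever adds bits to a lower count's set.
        rng = random.Random(feature_id)
        for _ in range(bits_per_count * count):
            bits.add(rng.randrange(n_bits))
    return bits

# Feature 7 seen twice, feature 11 seen once, 1024-bit fingerprint.
fp_one = count_emulated_bits({7: 2, 11: 1}, 1024)
# Same molecule with the numBitsPerFeature-like multiplier set to 2.
fp_two = count_emulated_bits({7: 2, 11: 1}, 1024, bits_per_count=2)
```

With this seeding, fp_one is a subset of fp_two, because the second call
continues the same per-feature random sequences.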
I tested my method against the exact solution. I found that numBitsPerFeature
was not useful. That is, numBitsPerFeature=1 for a given N was essentially
always better than numBitsPerFeature=2 for the same number of bits N.
I did find that numBitsPerFeature=2 for 2*N bits was slightly better than
numBitsPerFeature=1 for N bits, but again numBitsPerFeature=1 for 2*N bits was
still better than numBitsPerFeature=2 for 2*N bits.
(See my preprint at https://chemrxiv.org/doi/full/10.26434/chemrxiv-2026-j3hbj )
I tried to figure this out mathematically. My simple attempt says the
numBitsPerFeature shouldn't affect things at all. In short, if the original
fingerprints have A and B features, with C features in common and A and B much
less than N, then the number of bits set by the fingerprints is approximately
f(k) = N(1-exp(-k/N)), i.e. a = N(1-exp(-A/N)) and b = N(1-exp(-B/N)). This
formula is related to the Birthday Problem.
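The approximation f(k) = N(1-exp(-k/N)) can be checked with a quick
simulation. This is only a sketch: it assumes every feature sets one
independent, uniformly distributed bit, and the function names are
illustrative.

```python
import math
import random

def expected_bits_set(k, n_bits):
    """Approximate number of distinct bits after k uniform random draws."""
    return n_bits * (1.0 - math.exp(-k / n_bits))

def simulated_bits_set(k, n_bits, trials=200, seed=0):
    """Average number of distinct bits over several simulated fingerprints."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # A set keeps only the distinct positions, like bits in a fingerprint.
        total += len({rng.randrange(n_bits) for _ in range(k)})
    return total / trials

approx = expected_bits_set(100, 2048)
observed = simulated_bits_set(100, 2048)
```

For 100 features in 2048 bits the two values should agree to within a
fraction of a bit, and both are a little below 100 because of collisions.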
If we assume C maps the same way then the Tanimoto is
T(fp_A, fp_B) = c / (a + b - c)
T(fp_A, fp_B) = N(1-exp(-C/N)) /
(N(1-exp(-A/N)) + N(1-exp(-B/N)) - N(1-exp(-C/N)))
If the number of bits per feature is doubled, and the number of bits is also
doubled, then the Tanimoto score is unchanged because the ratio (2*k)/(2*N)
stays constant.
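A small numerical check of that invariance, using the same approximation.
The feature counts (60, 80, 40) are made-up example values, and
approx_tanimoto is a hypothetical helper that simply evaluates the formula
above with k replaced by bits_per_feature*k.

```python
import math

def f(k, n_bits):
    """Approximate number of bits set by k random draws into n_bits."""
    return n_bits * (1.0 - math.exp(-k / n_bits))

def approx_tanimoto(A, B, C, n_bits, bits_per_feature=1):
    """Tanimoto under the simple model where c is taken to be f(C)."""
    a = f(bits_per_feature * A, n_bits)
    b = f(bits_per_feature * B, n_bits)
    c = f(bits_per_feature * C, n_bits)
    return c / (a + b - c)

# One bit per feature in N bits versus two bits per feature in 2*N bits.
t_one = approx_tanimoto(60, 80, 40, 1024, bits_per_feature=1)
t_two = approx_tanimoto(60, 80, 40, 2048, bits_per_feature=2)
```

Under this model t_one and t_two are equal, since f(2k, 2N) = 2*f(k, N) and
the factor of 2 cancels in the ratio.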
However, I don't think c (which is the number of bits in common as measured in
the final fingerprints) is correctly computed as f(C) because of the higher
chance of coincidental overlap with portions of (A-C) and (B-C). This analysis,
alas, is beyond my mathematical abilities.
Still, my simulations suggest that setting more than one bit per feature isn't
that useful.
I suspect this same conclusion would hold with the RDKit fingerprint generator,
that is, I suspect numBitsPerFeature=1 would give slightly more accurate
matches than numBitsPerFeature=2. Furthermore, it would improve the accuracy
for the current default of 2048 bits, and the 1024-bit version would be almost
as good as the current 2048 bits.
I'll add that the RDKit fingerprint generator is used for similarity, while the
Daylight fingerprints were also used as substructure search screens. In the
latter, the number of bits affects screenout, for information content reasons.
I've been told the Daylight fingerprint set a different number of bits
depending on the fingerprint length.
Best regards,
Andrew
[email protected]
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss