I am trying to relate the reliability of ML models trained on binary 
fingerprints to the presence of on-bits, i.e. comparing the on-bits of a 
molecule in the test set to the on-bits in the training set. But I am getting 
some strange results.

The code is here, so I will just summarise:
https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing

I pick 1000 random molecules from ZINC as my "training set" and compute ECFP4 
fingerprints using nBits=10_000; this gives 6226 unique on-bits. I use 
nBits=10_000 to try to avoid collisions.

Then I compute the on-bits for a molecule ("test set") that is very different 
from any in the "training set" (OOOOO) and compare them to the 6226 unique 
"training set" on-bits. Of the 7 on-bits for OOOOO shown here, five are found 
among the 6226 "training set" on-bits: 656, 2311, 4453, 4487, and 8550.

However, 656, 4453, and 8550 correspond to different fragments in the 
"training set".
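The way I compared fragments per bit was via RDKit's bitInfo bookkeeping. A minimal sketch (assuming OOOOO is the SMILES of the test molecule, and radius=2 for ECFP4):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# ECFP4 = Morgan fingerprint with radius 2; bit_info records, for every
# on-bit, the (atom index, radius) of each environment that set it.
mol = Chem.MolFromSmiles("OOOOO")  # assumption: OOOOO is the test SMILES
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=10_000,
                                           bitInfo=bit_info)

for bit in sorted(bit_info):
    atom_idx, radius = bit_info[bit][0]
    if radius == 0:
        # radius-0 environments are single atoms; no bonds to extract
        frag = mol.GetAtomWithIdx(atom_idx).GetSymbol()
    else:
        # Rebuild the atom environment behind this bit as a fragment SMILES
        env_bonds = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx)
        atoms = set()
        for b in env_bonds:
            bond = mol.GetBondWithIdx(b)
            atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        frag = Chem.MolFragmentToSmiles(mol, atomsToUse=list(atoms),
                                        bondsToUse=list(env_bonds))
    print(bit, frag)
```

Running the same loop over the training-set molecules lets you compare, for each shared bit index, whether the underlying fragments actually match.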

The only explanation I can think of is bit collisions in the hashing, but 
there are 10000 - 6226 = 3774 unused positions, so I did not expect many 
collisions.
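To get a feeling for how likely collisions are despite the unused positions, one can model the hashing as uniform random assignment (the Morgan algorithm hashes each environment to a 32-bit identifier and then folds it modulo nBits). The fragment count below is an assumption: at least as many distinct fragments as the 6226 observed on-bits.

```python
import random

n_bits = 10_000
n_fragments = 6_226  # assumption: one distinct fragment per observed on-bit

random.seed(0)
# Model Morgan hashing: distinct 32-bit environment ids, folded modulo nBits.
fragment_ids = random.sample(range(2**32), n_fragments)
on_bits = {fid % n_bits for fid in fragment_ids}

collisions = n_fragments - len(on_bits)
unused = n_bits - len(on_bits)
print(f"on-bits: {len(on_bits)}, collisions: {collisions}, unused: {unused}")
```

By the birthday problem, this setup is expected to produce on the order of 1,500-1,600 collisions while still leaving thousands of positions unused, so unused positions by themselves say little about collision-freeness.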

Is there any other explanation? If not, what does that say about using bit 
vectors (especially the usual nBits = 2048) as descriptors?

Any insight is greatly appreciated.

Best regards, Jan
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
