I am trying to relate the reliability of ML models trained on binary fingerprints to the presence of on-bits, i.e. by comparing the on-bits of a molecule in the test set to the on-bits in the training set. However, I am getting some strange results.
The code is here, so I will just summarise:
https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing

I pick 1000 random molecules from ZINC as my "training set" and compute ECFP4 fingerprints using nBits=10_000, which I chose to try to avoid collisions. There are 6226 unique on-bits.

Then I compute the on-bits for a molecule (the "test set") that is very different from any in the "training set" (OOOOO) and compare them to the 6226 unique on-bits from the ZINC sample. Of the 7 on-bits for OOOOO (shown in the attached image), five are found among the 6226 "training set" on-bits: 656, 2311, 4453, 4487, 8550.

However, 656, 4453, and 8550 correspond to different fragments in the "training set".

The only reason I can think of is bit collisions in the hashing, but there are 10000 - 6226 = 3774 unused positions. Is there any other explanation? If not, what does that say about using bit vectors (especially with the usual nBits = 2048) as descriptors?

Any insight is greatly appreciated.

Best regards,
Jan
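
P.S. A back-of-the-envelope sketch of the collision arithmetic, using only the numbers quoted above (nBits = 10000, 6226 occupied positions, 7 on-bits for OOOOO). It is not taken from the notebook, and it rests on a loud assumption: that the hash spreads distinct atom environments uniformly and independently over the bit positions, which real fingerprint hashing only approximates. The function names are made up for illustration.

```python
import math

def expected_occupied(n_env, n_bits):
    """Expected number of occupied bit positions when n_env distinct
    environments are hashed uniformly at random into n_bits positions
    (ASSUMPTION: uniform, independent hashing)."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_env)

def estimate_environments(n_occupied, n_bits):
    """Invert expected_occupied: how many distinct environments would,
    on average, fill n_occupied of n_bits positions."""
    return math.log(1.0 - n_occupied / n_bits) / math.log(1.0 - 1.0 / n_bits)

n_bits = 10_000       # fingerprint length from the post
n_occupied = 6226     # unique on-bits in the "training set"
n_test_bits = 7       # on-bits of the test molecule OOOOO

# Estimated number of distinct environments behind the occupied bits.
n_env = estimate_environments(n_occupied, n_bits)

# Chance that one previously unseen environment lands on an occupied bit.
p_hit = n_occupied / n_bits

# Expected number of the test molecule's bits that coincide with
# "training set" bits even if none of its environments were seen before.
expected_hits = n_test_bits * p_hit

print(f"estimated distinct environments: {n_env:.0f}")
print(f"estimated environments sharing a bit: {n_env - n_occupied:.0f}")
print(f"P(new environment hits an occupied bit): {p_hit:.2f}")
print(f"expected coincidences out of {n_test_bits}: {expected_hits:.1f}")
```

Under that assumption, unused positions do not rule out collisions: a new environment lands on an already occupied bit with probability 6226/10000, so about 4.4 of the 7 OOOOO bits would be expected to coincide with "training set" bits by chance alone.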
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss