Hi Jan, I did a long(ish) post on collisions and fingerprints a while ago that I think might be helpful here: https://sourceforge.net/p/rdkit/mailman/message/36438523/
I won't repost the whole thing here, but maybe you can take a look to see if it helps explain what you're observing (and why it probably isn't so bad). If there's anything missing, please follow up!

A bit more on the machine learning side: I don't have a reference handy, but I've seen a number of ML talks/papers where people have mentioned along the way that using a shorter fingerprint actually improves the generalization performance of the models. I assume this is because it helps reduce overfitting, but I haven't really dug into this myself (yet?)

-greg

On Sat, Jan 9, 2021 at 11:17 AM Jan Halborg Jensen <jhjen...@chem.ku.dk> wrote:

> I am trying to relate the reliability of ML models trained using binary
> fingerprints to the presence of on-bits, i.e. comparing the on-bits in a
> molecule in the test set to the on-bits in the training set. But I am
> getting some strange results.
>
> The code is here, so I will just summarise.
> https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing
>
> I pick 1000 random molecules from ZINC as my "training set" and compute
> ECFP4 fingerprints using nBits=10_000. There are 6226 unique on-bits. I use
> nBits=10_000 to try to avoid collisions.
>
> Then I compute the on-bits for a molecule ("test set") that is very
> different from any in the "training set" (OOOOO) and compare them to the
> 6226 unique on-bits from ZINC. Of the 7 on-bits for OOOOO shown here, five
> are found among the 6226 "training set" on-bits: 656, 2311, 4453, 4487, 8550
>
> However, 656, 4453, and 8550 correspond to different fragments in the
> "training set".
>
> The only reason I can think of is bit collisions in the hashing, but there
> are 10000-6226 = 3774 unused positions.
>
> Is there any other explanation? If not, what does that say about using bit
> vectors (especially the usual nBits = 2048) as descriptors?
>
> Any insight is greatly appreciated.
>
> Best regards, Jan
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
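
On the "3774 unused positions" point in the quoted message: free positions do not rule out collisions. If each distinct fragment is hashed to an effectively random position, two fragments can land on the same bit long before the vector fills up (the birthday problem). The following is a minimal back-of-the-envelope sketch, assuming an idealized hash that assigns each distinct fragment to one of nBits positions uniformly and independently. That is an assumption made for illustration, not a description of RDKit's actual hashing, and the function names are hypothetical.

```python
import math


def expected_occupied_bits(n_bits: int, n_features: int) -> float:
    """Expected number of set bits after hashing n_features distinct
    features into n_bits positions (idealized uniform hash)."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)


def estimate_distinct_features(n_bits: int, n_occupied: int) -> float:
    """Invert the formula above: roughly how many distinct features
    would be needed to leave n_occupied bits set."""
    if not 0 <= n_occupied < n_bits:
        raise ValueError("n_occupied must satisfy 0 <= n_occupied < n_bits")
    return math.log(1.0 - n_occupied / n_bits) / math.log(1.0 - 1.0 / n_bits)


# Numbers taken from the quoted message: 6226 unique on-bits at nBits=10_000.
n_bits = 10_000
n_occupied = 6226

n_features = estimate_distinct_features(n_bits, n_occupied)
print(f"estimated distinct fragments: {n_features:.0f}")
print(f"fragments sharing a bit with another: {n_features - n_occupied:.0f}")

# The same number of fragments folded into the usual 2048 bits.
print(f"expected on-bits at 2048: {expected_occupied_bits(2048, round(n_features)):.0f}")
```

Under this idealized model, 6226 occupied positions out of 10,000 would imply well over 6226 distinct fragments in the training set, so many bits would be shared by more than one fragment even though thousands of positions remain empty. The actual fragment-to-bit mapping can be checked directly in RDKit via the bit-info output of the Morgan fingerprint functions.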