Hi Jan,

I did a long(ish) post on collisions and fingerprints a while ago that I
think might be helpful here:
https://sourceforge.net/p/rdkit/mailman/message/36438523/

I won't repost the whole thing here, but maybe you can take a look to see
if it helps explain what you're observing (and why it probably isn't so
bad). If there's anything missing, please follow up!
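
As a rough back-of-the-envelope check on the numbers in your message, here
is a minimal sketch of the occupancy arithmetic. It assumes an idealized
hash that spreads distinct features uniformly over the bits, which is a
simplification and not a statement about RDKit's actual hashing. The
function names are hypothetical and only there for illustration.

```python
import math


def expected_on_bits(n_features, n_bits):
    """Expected number of distinct bits set when n_features distinct
    features are hashed uniformly at random into n_bits positions
    (idealized assumption, not RDKit's real hash)."""
    return n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)


def implied_features(on_bits, n_bits):
    """Invert expected_on_bits: how many distinct features would, on
    average, produce on_bits occupied positions out of n_bits."""
    return math.log(1.0 - on_bits / n_bits) / math.log(1.0 - 1.0 / n_bits)


n_bits = 10_000   # fingerprint length from the original message
on_bits = 6226    # unique on-bits reported for the 1000 molecules

n_features = implied_features(on_bits, n_bits)
print(f"implied distinct features: {n_features:.0f}")
print(f"fraction of bits occupied: {on_bits / n_bits:.2f}")

# Chance that a single new feature lands on an already occupied bit,
# and the expected number of such hits for a molecule with 7 on-bits.
p_hit = on_bits / n_bits
print(f"expected chance matches for 7 bits: {7 * p_hit:.1f}")
```

Under that uniformity assumption, unused positions do not rule out
collisions: a new feature lands on an already occupied bit with
probability on_bits / n_bits, regardless of how many bits are still free.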

A bit more on the machine learning side: I don't have a reference handy,
but I've seen a number of ML talks/papers where people have mentioned along
the way that using a shorter fingerprint actually improves generalization
performance of the models. I assume this is because it helps reduce
overfitting, but I haven't really dug into this myself (yet?)

-greg

On Sat, Jan 9, 2021 at 11:17 AM Jan Halborg Jensen <jhjen...@chem.ku.dk>
wrote:

> I am trying to relate the reliability of ML models trained using binary
> fingerprints to the presence of on-bits, i.e. comparing the on-bits in a
> molecule in the test set to the on-bits in the training set. But I am
> getting some strange results.
>
> The code is here so I will just summarise.
> https://colab.research.google.com/drive/19uYGmnsL8hM0JWFTMd1dfQLkHA4JqFik?usp=sharing
>
>
> I pick 1000 random molecules from ZINC as my "training set" and compute
> ECFP4 fingerprints using nBits=10_000. There are 6226 unique on-bits. I use
> nBits=10_000 to try to avoid collisions.
>
> Then I compute the on-bits for a molecule ("test set") that is very
> different from any in the "training set" (OOOOO) and compare them to the
> 6226 unique on-bits from ZINC. Of the 7 on-bits for OOOOO shown here, five
> are found among the 6226 "training set" on-bits: 656, 2311, 4453, 4487, 8550
>
> However, 656, 4453, and 8550 correspond to different fragments in the
> "training set".
>
> The only reason I can think of is bit-collisions in the hashing, but there
> are 10000-6226 = 3774 unused positions.
>
> Is there any other explanation? If not, what does that say about using bit
> vectors (especially the usual nBits = 2048) as descriptors?
>
> Any insight is greatly appreciated.
>
> Best regards, Jan
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>