Re: [Rdkit-discuss] use cases for weighted sampling of a compound library

Rocco Moretti Sun, 11 Dec 2022 12:01:07 -0800

The use case for this sort of thing which immediately springs to mind would
be decoy selection. That is, you have a known set of "positives" and want
to find a set of "negatives"/"background" which match those compounds in
some set of properties. DUD-E <http://dude.docking.org/> is probably the
most well-known example of this, but it's an approach which has been tried
numerous times on both a formal and an ad hoc basis.
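
To make the matched-property idea concrete, here is a minimal Python sketch.
It is not DUD-E's actual protocol: the function name, the property names
("mw", "logp"), the tolerances and the toy values are all made up for
illustration, and it assumes the properties have already been computed for
every compound (e.g. with RDKit descriptors). It only does the "match on
properties, then pick at random" step.

```python
import random

def pick_matched_decoys(actives, pool, tolerances, n_per_active=2, seed=0):
    """For each active, randomly pick pool compounds whose properties all
    lie within the given tolerance of the active's values.

    actives, pool: dict of compound id -> dict of property name -> value
    tolerances:    dict of property name -> max allowed absolute difference
    Returns a dict of active id -> list of chosen decoy ids.
    (Hypothetical helper, written for this example only.)
    """
    rng = random.Random(seed)
    used = set()   # each pool compound serves as a decoy at most once
    chosen = {}
    for active_id, props in actives.items():
        candidates = [
            cid for cid, cprops in pool.items()
            if cid not in used
            and all(abs(cprops[name] - props[name]) <= tol
                    for name, tol in tolerances.items())
        ]
        # Take fewer than requested if the pool has too few matches.
        picks = rng.sample(candidates, min(n_per_active, len(candidates)))
        used.update(picks)
        chosen[active_id] = picks
    return chosen

# Toy, made-up values (molecular weight, logP) for illustration only.
actives = {"act1": {"mw": 300.0, "logp": 2.0}}
pool = {
    "p1": {"mw": 305.0, "logp": 2.2},
    "p2": {"mw": 295.0, "logp": 1.9},
    "p3": {"mw": 450.0, "logp": 2.0},   # outside the mw tolerance
    "p4": {"mw": 301.0, "logp": 4.5},   # outside the logp tolerance
}
decoys = pick_matched_decoys(actives, pool, {"mw": 25.0, "logp": 0.5})
```

Here only p1 and p2 fall inside both tolerances, so they are the only
possible decoys for act1. Real decoy sets such as DUD-E additionally require
the decoys to be topologically dissimilar to the actives, which this sketch
leaves out.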

These days, where you'll see people attempting it is in machine
learning, to try to get around the "positive/unlabeled" nature of most
small molecule datasets. That is, it's often easy to get sets of "positive"
compounds from the literature etc., but getting a set of known
negative compounds is sometimes difficult. People attempt to find synthetic
negatives by using matched-property decoy selection from an external
compound set. However, the literature on this is ... less than flattering.
It turns out that even if you're careful, your negative selection method
can still be biased and the ML method can pick up on this (see, e.g.,
https://doi.org/10.1021/acs.jcim.8b00712) -- this is similar to the tales
of image recognition software which uses the presence of grass to tell if
something is a cow or not, or which fails because all the pictures of one
kind of tank were taken on a cloudy day. ML is very good at picking up such
small differences, even if you don't know what those actually are.

On Sun, Dec 11, 2022 at 11:25 AM Christopher Mayer-Bacon <cmaye...@umbc.edu>
wrote:

> Hello all,
>
> I’m starting a project that explores the sampling of a large compound
> library.  My question is not so much about how to do something, but rather
> the specific use cases for weighted sampling from a compound library.
>
> Given a large compound library and a smaller, reference library, I want to
> take random samples from the large library such that the samples resemble
> the reference library in some way.  At the moment I’m focused on element
> composition (% of carbon atoms, % of oxygen atoms, etc.), but I’m open to
> using other features in the future.
>
> I have an idea of how to perform this sampling; my question for this
> community concerns a possible use case.  What would be the benefit of
> sampling from a compound library such that the samples resemble another
> library in some way?  I can think of a use case for my specific research
> niche (adaptive properties of the canonical amino acid alphabet), but I
> can’t think of another potential use case.  I know the RDKit community has
> a wide variety of backgrounds and expertise, hence why I wanted to pose
> this question to you all.
>
> -Chris
>
> --
> -Christopher Mayer-Bacon (*he/him/his*)
> PhD student
> Department of Biological Sciences
> University of Maryland, Baltimore County
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
