There have been some comparisons between different fingerprints, which I'm
sure can be found via Google, but I haven't seen any on feature selection.
If you're looking for dimensionality reduction, anecdotally I've noticed
that hashing bit vectors down to size has no benefit over traditional
folding, and tf-idf has no benefit either (see below for the hyperparameters
I used; the baseline was standard 512-bit Morgan bit vectors). I believe
Morgan fingerprint bits have already been hashed, so the bit collisions
introduced by folding should be evenly spread.
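
For reference, folding can be sketched in plain Python as below. This is
only an illustration of the idea, not RDKit's implementation: the function
name fold_bits and its arguments are made up for the example, and it assumes
a fingerprint is given as a list of on-bit indices and that folding maps
each index to its remainder modulo the new size.

```python
def fold_bits(on_bits, n_bits, fold_factor):
    """Fold a bit vector, given as on-bit indices, by an integer factor.

    Hypothetical helper for illustration: every index is mapped to
    index % (n_bits // fold_factor), so bits that land on the same
    position collide and are merged (logical OR).
    """
    if n_bits % fold_factor != 0:
        raise ValueError("n_bits must be divisible by fold_factor")
    folded_size = n_bits // fold_factor
    # A set merges collisions; sorting gives a stable, readable result.
    return sorted({bit % folded_size for bit in on_bits})


# Fold a 2048-bit vector down to 512 bits. Bits 1 and 513 collide.
folded = fold_bits([1, 513, 700, 2047], n_bits=2048, fold_factor=4)
print(folded)  # [1, 188, 511]
```

If the original indices are already well mixed by a hash, collisions like
the one between bits 1 and 513 should land roughly uniformly over the
folded vector.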
The 'performance' in this context is virtual screening for protein hits, so
this may not apply to other tasks. My intuition is that there are a large
number of possible targets so it's probably not beneficial to reduce the
dimensionality below 256 dimensions.
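tf-idf parameters I used:

from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.feature_extraction.text import TfidfVectorizer

# smileses: a list of SMILES strings.
# Treat each molecule as a "document" whose "words" are its Morgan bit IDs.
fps_array = list()
for smi in smileses:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprint(mol, 2)  # unfolded, radius 2
    fp_values = fp.GetNonzeroElements()
    stringerprint = list()
    for key, item in fp_values.items():
        stringerprint.append(str(key))
    fps_array.append(stringerprint)

corpus = list()
for fp in fps_array:
    corpus.append(' '.join(fp))

# Ignore bits present in more than 40% or fewer than 3% of molecules.
vectorizer = TfidfVectorizer(max_df=0.4, min_df=0.03)
X = vectorizer.fit_transform(corpus)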
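
The max_df and min_df arguments above are themselves a simple form of
feature selection: bits that occur in too many or too few molecules are
dropped before weighting. A minimal sketch of that filter in plain Python,
assuming each molecule is a list of bit-ID strings; the function name
filter_by_document_frequency is hypothetical, and the thresholds are meant
to mirror how scikit-learn treats float values (as fractions of documents).

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.03, max_df=0.4):
    """Keep only tokens whose document frequency lies in [min_df, max_df].

    docs is a list of token lists (one per molecule). min_df and max_df
    are fractions of the number of documents. Returns the filtered
    documents and the set of kept tokens.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        # Count each token at most once per document.
        doc_freq.update(set(tokens))
    kept = {
        token
        for token, count in doc_freq.items()
        if min_df * n_docs <= count <= max_df * n_docs
    }
    filtered = [[t for t in tokens if t in kept] for tokens in docs]
    return filtered, kept


# Toy data: bit "1" is in every molecule, "3", "4" and "5" in only one each.
toy = [["1", "2", "3"], ["1", "2"], ["1", "4"], ["1", "5"], ["1", "2"]]
filtered, kept = filter_by_document_frequency(toy, min_df=0.4, max_df=0.8)
print(kept)  # {'2'}
```

Under the same assumptions, a "drop bits with fewer than 5% hits" rule
would correspond to min_df=0.05 with max_df=1.0.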
On Tue, Jun 18, 2019 at 6:42 AM Mario Lovrić <[email protected]>
wrote:
> Dear all,
>
> Is there any paper discussing some sort of pre-selection/feature selection
> with fingerprints?
> Any rule of thumb? E.g. don't keep fingerprints if less than 5% hits?
>
>
> Thanks
>
> --
> Mario Lovrić
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>