[Rdkit-discuss] Clustering

Tristan Camilleri Sun, 01 May 2022 06:23:05 -0700

Hi,

I am attempting to cluster a database of circa 4M small molecules and I
have hit several snags.
Using BulkTanimoto is not possible because of the resources it requires. I
am now working with fpsim2 and chemfp to get a sparse distance matrix.
However, I am finding it very challenging to identify an appropriate
clustering algorithm. I have considered both k-medoids and DBSCAN. Each has
its own limitation: k-medoids requires stating the number of clusters in
advance, and DBSCAN does not yield centroids.
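
To make the DBSCAN limitation concrete: one possible workaround is to pick
a medoid for each cluster after the fact, i.e. the member with the smallest
summed distance to the other members. The following is only a minimal
sketch under stated assumptions, not a tested solution at the 4M scale. The
function name cluster_medoids, the dict-of-pairs layout of the sparse
matrix, and the choice to treat missing pairs as distance 1.0 are all
hypothetical; the label convention (-1 for noise) follows scikit-learn's
DBSCAN.

```python
from collections import defaultdict


def cluster_medoids(labels, sparse_dist, missing=1.0):
    """Return {cluster_label: medoid_index} for DBSCAN-style labels.

    labels      -- one cluster label per molecule, -1 meaning noise
    sparse_dist -- {(i, j): distance} with i < j; pairs that are absent
                   are assumed to be beyond the similarity threshold
    missing     -- distance assumed for absent pairs (1.0 is the largest
                   possible Tanimoto distance)
    """
    # Group molecule indices by cluster, skipping noise points.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)

    medoids = {}
    for label, idxs in members.items():
        best_idx, best_total = None, float("inf")
        for i in idxs:
            # Sum of distances from i to every other cluster member.
            total = 0.0
            for j in idxs:
                if i == j:
                    continue
                key = (i, j) if i < j else (j, i)
                total += sparse_dist.get(key, missing)
            # Strict comparison: on ties the earliest index wins.
            if total < best_total:
                best_idx, best_total = i, total
        medoids[label] = best_idx
    return medoids


# Toy data: six molecules, two clusters and one noise point.
labels = [0, 0, 0, 1, 1, -1]
sparse_dist = {(0, 1): 0.2, (1, 2): 0.3, (0, 2): 0.6, (3, 4): 0.1}
medoids = cluster_medoids(labels, sparse_dist)
print(medoids)
```

Note that this is quadratic in the size of each cluster, so very large
clusters would need sampling or some other shortcut.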


I was wondering whether there is an implementation of the stochastic
clustering analysis described in
https://doi.org/10.1021/ci970056l .

Any suggestions on the best method for clustering large datasets, with code
suggestions, would be greatly appreciated. I am new to the subject and
would appreciate any help.

Regards,
Tristan

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
