Re: [Rdkit-discuss] Clustering

Peter S. Shenkin Mon, 12 Jun 2017 07:57:30 -0700

" A clustering algorithm, that does not require specifying the number
of classes upfront (so not K-means)."


A general approach to O(N) hierarchical clustering is:

1. Pick a random sqrt(N) structures.
2. Do full hierarchical O(N^2) clustering on these.
3. Select your favored clustering level to define clusters, and store the
centroid (or most representative member) of each.
4. For all N structures, associate each with the cluster whose centroid (or
most representative member) is closest to N.

I've never tried this, but I've heard it suggested at talks, which were
not, however, about molecular clustering; but the method should be general.

Step 3 gives you some control.

-P.

On Mon, Jun 12, 2017 at 10:06 AM, Michał Nowotka <[email protected]> wrote:

> Hi,
>
> Thanks for all the answers, especially those pointing to code
> examples, very useful.
> I should be more specific when asking about clustering >2M compounds.
>
> An example I would like to see would use:
>
> 1. A clustering algorithm, that does not require specifying the number
> of classes upfront (so not K-means).
> 2. An algorithm that is a bit more sophisticated than Taylor-Butina
> 3. Preferably one from pyclustering (NOT from scipy.cluster, sorry for
> mistake)
>
> In those, somewhat more sophisticated algorithms, running PCA will not
> help. You can try to cluster >2M points on 2D surface and you will
> find out that this is not a trivial task.
>
> That being said, I don't expect any amazing results when doing
> compound clustering using those algorithms. And I agree that
> clustering a random sample can give similar results. This question is
> more out of curiosity.
>
> Michał
>
> On Sun, Jun 11, 2017 at 7:58 PM, Samo Turk <[email protected]> wrote:
> > Hi All,
> >
> > I have to admit I was commenting about PCA->k-means without actually
> trying.
> > Out of curiosity I implemented it here:
> > https://github.com/samoturk/cheminf-notebooks/tree/master/
> Python#pca-k-meanspy
> >
> > It can process 4M compounds in ~60 minutes on desktop i5 and it should
> work
> > with 16GB or RAM. Clusters that come out make (some) sense but in this
> > regard Butina is better.
> >
> > PS. DataWarrior can easily load the results (it takes 30 min) but then it
> > works smoothly.
> >
> > Cheers,
> > Samo
> >
> > On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal <[email protected]> wrote:
> >>
> >> Hello all ,
> >>
> >> How about doing some dimension reduction using  pca or Tsne and then run
> >> clustering using some selected top components like top 20 and I think
> then
> >> the clustering would be fast .
> >>
> >> Thanks
> >> Abhik
> >>
> >> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <
> [email protected]>
> >> wrote:
> >>>
> >>> Hi,
> >>> I have used this algorithm for many years clustering sets of several
> >>> millions of compounds.  Indeed, I am old enough to know it as the
> Taylor
> >>> algorithm.  It is slow but reliable.  A crucial setting is the
> similarity
> >>> threshold for the clusters, which dictates the size of the neighbour
> lists
> >>> and hence the amount of RAM required.  It also, of course, determines
> the
> >>> quality of the clusters.  My implementation is at
> >>> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number
> of
> >>> programs of relevance, the one you want is called cluster.  I have just
> >>> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
> ascii
> >>> bitstrings, I don't have code for turning RDKit fingerprints into this
> >>> format, but I would imagine it's quite straightforward.  The program
> runs in
> >>> parallel using OpenMPI.  That's valuable for two reasons.  One is
> speed, but
> >>> the more important one is memory use.  If you can spread the slave
> processes
> >>> over several machines you can cluster much larger sets of molecules as
> you
> >>> are effectively expanding the RAM of the machine.  When I wrote the
> >>> original, 64MB was a lot of RAM, it is less of an issue these days but
> still
> >>> matters if clustering millions of fingerprints.  Note that the program
> >>> cluster doesn't ever store the distance matrix, just the lists of
> neighbours
> >>> for each molecule within the threshold.  This reduces the memory
> footprint
> >>> substantially if you have a tight-enough cluster threshold.
> >>> HTH,
> >>> Dave
> >>>
> >>>
> >>>
> >>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <[email protected]>
> >>> wrote:
> >>>>
> >>>> Hi Michal,
> >>>>
> >>>> I have done this a couple of times for compound sets up to 10M+ using
> a
> >>>> simplified variant of the Taylor-Butina algorithm. The overall run
> time
> >>>> was in the range of hours to a few days (which could probably be
> >>>> optimized, but was fast enough for me).
> >>>>
> >>>> As you correctly mentioned, getting the (sparse) similarity matrix is
> >>>> fairly simple (and can be done in parallel on a cluster).
> Unfortunately,
> >>>> this matrix gets very large (even the sparse version). Most clustering
> >>>> algorithms require random access to the matrix, so you have to keep it
> >>>> in main memory (which then has to be huge) or calculate it on-the-fly
> >>>> (takes forever).
> >>>>
> >>>> My implementation (in C++, not sure if I can share it) assumes that
> the
> >>>> similarity matrix has been pre-calculated and is stored in one (or
> >>>> multiple) files. It reads these files sequentially and whenever a
> >>>> compound pair with a similarity beyond the threshold is found, it
> checks
> >>>> whether one of the cpds. is already a centroid (in which case the
> other
> >>>> is assigned to it). Otherwise, one of the compounds is randomly chosen
> >>>> as centroid and the other is assigned to it.
> >>>>
> >>>> This procedure is highly order-dependent and thus not optimal, but has
> >>>> to read the whole similarity matrix only once and has limited memory
> >>>> consumption (you only need to keep a list of centroids). If you still
> >>>> run into memory issues, you can start by clustering with a high
> >>>> similarity threshold and then re-cluster centroids and singletons on a
> >>>> lower threshold level.
> >>>>
> >>>> I also played around with DBSCAN for large compound databases, but (as
> >>>> previously mentioned by Samo) found it difficult to find the right
> >>>> parameters and ended up with a single huge cluster covering 90 percent
> >>>> of the database in many cases.
> >>>>
> >>>> Hope this helps,
> >>>> Nils
> >>>>
> >>>> Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
> >>>> > Is there anyone who actually done this: clustered >2M compounds
> using
> >>>> > any well-known clustering algorithm and is willing to share a code
> and
> >>>> > some performance statistics?
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------
> ------------------
> >>>> Check out the vibrant tech community on one of the world's most
> >>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> >>>> _______________________________________________
> >>>> Rdkit-discuss mailing list
> >>>> [email protected]
> >>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> David Cosgrove
> >>> Freelance computational chemistry and chemoinformatics developer
> >>> http://cozchemix.co.uk
> >>>
> >>>
> >>> ------------------------------------------------------------
> ------------------
> >>> Check out the vibrant tech community on one of the world's most
> >>> engaging tech sites, Slashdot.org!
> >>> http://sdm.link/slashdot____________________________________
> ___________
> >>> Rdkit-discuss mailing list
> >>> [email protected]
> >>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >>
> >> --
> >>
> >> Cheers,
> >> Abhik Seal  Ph.D. (Cheminformatics)
> >>
> >>
> >>
> >> ------------------------------------------------------------
> ------------------
> >> Check out the vibrant tech community on one of the world's most
> >> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> >> _______________________________________________
> >> Rdkit-discuss mailing list
> >> [email protected]
> >> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >>
> >
> >
> > ------------------------------------------------------------
> ------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> > _______________________________________________
> > Rdkit-discuss mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> >
>
> ------------------------------------------------------------
> ------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Clustering

Reply via email to