Re: [Rdkit-discuss] Clustering

Michał Nowotka Mon, 12 Jun 2017 07:08:15 -0700

Hi,

Thanks for all the answers, especially those pointing to code
examples, very useful.
I should be more specific when asking about clustering >2M compounds.


An example I would like to see would use:

1. A clustering algorithm, that does not require specifying the number
of classes upfront (so not K-means).
2. An algorithm that is a bit more sophisticated than Taylor-Butina
3. Preferably one from pyclustering (NOT from scipy.cluster, sorry for mistake)

In those, somewhat more sophisticated algorithms, running PCA will not
help. You can try to cluster >2M points on 2D surface and you will
find out that this is not a trivial task.

That being said, I don't expect any amazing results when doing
compound clustering using those algorithms. And I agree that
clustering a random sample can give similar results. This question is
more out of curiosity.

Michał

On Sun, Jun 11, 2017 at 7:58 PM, Samo Turk <[email protected]> wrote:
> Hi All,
>
> I have to admit I was commenting about PCA->k-means without actually trying.
> Out of curiosity I implemented it here:
> https://github.com/samoturk/cheminf-notebooks/tree/master/Python#pca-k-meanspy
>
> It can process 4M compounds in ~60 minutes on desktop i5 and it should work
> with 16GB or RAM. Clusters that come out make (some) sense but in this
> regard Butina is better.
>
> PS. DataWarrior can easily load the results (it takes 30 min) but then it
> works smoothly.
>
> Cheers,
> Samo
>
> On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal <[email protected]> wrote:
>>
>> Hello all ,
>>
>> How about doing some dimension reduction using  pca or Tsne and then run
>> clustering using some selected top components like top 20 and I think then
>> the clustering would be fast .
>>
>> Thanks
>> Abhik
>>
>> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <[email protected]>
>> wrote:
>>>
>>> Hi,
>>> I have used this algorithm for many years clustering sets of several
>>> millions of compounds.  Indeed, I am old enough to know it as the Taylor
>>> algorithm.  It is slow but reliable.  A crucial setting is the similarity
>>> threshold for the clusters, which dictates the size of the neighbour lists
>>> and hence the amount of RAM required.  It also, of course, determines the
>>> quality of the clusters.  My implementation is at
>>> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
>>> programs of relevance, the one you want is called cluster.  I have just
>>> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as ascii
>>> bitstrings, I don't have code for turning RDKit fingerprints into this
>>> format, but I would imagine it's quite straightforward.  The program runs in
>>> parallel using OpenMPI.  That's valuable for two reasons.  One is speed, but
>>> the more important one is memory use.  If you can spread the slave processes
>>> over several machines you can cluster much larger sets of molecules as you
>>> are effectively expanding the RAM of the machine.  When I wrote the
>>> original, 64MB was a lot of RAM, it is less of an issue these days but still
>>> matters if clustering millions of fingerprints.  Note that the program
>>> cluster doesn't ever store the distance matrix, just the lists of neighbours
>>> for each molecule within the threshold.  This reduces the memory footprint
>>> substantially if you have a tight-enough cluster threshold.
>>> HTH,
>>> Dave
>>>
>>>
>>>
>>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp <[email protected]>
>>> wrote:
>>>>
>>>> Hi Michal,
>>>>
>>>> I have done this a couple of times for compound sets up to 10M+ using a
>>>> simplified variant of the Taylor-Butina algorithm. The overall run time
>>>> was in the range of hours to a few days (which could probably be
>>>> optimized, but was fast enough for me).
>>>>
>>>> As you correctly mentioned, getting the (sparse) similarity matrix is
>>>> fairly simple (and can be done in parallel on a cluster). Unfortunately,
>>>> this matrix gets very large (even the sparse version). Most clustering
>>>> algorithms require random access to the matrix, so you have to keep it
>>>> in main memory (which then has to be huge) or calculate it on-the-fly
>>>> (takes forever).
>>>>
>>>> My implementation (in C++, not sure if I can share it) assumes that the
>>>> similarity matrix has been pre-calculated and is stored in one (or
>>>> multiple) files. It reads these files sequentially and whenever a
>>>> compound pair with a similarity beyond the threshold is found, it checks
>>>> whether one of the cpds. is already a centroid (in which case the other
>>>> is assigned to it). Otherwise, one of the compounds is randomly chosen
>>>> as centroid and the other is assigned to it.
>>>>
>>>> This procedure is highly order-dependent and thus not optimal, but has
>>>> to read the whole similarity matrix only once and has limited memory
>>>> consumption (you only need to keep a list of centroids). If you still
>>>> run into memory issues, you can start by clustering with a high
>>>> similarity threshold and then re-cluster centroids and singletons on a
>>>> lower threshold level.
>>>>
>>>> I also played around with DBSCAN for large compound databases, but (as
>>>> previously mentioned by Samo) found it difficult to find the right
>>>> parameters and ended up with a single huge cluster covering 90 percent
>>>> of the database in many cases.
>>>>
>>>> Hope this helps,
>>>> Nils
>>>>
>>>> Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
>>>> > Is there anyone who actually done this: clustered >2M compounds using
>>>> > any well-known clustering algorithm and is willing to share a code and
>>>> > some performance statistics?
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>>
>>>
>>> --
>>> David Cosgrove
>>> Freelance computational chemistry and chemoinformatics developer
>>> http://cozchemix.co.uk
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org!
>>> http://sdm.link/slashdot_______________________________________________
>>> Rdkit-discuss mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>> --
>>
>> Cheers,
>> Abhik Seal  Ph.D. (Cheminformatics)
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> _______________________________________________
>> Rdkit-discuss mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] Clustering

Reply via email to