Hi Chris,
As far as I know, Butina's sphere exclusion algorithm is the fastest option
for very large datasets. But with 4 million compounds, using RDKit directly
can result in very long runs, even after parallelization. For that number of
molecules I think there are faster tools, such as chemfp (see for instance
https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering
).
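
In case it helps to see why the number of clusters is an outcome rather than
an input, here is a minimal pure-Python sketch of the Taylor-Butina idea. It
is not the RDKit or chemfp implementation: the function names, the use of
plain Python ints as bit fingerprints and the toy data are all my own
assumptions for illustration. The all-pairs loop is O(N^2), which is the part
that hurts at 4 million compounds.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bitsets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def butina_clusters(fps, threshold=0.7):
    """Return clusters as lists of indices, with the centroid first."""
    n = len(fps)
    # Neighbour lists: every pair at or above the similarity threshold.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit candidates by decreasing neighbour count (ties: lower index first).
    order = sorted(range(n), key=lambda i: (-len(neighbours[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # The centroid claims all of its neighbours that are still free.
        members = [i] + sorted(neighbours[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical toy fingerprints: two small groups and one singleton.
example_fps = [0b000001111, 0b000001110, 0b000000111,
               0b011110000, 0b011100000, 0b100000000]
clusters = butina_clusters(example_fps, threshold=0.7)
print(clusters)
```

The only knob is the similarity threshold; raising it gives more, tighter
clusters and lowering it gives fewer, looser ones.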
Cheers
Gonzalo
On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski <mac...@wojcikowski.pl>
wrote:
> Is there a big difference in the quality of the final dataset between
> K-means and random under-sampling of a big database (~20M)?
>
> ----
> Pozdrawiam, | Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2017-06-04 12:24 GMT+02:00 Samo Turk <samo.t...@gmail.com>:
>
>> Hi Chris,
>>
>> There are other options for clustering. According to this:
>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>> HDBSCAN and K-means scale well. HDBSCAN finds clusters based on density
>> and also allows for outliers, but it can be fiddly to find the right
>> parameters. As with Butina, you cannot specify the number of clusters.
>> If you want to specify the number of clusters, you can simply use
>> K-means. The high dimensionality of fingerprints might be a problem for
>> memory consumption. In this case you can use PCA to reduce the dimensions
>> to something manageable. To avoid memory issues with PCA and speed things
>> up, I would fit the model on a random 100k compounds and then just use the
>> transform method on the rest.
>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
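
The "fit on a sample, project the rest" step suggested above can be sketched
with plain NumPy as below. This is only an illustration under my own
assumptions: the matrix sizes, the number of kept dimensions and the chunk
size are made up, and the random matrix stands in for real fingerprints. In
scikit-learn the same two steps would be PCA.fit on the sample followed by
PCA.transform on the remaining compounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a fingerprint matrix: 2000 "compounds" x 64 bits (made-up sizes).
X = (rng.random((2000, 64)) < 0.2).astype(np.float64)

# 1) Fit on a random subset only: column means plus principal axes.
sample_idx = rng.choice(X.shape[0], size=500, replace=False)
sample = X[sample_idx]
mean = sample.mean(axis=0)
# Rows of vt are the principal axes of the centred sample.
_, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
components = vt[:10]  # keep the first 10 dimensions


def transform(block):
    """Project rows onto the axes fitted on the sample (no refitting)."""
    return (block - mean) @ components.T


# 2) Project the whole set in chunks so memory use stays bounded.
chunk = 256
reduced = np.vstack([transform(X[i:i + chunk])
                     for i in range(0, X.shape[0], chunk)])
print(reduced.shape)
```

The reduced matrix is what would then be handed to K-means.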
>>
>> Cheers,
>> Samo
>>
>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain <sw...@mac.com> wrote:
>>
>>> Hi,
>>>
>>> I want to do clustering on around 4 million structures.
>>>
>>> The RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>>>
>>> "For large sets of molecules (more than 1000-2000), it's most efficient
>>> to use the Butina clustering algorithm"
>>>
>>> However it is quite a step up from a few thousand to several million
>>> and I wondered if anyone had used this algorithm on larger data sets?
>>>
>>> As far as I can tell it is not possible to define the number of
>>> clusters, is this correct?
>>>
>>> Cheers,
>>>
>>> Chris
>>>
>>> ------------------------------------------------------------
>>> ------------------
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> _______________________________________________
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>
>