On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

>  Most AgglomerativeClustering algorithms have time complexity with N^2. Will 
> that be a problem?

You have to decide for yourself what counts as a problem. If you want to get 
it done in 1 minute with a threshold of 0.2, then you've got a problem. If 
you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at 
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering 
, took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 
time, so 10 times as many fingerprints should take about 100 times as long, 
which is only about an hour for 1 million fingerprints.

Best of course is to start with a smaller system first, see if it works, and 
only then try to scale up. Then you'll have experience of which methods are 
appropriate and what your time constraints are.
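
One way to start small is to time a toy run before scaling up. The following 
is a simplified pure-Python sketch of Taylor-Butina clustering, not chemfp's 
implementation. The names tanimoto and butina_cluster are hypothetical, the 
fingerprints are random 64-bit integers rather than real molecules, and the 
0.5 threshold is an arbitrary choice for illustration.

```python
import random
import time


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def butina_cluster(fps, threshold):
    """Return a list of (centroid_index, member_indices) pairs."""
    n = len(fps)
    neighbors = [[] for _ in range(n)]
    # The all-pairs comparison is the N^2 step.
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fps[i], fps[j]) >= threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    # Visit fingerprints with the most neighbors first; break ties by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # A new centroid claims every neighbor not already in a cluster.
        members = [j for j in neighbors[i] if j not in assigned]
        assigned.add(i)
        assigned.update(members)
        clusters.append((i, members))
    return clusters


random.seed(0)
fps = [random.getrandbits(64) for _ in range(500)]  # toy data only
start = time.perf_counter()
clusters = butina_cluster(fps, threshold=0.5)
elapsed = time.perf_counter() - start
print(f"{len(fps)} fingerprints, {len(clusters)} clusters, {elapsed:.2f} s")
```

Timing this at a few sizes (say 500, 1000, and 2000 fingerprints) gives a 
feel for how the N^2 step grows. The neighbor lists also get longer as the 
threshold gets lower, which is where the memory use comes from.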


                                Andrew


