On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
> I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

> Most AgglomerativeClustering algorithms have time complexity with N^2. Will
> that be a problem?

You have to decide for yourself what counts as a problem. If you want to get it done in 1 minute with a threshold of 0.2, then you've got a problem. If you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at
  http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering
took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 time, so 1 million fingerprints (10 times as many, hence roughly 100 times the work) should only take about an hour.

Best of course is to start with a smaller system first, see if it works, and only then try to scale up. Then you'll have experience of which methods are appropriate and what your time constraints are.

Andrew
[email protected]
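
P.S. The one-hour figure is plain quadratic extrapolation from the 35-second timing. Here is a minimal sketch of that arithmetic. The function name is made up for illustration, and it assumes the run time really does scale as N^2 with no memory or I/O effects, which may not hold on your machine.

```python
def estimate_quadratic_time(known_n, known_seconds, target_n):
    """Extrapolate a run time, assuming the cost grows as N^2.

    Hypothetical helper, not part of chemfp or RDKit.
    """
    scale = (target_n / known_n) ** 2  # 10x the fingerprints -> 100x the work
    return known_seconds * scale


# Timing quoted above: 35 seconds for 100,000 fingerprints.
seconds = estimate_quadratic_time(100_000, 35.0, 1_000_000)
print(f"{seconds:.0f} s, about {seconds / 60:.0f} minutes")
# prints "3500 s, about 58 minutes"
```

Time a small subset on your own hardware first, then plug that timing into the same formula to see whether the full data set fits your time budget.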

