Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote: Thanks, Andrew! Yes, I was also thinking about using scikit-learn. But I guess I need to use a sparse matrix data structure and define a function for connectivity. I hope the memory issue won't be a problem. Most AgglomerativeClustering algorithms

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki. Both Repeated Bisection clustering and K-means clustering need the number of clusters as input, right? Best, Jing On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri serit...@gmail.com wrote: Dear Jing, How about trying bayon?

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote: If I want to cluster more than 1M molecules by ECFP4, how could I do it? If I calculate the distance between every pair of molecules, the distance matrix will be too big. Does RDKit support any heuristic clustering algorithm without
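To put a number on "too big", here is a minimal back-of-the-envelope sketch. It assumes one 4-byte float per unordered pair of molecules (a condensed, upper-triangle matrix); the function name and the 4-byte value size are illustrative choices, not anything from RDKit or from the thread.

```python
def condensed_matrix_bytes(n_molecules, bytes_per_value=4):
    """Bytes needed to store one distance per unordered pair of molecules.

    Assumption: distances are stored as fixed-width values (4 bytes by
    default, i.e. float32) in a condensed upper-triangle layout.
    """
    n_pairs = n_molecules * (n_molecules - 1) // 2
    return n_pairs * bytes_per_value


n = 1_000_000
size = condensed_matrix_bytes(n)
print(f"{n:,} molecules -> {size / 1e12:.1f} TB")  # prints "1,000,000 molecules -> 2.0 TB"
```

Even with only half the matrix stored, a full pairwise table for 1M molecules is on the order of terabytes, which is why a sparse or thresholded representation comes up in the replies.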

[Rdkit-discuss] chirality perception

2015-08-23 Thread S.L. Chan
Dear colleagues, I have two conformers of a molecule. It looks like they have the same chirality (S, I believe), but RDKit perceives them as different. Is it a bug, or did I miss anything? Thank you. Ling mol = Chem.MolFromMolFile('input.sdf', removeHs=False)

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote: I hope the memory issue won't be a problem. That's up to you and your choice of threshold. Most AgglomerativeClustering algorithms have N^2 time complexity. Will that be a problem? You have to decide for yourself what counts as a problem.
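The point that memory depends on the choice of threshold can be illustrated with a small sketch. It assumes fingerprints are stored as plain Python integers used as bit sets (a hypothetical stand-in for ECFP4 bit vectors, not the RDKit API), and the function names are made up for this example. Only pairs at or above the similarity threshold are kept, so memory grows with the number of retained pairs rather than with N^2. The loop itself still compares every pair, so the time cost remains quadratic.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def sparse_neighbors(fps, threshold):
    """Return {i: {j: similarity}} for pairs with similarity >= threshold.

    All pairs are still compared (O(N^2) time), but only the pairs that
    pass the threshold are stored.
    """
    neighbors = defaultdict(dict)
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            sim = tanimoto(fps[i], fps[j])
            if sim >= threshold:
                neighbors[i][j] = sim
                neighbors[j][i] = sim
    return neighbors


# Toy fingerprints (hypothetical data): only the first two are similar.
fps = [0b1111, 0b1110, 0b0001, 0b11110000]
nbrs = sparse_neighbors(fps, threshold=0.7)
print(dict(nbrs))  # prints "{0: {1: 0.75}, 1: {0: 0.75}}"
```

A higher threshold keeps fewer pairs and therefore less memory, at the cost of leaving more molecules without any neighbors.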