Re: [Rdkit-discuss] Clustering - visualization?
Thanks, Curt! I'll give those a look. It gives me a very good reason to start digging into SciPy a bit more and to exploit the added functionality it brings.

Regarding my original question, and for anyone else who might be interested: I did indeed find an answer after a lot of code dredging. I found the Murtagh.ClusterData() function in RDKit and was able to generate clusters with it. The function returns a single-member list, that single member being a Cluster object. I can feed that object to ClusterVis.ClusterToImg() to get the dendrogram I wanted. Here's a short code snippet showing the pieces.

    from rdkit.ML.Cluster import Murtagh, ClusterVis, ClusterUtils
    ...
    # Build the cluster tree from precomputed distances using Ward's method.
    c_tree = Murtagh.ClusterData(dists, nfps, Murtagh.WARDS, isDistData=True)
    ...
    # Draw the dendrogram for the root Cluster object.
    ClusterVis.ClusterToImg(c_tree[0], size=(500, 500), fileName='test.png')
    ...

I can then break the cluster tree into subtrees:

    ...
    ClusterUtils.SplitIntoNClusters(c_tree[0], 5)
    ...

And I've written a short function to extract the individual structure memberships for each group:

    ...
    groups = ClusterUtils.SplitIntoNClusters(c_tree[0], 5)

    def GetGroupMembers(grp, memberlist=None):
        # Start a fresh list on each top-level call; a mutable default ([])
        # would be shared between calls and accumulate earlier results.
        if memberlist is None:
            memberlist = []
        for child in grp.GetChildren():
            if child.GetData() is None:
                # Internal node: descend into the subtree.
                GetGroupMembers(child, memberlist)
            else:
                # Leaf node: record the data it holds.
                memberlist.append(child.GetData())
        return memberlist

    print(GetGroupMembers(groups[0]))

On Sat, May 14, 2016 at 11:21 AM, Curt Fischer wrote:

> Hi Robert,
>
> For the number of molecules you are interested in, it's viable to use
> SciPy / NumPy clustering functions instead of RDKit's built-in C-linked
> functions. This approach will probably not be as fast as RDKit's built-in
> clustering functionality, and will probably not scale to tens of
> thousands of molecules as well as RDKit's functions, but if you use SciPy
> or NumPy in other types of technical computing, this approach may be more
> transparent, generalizable, and easier to use.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
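A recursive collector of the kind shown in this message is sensitive to how its accumulator list is created. The following is a minimal sketch of that point, not RDKit code: `Node` is a hypothetical stand-in that mimics only `GetChildren()` and `GetData()`, and the function names are made up for illustration. It also treats a group that is itself a leaf as a one-member group, which is an assumption about how a singleton group would be represented.

```python
class Node:
    """Hypothetical stand-in for a cluster-tree node (NOT RDKit's Cluster)."""

    def __init__(self, data=None, children=()):
        self._data = data
        self._children = list(children)

    def GetData(self):
        return self._data

    def GetChildren(self):
        return self._children


def members_shared(grp, memberlist=[]):
    """Collector with a mutable default: the same list is reused by every call."""
    for child in grp.GetChildren():
        if child.GetData() is None:
            members_shared(child, memberlist)
        else:
            memberlist.append(child.GetData())
    return memberlist


def members_fresh(grp, memberlist=None):
    """Collector that creates a new list for each top-level call."""
    if memberlist is None:
        memberlist = []
    if grp.GetData() is not None:
        # Assumption: a group that is itself a leaf counts as one member.
        memberlist.append(grp.GetData())
        return memberlist
    for child in grp.GetChildren():
        members_fresh(child, memberlist)
    return memberlist


# Two made-up groups: leaves carry integer indices, internal nodes carry None.
group_a = Node(children=[Node(data=0), Node(children=[Node(data=1), Node(data=2)])])
group_b = Node(children=[Node(data=3), Node(data=4)])

shared_a = list(members_shared(group_a))  # copied so later calls cannot alter it
shared_b = list(members_shared(group_b))  # also contains group_a's members
fresh_a = members_fresh(group_a)
fresh_b = members_fresh(group_b)

print(shared_a, shared_b)
print(fresh_a, fresh_b)
```

With the shared default, the second call returns the members of both groups; with the `None` default, each call returns only the members of the group it was given.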
Re: [Rdkit-discuss] Clustering - visualization?
Hi Robert,

For the number of molecules you are interested in, it's viable to use SciPy / NumPy clustering functions instead of RDKit's built-in C-linked functions. This approach will probably not be as fast as RDKit's built-in clustering functionality, and will probably not scale to tens of thousands of molecules as well as RDKit's functions, but if you use SciPy or NumPy in other types of technical computing, this approach may be more transparent, generalizable, and easier to use.

I have an example Jupyter notebook on GitHub that describes what I mean; here are the GitHub and nbviewer links:

https://github.com/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb

https://nbviewer.jupyter.org/github/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb

Here are some of the most important parts of the code for generating a dendrogram.

1. Generate a numpy fingerprint matrix from a list of rdkit Molecules. *np* is *numpy*, *Chem* is *rdkit.Chem*, and *rdmolops* is *rdkit.Chem.rdmolops*.

    mols = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        mols.append(mol)
    # np.vstack needs a sequence (here a list), not a generator.
    fingerprint_mat = np.vstack(
        [np.asarray(rdmolops.RDKFingerprint(mol, fpSize=2048), dtype='bool')
         for mol in mols])

2. Generate the distance matrix. *pdist* and *squareform* are from *scipy.spatial.distance*; *pd* is *pandas*.

    dist_mat = pdist(fingerprint_mat, 'jaccard')
    dist_df = pd.DataFrame(squareform(dist_mat),
                           index=smiles_list, columns=smiles_list)

For binary fingerprints, the Jaccard distance is equivalent to one minus the Tanimoto similarity.

3. Perform hierarchical clustering on the distance matrix and show the dendrogram (see the GitHub notebook for the plot). *hc* is *scipy.cluster.hierarchy*; *plt* is *matplotlib.pyplot*.

    z = hc.linkage(dist_mat)
    dendrogram = hc.dendrogram(z, labels=dist_df.columns.tolist(),
                               leaf_rotation=90)
    plt.show()

A helpful page for dendrograms using SciPy is this one:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

Good luck!

Curt

On Sat, May 14, 2016 at 9:11 AM, Robert DeLisle wrote:

> Next up is clustering...
>
> I've got about 350 structures to cluster and I've worked through the
> example code from the RDKit Cookbook (
> http://www.rdkit.org/docs/Cookbook.html#clustering-molecules). All seems
> well and good there, but I would like to see the dendrogram. I see that
> there is a ClusterVis module to generate images, PDF, and SVG, but all
> require a Cluster object as input. I can't find a description anywhere of
> how to acquire or build that object from the results of clustering.
>
> Any tips?
>
> -Kirk
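The relationship between the Jaccard distance and the Tanimoto similarity, and the step from a linkage matrix to flat cluster assignments, can be checked on a small example. This is a sketch under stated assumptions rather than code from the notebook: the 8-bit "fingerprints" are made up, `tanimoto` is a hypothetical helper written for this check, and average linkage with a two-cluster cut is chosen only for illustration.

```python
import numpy as np
from scipy.cluster import hierarchy as hc
from scipy.spatial.distance import pdist, squareform

# Made-up 8-bit "fingerprints" (not derived from real molecules).
fingerprint_mat = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 1],
], dtype=bool)


def tanimoto(a, b):
    """Hypothetical helper: bits set in both / bits set in either."""
    return np.sum(a & b) / np.sum(a | b)


# Condensed Jaccard distances, then the full square matrix.
dist_mat = pdist(fingerprint_mat, "jaccard")
square = squareform(dist_mat)

# Compare every pair against 1 - Tanimoto computed by hand.
n = fingerprint_mat.shape[0]
one_minus_tanimoto = np.array([
    [1.0 - tanimoto(fingerprint_mat[i], fingerprint_mat[j]) for j in range(n)]
    for i in range(n)
])
distances_agree = np.allclose(square, one_minus_tanimoto)

# Hierarchical clustering on the condensed matrix, cut into two flat clusters.
z = hc.linkage(dist_mat, method="average")
cluster_ids = hc.fcluster(z, t=2, criterion="maxclust")

print(distances_agree)
print(cluster_ids)
```

`fcluster` plays a role similar to splitting a tree into N subtrees: it returns one cluster label per input row, in the same order as the rows of the fingerprint matrix.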
[Rdkit-discuss] Clustering - visualization?
Next up is clustering...

I've got about 350 structures to cluster and I've worked through the example code from the RDKit Cookbook (http://www.rdkit.org/docs/Cookbook.html#clustering-molecules). All seems well and good there, but I would like to see the dendrogram. I see that there is a ClusterVis module to generate images, PDF, and SVG, but all require a Cluster object as input. I can't find a description anywhere of how to acquire or build that object from the results of clustering.

Any tips?

-Kirk
Re: [Rdkit-discuss] GetSubstructMatch vs MMFFOptimize
Dear Robert,

the reason for the failure is that MMFF uses its own aromaticity model (see http://www.rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules). Therefore, after calling

    AllChem.MMFFOptimizeMolecule(mols[0])

you will need to add the following call:

    AllChem.SanitizeMol(mols[0],
                        sanitizeOps=AllChem.SanitizeFlags.SANITIZE_KEKULIZE
                        | AllChem.SanitizeFlags.SANITIZE_SETAROMATICITY)

This will fix your problem.

Kind regards,
Paolo

On 05/14/2016 07:07 AM, Robert DeLisle wrote:

> RDKitters,
>
> I'm working on a project in which I want to align a collection of
> structures with their most similar structures and display the results in
> PyMOL. To accomplish this, I've built a Python script similar to the one
> attached here in which I start with pairs of structures, find the MCS of
> those structures, create a template based on the MCS and a 3D conformation
> of the structure of interest, and then generate a constrained conformation
> of a query structure. I tried to comment the attached code enough to lead
> you through the process.
>
> What I find is that quite often the ConstrainedEmbed() function fails
> with the error "molecule doesn't match the core", which seems very odd
> since the pairs for which it fails are very similar. The attached .png
> shows one such pair and their MCS.
>
> What I've found is that when I generate a 3D conformation for the first
> structure and optimize it with MMFF (MMFFOptimize), this often causes
> GetSubstructMatch to fail to find the MCS within the structure. If I use
> UFFOptimize instead, everything seems to work OK most of the time.
>
> In my code, I've noted where the error occurs and flanked it with some
> print statements to show what happens. Specifically, at line 36 I have the
> MMFFOptimize line, and at 37 the UFFOptimize line. I've also attached a
> set of structures for which MMFF fails.
>
> While using UFFOptimize produces great results, I'm curious why
> MMFFOptimize creates a problem, and whether this is a bug that should be
> fixed or just a glitch related to atom typing and other parameterizations
> that occur with MMFF.
>
> Thanks for any explanation or ideas.
>
> -Kirk
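The sequence described in this reply (embed, optimize with MMFF, restore the aromaticity flags, then match) can be sketched end to end as follows. The molecule and the aromatic core are made up for illustration and are not from the attached structures. Whether the MMFF call actually disturbs the flags may depend on the RDKit version, so the sketch only reports the match before and after re-sanitization and does not assume the intermediate state.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Made-up example molecule and aromatic core (not from the original attachment).
mol = Chem.AddHs(Chem.MolFromSmiles("c1ccccc1CCN"))
core = Chem.MolFromSmarts("c1ccccc1")

# Generate a 3D conformation; a fixed seed keeps the run reproducible.
AllChem.EmbedMolecule(mol, randomSeed=42)

# Optimize with MMFF. The MMFF atom typing uses its own aromaticity model.
AllChem.MMFFOptimizeMolecule(mol)
match_before = mol.GetSubstructMatch(core)

# Re-perceive aromaticity with RDKit's own model, as suggested in the reply.
Chem.SanitizeMol(
    mol,
    sanitizeOps=Chem.SanitizeFlags.SANITIZE_KEKULIZE
    | Chem.SanitizeFlags.SANITIZE_SETAROMATICITY,
)
match_after = mol.GetSubstructMatch(core)

print("match right after MMFF:      ", match_before)
print("match after re-sanitization: ", match_after)
```

Placing the re-sanitization immediately after the MMFF call, before any substructure search or constrained embedding, keeps the aromatic query atoms in the core consistent with the flags on the molecule.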