Re: [Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Robert DeLisle
Thanks, Curt!  I'll give those a look.  It'll give me a very good reason to
start digging into SciPy a bit more and exploit the added functionality
that will bring.

Regarding my original question and for anyone else that might be
interested...

I did indeed find an answer through a lot of code dredging.  I found the
Murtagh.ClusterData() function in RDKit, and was able to generate clusters
from that.  The function returns a single-member list, that single member
being a Cluster object.  I can feed that object to ClusterVis.ClusterToImg
to get the dendrogram I wanted.  Here's a short code snip showing the
pieces.

...
c_tree = Murtagh.ClusterData(dists, nfps, Murtagh.WARDS, isDistData=True)
...
rdkit.ML.Cluster.ClusterVis.ClusterToImg(c_tree[0], size=(500,500),
                                         fileName='test.png')
...
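
For readers reconstructing the elided parts: the sketch below assumes that `dists` is the flat lower-triangle list built as in the Cookbook clustering example (row 1, then row 2, and so on) and that `nfps` is the number of fingerprints. That layout is an assumption about the elided code, and the helper names are hypothetical. Only plain Python is used, so the indexing can be checked by hand.

```python
def condensed_lower_triangle(full):
    """Flatten a symmetric matrix row by row, keeping entries below the diagonal."""
    dists = []
    for i in range(1, len(full)):
        for j in range(i):
            dists.append(full[i][j])
    return dists


def lookup(dists, i, j):
    """Recover d(i, j) from the flat list (assumed layout, see the note above)."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i
    return dists[j * (j - 1) // 2 + i]


# Toy symmetric distance matrix for four items (made-up values).
full = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.6],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.6, 0.2, 0.0],
]
dists = condensed_lower_triangle(full)
nfps = len(full)
```

With `n` items the flat list should hold `n * (n - 1) // 2` values, which is a quick sanity check before clustering.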

I can then break the cluster tree into subtrees:

...
rdkit.ML.Cluster.ClusterUtils.SplitIntoNClusters(c_tree[0], 5)
...

And I've written a short function to extract out the individual structure
memberships for each group:

...

groups = ClusterUtils.SplitIntoNClusters(c_tree[0], 5)

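def GetGroupMembers(grp, memberlist=None):
    # Start a fresh list on each top-level call; a shared [] default
    # would keep accumulating members across calls.
    if memberlist is None:
        memberlist = []
    for child in grp.GetChildren():
        if child.GetData() is None:
            # No data on this node: descend into its children
            GetGroupMembers(child, memberlist)
        else:
            memberlist.append(child.GetData())
    return memberlist

print(GetGroupMembers(groups[0]))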
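
The traversal can be tried without RDKit installed. The sketch below uses a hypothetical `ToyNode` class that only mimics the two methods used above (`GetChildren()` and `GetData()`); it is not the RDKit Cluster class. The function creates a fresh list on each top-level call, since a shared `[]` default argument would accumulate members across calls.

```python
class ToyNode:
    """Hypothetical stand-in exposing GetChildren() and GetData()."""

    def __init__(self, data=None, children=()):
        self._data = data
        self._children = list(children)

    def GetChildren(self):
        return self._children

    def GetData(self):
        return self._data


def get_group_members(grp, memberlist=None):
    """Collect the data of every data-carrying node below grp."""
    if memberlist is None:
        memberlist = []
    for child in grp.GetChildren():
        if child.GetData() is None:
            # Internal node: recurse into its children
            get_group_members(child, memberlist)
        else:
            memberlist.append(child.GetData())
    return memberlist


# A small tree: one internal node holding items 0 and 3, plus item 7.
tree = ToyNode(children=[
    ToyNode(children=[ToyNode(data=0), ToyNode(data=3)]),
    ToyNode(data=7),
])
first = get_group_members(tree)
second = get_group_members(tree)
print(first)
```

Calling the function twice on the same tree gives the same list both times, which would not hold with a mutable default argument.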





Re: [Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Curt Fischer
Hi Robert,

For the number of molecules you are interested in, it's viable to use SciPy
/ NumPy clustering functions instead of rdkit's built-in C-linked
functions.  This approach will probably not be as fast as rdkit's built-in
clustering functionality, and will probably not scale to tens of
thousands of molecules as well as rdkit's functions, but if you use SciPy
or NumPy in other types of technical computing, this approach may be more
transparent, generalizable, and easier to use.

I have an example Jupyter notebook in GitHub that describes what I mean;
here are the GitHub and nbviewer links:

https://github.com/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb
https://nbviewer.jupyter.org/github/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb

Here are some of the most important parts of the code for generating a
dendrogram.

1. Generate a numpy fingerprint matrix from a list of rdkit Molecules.

mols = []
for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    mols.append(mol)
# np.vstack expects a sequence of arrays, so build a list, not a generator
fingerprint_mat = np.vstack([
    np.asarray(rdmolops.RDKFingerprint(mol, fpSize=2048), dtype='bool')
    for mol in mols])


2. Generate the distance matrix.  *pdist* and *squareform* are from
*scipy.spatial.distance*.

dist_mat = pdist(fingerprint_mat, 'jaccard')
dist_df = pd.DataFrame(squareform(dist_mat), index=smiles_list,
                       columns=smiles_list)

As far as I can tell, the Jaccard distance is equivalent to one minus the
Tanimoto similarity.
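
A small worked check of that equivalence for bit vectors, in plain Python. The two fingerprints are made up, the function names are hypothetical, and the Jaccard definition used here (positions where exactly one bit is set, divided by positions where at least one is set) is the usual one for boolean vectors.

```python
def tanimoto_similarity(a, b):
    """Bits set in both, divided by bits set in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either


def jaccard_distance(a, b):
    """Positions where exactly one bit is set, divided by bits set in either."""
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    either = sum(1 for x, y in zip(a, b) if x or y)
    return differ / either


# Two toy 6-bit fingerprints (neither is all zeros, so 'either' is nonzero).
fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]

sim = tanimoto_similarity(fp_a, fp_b)
dist = jaccard_distance(fp_a, fp_b)
print(sim, dist, sim + dist)
```

The two quantities share a denominator, and every position counted in it is either a shared bit or a differing bit, so they sum to one.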

3. Perform hierarchical clustering on the distance matrix and show the
dendrogram (see the github notebook for the plot). *hc* is
*scipy.cluster.hierarchy*.

z = hc.linkage(dist_mat)
dendrogram = hc.dendrogram(z, labels=dist_df.columns, leaf_rotation=90)
plt.show()
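
The three steps can be tried end to end without RDKit. The sketch below uses a made-up boolean matrix in place of real fingerprints, assumes NumPy and SciPy are installed, and uses `fcluster` (not mentioned in the steps above) to cut the tree into flat groups. `no_plot=True` makes `dendrogram` return the leaf order without needing matplotlib.

```python
import numpy as np
from scipy.spatial.distance import pdist
import scipy.cluster.hierarchy as hc

# Four toy "fingerprints": a and b share bits, c and d share bits.
labels = ['a', 'b', 'c', 'd']
fingerprint_mat = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
], dtype=bool)

# Condensed Jaccard distance matrix, then hierarchical clustering.
dist_mat = pdist(fingerprint_mat, 'jaccard')
z = hc.linkage(dist_mat)

# Cut the tree into at most two flat clusters.
flat = hc.fcluster(z, t=2, criterion='maxclust')

# Leaf order of the dendrogram, computed without drawing anything.
leaf_order = hc.dendrogram(z, labels=labels, no_plot=True)['ivl']
print(dict(zip(labels, flat)), leaf_order)
```

Here `linkage` is called with its default method, as in the snippet above; other methods can be selected with the `method` argument.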


A helpful page for dendrograms using SciPy is this one:
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

Good luck!

Curt

Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Robert DeLisle
Next up is clustering...

I've got about 350 structures to cluster and I've worked through the
example code from the RDKit Cookbook (
http://www.rdkit.org/docs/Cookbook.html#clustering-molecules).  All seems
well and good there, but I would like to see the dendrogram.  I see that
there is a ClusterVis module to generate images, PDF, and SVG, but all
require a Cluster object as input.  I don't find anywhere a description of
acquiring or building that object based upon the results of clustering.

Any tips?

-Kirk


Re: [Rdkit-discuss] GetSubstructMatch vs MMFFOptimize

2016-05-14 Thread Paolo Tosco

Dear Robert,

The reason for the failure is that MMFF uses its own aromaticity model
(see
http://www.rdkit.org/docs/GettingStartedInPython.html#working-with-3d-molecules),
so the molecule's aromaticity flags are modified by the MMFF calls.
Therefore, after calling


AllChem.MMFFOptimizeMolecule(mols[0])

you will need to add the following call:

AllChem.SanitizeMol(mols[0],
  sanitizeOps = AllChem.SanitizeFlags.SANITIZE_KEKULIZE \
  | AllChem.SanitizeFlags.SANITIZE_SETAROMATICITY)

This will fix your problem.

Kind regards,
Paolo

On 05/14/2016 07:07 AM, Robert DeLisle wrote:

RDKitters,

I'm working on a project in which I want to align a collection of 
structures with their most similar structures and display the results 
in PyMOL.  To accomplish this, I've built a Python script similar to 
the one attached here in which I start with pairs of structures, find 
the MCS of those structures, create a template based on the MCS and a 
3D conformation of the structure of interest, and then generate a 
constrained conformation of a query structure.  I tried to comment the 
attached code enough to lead you through the process.


What I find is that quite often, the ConstrainedEmbed() function fails 
with the error "molecule doesn't match the core" which seems very odd 
since the pairs for which it fails are very similar. The attached .png 
shows one such pair and their MCS.


What I've found is that when I generate a 3D conformation for the
first structure and optimize it with MMFF (MMFFOptimize), this often
causes GetSubstructMatch to fail to find the MCS within the
structure.  If instead I use UFFOptimize, everything seems to work OK
most of the time.


In my code, I've noted where the error occurs and flanked it with some
print statements to show what happens. Specifically, at line 36 I have
the MMFFOptimize line, and at 37 the UFFOptimize line.  I've also
attached a set of structures for which MMFF fails.


While using UFFOptimize produces great results, I'm curious why
MMFFOptimize creates a problem, and whether this is a bug which
should be fixed or just a glitch related to the atom typing and other
parameterization that occurs with MMFF.


Thanks for any explanation or ideas.

-Kirk



