Thanks, Curt!  I'll give those a look.  It'll give me a very good reason to
start digging into SciPy a bit more and exploit the added functionality
that will bring.

Regarding my original question, and for anyone else who might be interested:

I did indeed find an answer through a lot of code dredging.  I found the
Murtagh.ClusterData() function in RDKit, and was able to generate clusters
from that.  The function returns a single-member list whose only element is
a Cluster object.  I can feed that object to ClusterVis.ClusterToImg
to get the dendrogram I wanted.  Here's a short code snippet:

c_tree = Murtagh.ClusterData(dists,nfps,Murtagh.WARDS,isDistData=True)
rdkit.ML.Cluster.ClusterVis.ClusterToImg(c_tree[0], size=(500,500),

I can then break the cluster tree into subtrees:

rdkit.ML.Cluster.ClusterUtils.SplitIntoNClusters(c_tree[0], 5)

And I've written a short function to extract the individual structure
memberships for each group:


groups = ClusterUtils.SplitIntoNClusters(c_tree[0], 5)

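def GetGroupMembers( grp, memberlist=None ):
    # Start a fresh list on each top-level call; a mutable default ([])
    # would be shared between calls and keep accumulating members.
    if memberlist is None:
        memberlist = []
    for child in grp.GetChildren():
        if child.GetData() is None:
            # Internal node: descend into its subtree
            GetGroupMembers( child, memberlist )
        else:
            # Leaf node: record its data
            memberlist.append( child.GetData() )

    return memberlist

print(GetGroupMembers(groups[0]))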
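
One caveat with the function above: it only looks at the children of the
node it is given, so if one of the groups is itself a single leaf (a
singleton cluster), it comes back empty.  Here is a rough sketch of a
variant that also handles that case.  The Node class is a hypothetical
stand-in I made up so the snippet runs without RDKit; it is not the real
Cluster class, and treating "no children" as the test for a leaf is my
assumption.

```python
class Node(object):
    """Hypothetical stand-in for a cluster tree node (not RDKit's class)."""

    def __init__(self, data=None, children=None):
        self._data = data
        self._children = children or []

    def GetChildren(self):
        return self._children

    def GetData(self):
        return self._data


def get_leaf_data(node):
    """Collect the data of every leaf under node, including node itself
    when it has no children (assumed here to mean it is a leaf)."""
    members = []
    stack = [node]
    while stack:
        current = stack.pop()
        children = current.GetChildren()
        if not children:
            members.append(current.GetData())
        else:
            # Push in reverse so leaves come out in left-to-right order
            stack.extend(reversed(children))
    return members


# A small tree: ((0, 3), 7)
tree = Node(children=[Node(children=[Node(data=0), Node(data=3)]),
                      Node(data=7)])
print(get_leaf_data(tree))
print(get_leaf_data(Node(data=5)))  # a singleton group
```

Walking the tree with an explicit stack also avoids passing a list through
recursive calls.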

On Sat, May 14, 2016 at 11:21 AM, Curt Fischer

> Hi Robert,
> For the number of molecules you are interested in, it's viable to use
> SciPy / NumPy clustering functions instead of rdkit's built in C-linked
> functions.  This approach will probably not be as fast as rdkit's built-in
> clustering functionalities, and will probably not scale to tens of
> thousands of molecules as well as rdkit's functions, but if you use SciPy
> or NumPy in other types of technical computing, this approach may be more
> transparent, generalizable, and easier to use.
> I have an example Jupyter notebook in GitHub that describes what I mean;
> here are the GitHub and nbviewer links:
> Here are some of the most important parts of the code for generating a
> dendrogram.
> 1. Generate a numpy fingerprint matrix from a list of rdkit Molecules.
> mols = []
> for smiles in smiles_list:
>     mol = Chem.MolFromSmiles(smiles)
>     mols.append(mol)
> fingerprint_mat = np.vstack([np.asarray(rdmolops.RDKFingerprint(mol, fpSize=2048),
>                                         dtype='bool') for mol in mols])
> 2. Generate the distance matrix.  *pdist* and *squareform* are from
> *scipy.spatial.distance*.
> dist_mat = pdist(fingerprint_mat, 'jaccard')
> dist_df = pd.DataFrame(squareform(dist_mat), index=smiles_list,
>                        columns=smiles_list)
> As far as I can tell, the Jaccard distance is equivalent to one minus the
> Tanimoto similarity.
> 3. Perform hierarchical clustering on the distance matrix and show the
> dendrogram (see the github notebook for the plot). *hc* is
> *scipy.cluster.hierarchy*.
> z = hc.linkage(dist_mat)
> dendrogram = hc.dendrogram(z, labels=dist_df.columns, leaf_rotation=90)
> A helpful page for dendrograms using SciPy is this one:
> Good luck!
> Curt
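
On the Jaccard/Tanimoto point in step 2: for binary fingerprints the two
should indeed add up to one, since both are ratios over the bits set in at
least one fingerprint.  Below is a small pure-Python check on two made-up
bit vectors.  The function names and example bits are mine, not from SciPy
or RDKit, and returning 1.0 / 0.0 for two all-zero vectors is just a
convention I picked for the sketch.

```python
def tanimoto_similarity(a, b):
    """Bits set in both / bits set in either, for equal-length 0/1 lists."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two empty fingerprints count as identical
    return both / float(either) if either else 1.0


def jaccard_distance(a, b):
    """Bits that differ / bits set in either, for equal-length 0/1 lists."""
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two empty fingerprints have distance zero
    return differ / float(either) if either else 0.0


# Made-up 8-bit fingerprints for illustration
fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]

sim = tanimoto_similarity(fp_a, fp_b)   # 3 shared bits out of 5 set in either
dist = jaccard_distance(fp_a, fp_b)     # 2 differing bits out of 5 set in either

assert abs(dist - (1.0 - sim)) < 1e-12
print("Tanimoto similarity:", sim)
print("Jaccard distance:   ", dist)
```

Every bit set in either vector is either shared or differing, so the two
numerators sum to the common denominator.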
On Sat, May 14, 2016 at 9:11 AM, Robert DeLisle
> wrote:
>> Next up is clustering...
>> I've got about 350 structures to cluster and I've worked through the
>> example code from the RDKit Cookbook.  All
>> seems well and good there, but I would like to see the dendrogram.  I see
>> that there is a ClusterVis module to generate images, PDF, and SVG, but all
>> require a Cluster object as input.  I don't find anywhere a description of
>> acquiring or building that object based upon the results of clustering.
>> Any tips?
>> -Kirk
