Hello,
I've been experimenting with the above module, and I am stuck with the cluster
object: I cannot find how to cut the tree at a specified distance and get the
list of cluster indices for the original molecules.
See the code below.
The 'Print()' method does show a tree and a 'Metric' value for each node, but
how would I process the object further to get a specific clustering rather
than a tree?
[E.g. in R one would use hclust and then cutree with h set to the desired
distance, yielding a list of integer cluster indices (obviously starting at 1,
for the joy of Python aficionados :D)].
Thanks
Giovanni
import pandas as pd
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
from rdkit import DataStructs
from rdkit.ML.Cluster import Murtagh
# 20 SMILES from Enamine REAL
SMILES_list = ['O=C(c1nc(C2CC2)oc1)NC[C@@H]3CN(C(C4OCCO4)=O)CCC3',
'CC(OCc1onc(C(NC2(CC3(NC(C4[C@H]([C@@H]5C4)CC5)=O)C2)C3)=O)c1)(C)C |&1:17,16|',
'CC1(C(C(NCC2(CC2)NC(CN3C(=O)OCC3)=O)=O)CCCC1)C',
'CC(NC([C@H]1[C@@H](F)C1)=O)CC2CN(C(c3c(Br)cn(C)n3)=O)C2',
'Cc1nc(c2cc1)ccc(C(N(C(CNC(C3(C(C)(C)C3)O)=O)C)C)=O)c2',
'COc1c(S(N2Cc(n3CC2)cnc3)(=O)=O)cccc1[N+]([O-])=O',
'Cc1c(C(N2CCN(CC3OCCOC3)CC2)=O)cccc1O',
'Cc1cc(C)n(C(C(N2CCN(C(C(=O)N)=O)CCC2)=O)C)n1',
'CCOCC(C(N[C@H]1[C@H]2C[C@H](CN2C(C(C=NNC3=O)=C3)=O)C1)=O)C |&1:7,10,8|',
'Cc1ncsc1CC(NCC2CCN(C(CC3C(C)C3)=O)CC2)=O',
'Cc1nc(C=CC(NCC(N2CC(C)OC(C)C2)(C)C)=O)[nH]c1',
'Cc1nnsc1C(NC[C@H](NC(C2CC=CC2)=O)CO)=O',
'CC(C(N[C@H]1[C@H](CC2CC2)CN(C(c3[nH]ccc3)=O)C1)=O)(CC#N)C |&1:4,5|',
'CCOCCON=C1CCN(C(CCSC)=O)CC1',
'C[C@H](N(C(C1CCC=CC1)=O)C)CNC(c2cccc(OC(C)(C)C)c2)=O',
'Cc1nc(C(C)C)c(C(N2C[C@H](NC([C@@H]3C[C@H](C(=O)N)CC3)=O)CC2)=O)cc1 |&1:14,16|',
'CCc1occc1C(N2CC(OCc3nnn(C4CC4)c3)CCC2)=O',
'CN(C(CNC(C1CC1)=O)=O)CC2CCN(C(C3(N(C)CCC3)C)=O)CC2',
'CC(CCCC1)=C1C(N2CC3(CCC(NC(c4c(C)n[nH]n4)=O)CC3)CC2)=O',
'CCC(N1CC(C(O)(C)C)(CNC(Cn2c(c3cc2)cc(Cl)cc3)=O)C1)=O']
# Generate the fingerprints (Morgan radius 3, folded to 2048 bits)
fps = [rdMolDescriptors.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(sm),
       radius=3, nBits=2048, useChirality=False)
       for sm in SMILES_list]
# Generate the distance matrix in a standard list format
# see: https://www.rdkit.org/docs/source/rdkit.ML.Cluster.Murtagh.html
# for i<j: d_ij = dists[j*(j-1)//2 + i]
distmat_list = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    distmat_list.extend([1 - s for s in sims])
# Create the hierarchical clustering object, using the SLINK method from Murtagh
hcl_slink = Murtagh.ClusterData(distmat_list, nPts=len(fps),
                                method=Murtagh.SLINK, isDistData=True)
hcl_slink[0].Print()
Cluster(39) Metric: 0.876033
Cluster(6) Metric: 0.000000
Cluster(38) Metric: 0.872549
Cluster(14) Metric: 0.000000
Cluster(37) Metric: 0.852174
Cluster(29) Metric: 0.842593
Cluster(3) Metric: 0.000000
Cluster(18) Metric: 0.000000
Cluster(36) Metric: 0.850877
Cluster(7) Metric: 0.000000
Cluster(35) Metric: 0.850394
Cluster(17) Metric: 0.000000
Cluster(34) Metric: 0.850000
Cluster(19) Metric: 0.000000
Cluster(33) Metric: 0.849558
Cluster(11) Metric: 0.000000
Cluster(32) Metric: 0.848739
Cluster(13) Metric: 0.000000
Cluster(31) Metric: 0.848739
Cluster(28) Metric: 0.842105
Cluster(26) Metric: 0.833333
Cluster(1) Metric: 0.000000
Cluster(25) Metric: 0.831858
Cluster(10) Metric: 0.000000
Cluster(20) Metric: 0.000000
Cluster(27) Metric: 0.834783
Cluster(9) Metric: 0.000000
Cluster(24) Metric: 0.830357
Cluster(4) Metric: 0.000000
Cluster(23) Metric: 0.825243
Cluster(8) Metric: 0.000000
Cluster(16) Metric: 0.000000
Cluster(30) Metric: 0.842975
Cluster(2) Metric: 0.000000
Cluster(22) Metric: 0.813084
Cluster(12) Metric: 0.000000
Cluster(21) Metric: 0.791304
Cluster(5) Metric: 0.000000
Cluster(15) Metric: 0.000000
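
For reference, here is a minimal sketch of the kind of cut I am after, written as a plain tree walk. It assumes only that each node offers GetChildren(), GetMetric() and GetIndex(), and that the metric never increases going from the root towards the leaves (which looks true for the SLINK tree printed above). StubNode, leaf_indices and cut_tree are hypothetical names made up for illustration; they are not part of RDKit.

```python
from typing import Dict, List


class StubNode:
    """Hypothetical stand-in exposing only the three methods used below."""

    def __init__(self, index, metric=0.0, children=()):
        self._index = index
        self._metric = metric
        self._children = list(children)

    def GetIndex(self):
        return self._index

    def GetMetric(self):
        return self._metric

    def GetChildren(self):
        return self._children


def leaf_indices(node) -> List[int]:
    """Return the indices of all leaves below `node` (a leaf returns itself)."""
    children = node.GetChildren()
    if not children:
        return [node.GetIndex()]
    found = []
    for child in children:
        found.extend(leaf_indices(child))
    return found


def cut_tree(root, height: float) -> Dict[int, int]:
    """Map each leaf index to a 1-based cluster label for a cut at `height`.

    A node merged at a metric <= height is kept as one cluster; otherwise
    the walk descends into its children. Assumes monotone merge metrics.
    """
    labels: Dict[int, int] = {}
    next_label = 1
    stack = [root]
    while stack:
        node = stack.pop()
        children = node.GetChildren()
        if not children or node.GetMetric() <= height:
            for idx in leaf_indices(node):
                labels[idx] = next_label
            next_label += 1
        else:
            # Reverse so that children are visited in their stored order.
            stack.extend(reversed(children))
    return labels


# Toy tree: leaves 1-4; (1,2) merge at 0.2, (3,4) at 0.5, everything at 0.9.
leaves = {i: StubNode(i) for i in range(1, 5)}
left = StubNode(5, 0.2, [leaves[1], leaves[2]])
right = StubNode(6, 0.5, [leaves[3], leaves[4]])
root = StubNode(7, 0.9, [left, right])

print(cut_tree(root, 0.3))
```

On the real tree this would presumably be called as cut_tree(hcl_slink[0], 0.85), assuming the RDKit cluster object behaves like the stub. Judging from the printed output, the leaves are numbered 1 to 20, so the keys would need a shift of -1 to index SMILES_list.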