(I see that I accidentally replied only to Andrew earlier; I'm copying the group this time.)
FWIW, in work on conformational clustering, I used the “most representative” molecule; that is, the real molecule closest to the mathematical centroid. This would probably be the best way of displaying a single molecule that typifies what is in the cluster.

-P.

On Tue, Sep 25, 2018 at 8:09 AM, Andrew Dalke <[email protected]> wrote:

> On Sep 21, 2018, at 14:53, Philipp Thiel <[email protected] tuebingen.de> wrote:
> > you probably read about the Tanimoto being a proper metric in case of having binary data
> > in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the revised edition.
>
> What we call Tanimoto is more broadly known as the Jaccard. Various sites
> demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric,
> such as https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance
> and https://arxiv.org/abs/1612.02696 .
>
> Going back to James T. Metz's original question, one alternative might be
> to use chemfp and the Taylor-Butina clustering implementation available at:
>
>   http://dalkescientific.com/writings/taylor_butina.py
>
> Following Dave Cosgrove's advice:
>
> > I expect James means what we used to call the cluster seed, i.e. the
> > molecule the cluster was based on, rather than the mathematical centroid.
> > Calculating distances from each cluster member to that would be quite
> > straightforward as a post-processing step although that would roughly
> > double the time taken.
>
> it's possible to change the reporting code from:
>
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
>         print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)
>
> so it does the post-processing:
>
>     print(len(clusters), "clusters", file=outfile)
>     for centroid_idx, members in clusters:
>         print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile)
>         subarena = arena.copy(indices=members)
>         centroid_fp = arena.get_fingerprint(centroid_idx)
>         result = subarena.threshold_tanimoto_search_fp(centroid_fp, threshold=0.0)
>         result.reorder()  # sort so the highest scores come first
>         for id, score in result.get_ids_and_scores():
>             print("=>", id, "score:", score)
>
> Cheers,
>
> Andrew
> [email protected]
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
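
P.S. In case it helps, here is a rough sketch of what I mean by the "most representative" molecule at the top of this message. It is not taken from chemfp or RDKit; the function name most_representative and the toy descriptor vectors are made up for illustration, and it assumes each cluster member is described by a plain numeric vector (for conformers, that might be aligned coordinates or some set of descriptors) where a Euclidean mean is meaningful. For binary fingerprints compared by Tanimoto, a different notion of "closest" would probably be needed.

```python
import math


def most_representative(vectors):
    """Return the index of the member closest to the cluster's mean vector.

    `vectors` is a list of equal-length numeric sequences, one per member.
    (Hypothetical helper for illustration only.)
    """
    if not vectors:
        raise ValueError("cluster is empty")
    n = len(vectors)
    dim = len(vectors[0])
    # The mathematical centroid: the component-wise mean of all members.
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    # Pick the real member with the smallest Euclidean distance to it.
    return min(range(n), key=lambda i: math.dist(vectors[i], centroid))


# Toy cluster of four 2-D descriptor vectors (invented values).
cluster = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]
best = most_representative(cluster)
print("most representative member index:", best)
```

Unlike the cluster seed discussed in the quoted text, this member is chosen after clustering, so it costs one extra pass over each cluster.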

