[R] Hidden Problems with Clustering Algorithms

Leonard Mada via R-help Mon, 21 Nov 2022 19:28:28 -0800

Dear R-Users,


I recently came across a presentation about hierarchical clustering. Unfortunately, it exhibits a hidden problem with clustering algorithms. The problem runs deeper, and I think that it warrants closer inspection by the statistical community.

The presentation is available online. Both the scaled & non-scaled versions show the problem.

de.NBI course - Advanced analysis of quantitative proteomics data using R: 03b Clustering Part2

[Note: it's more like introductory notes to basic statistics]
https://www.youtube.com/watch?v=7e1uW_BhljA
times:
- at 6:15 - 6:28 & 6:29 - 7:10 [2 versions, both non-scaled]
- at 5:51 - 6:10 [the scaled version]
- same problem at 7:56;

PROBLEM

Non-Scaled Version: (e.g. the one at 6:15)
- the upper 2 rows are split into various sub-clusters;

- the top tree: a cluster is formed by the right-right sub-tree (some 17 "genes" or similar "activities" / "expressions");

- the left-most 2 "genes" are actually over-expressed "genes" and functionally really belong to the previous/right sub-cluster;


Scaled-Version: (at 5:52)

- the left-most 2 "genes" are over-expressed at the same time as the right cluster, and not otherwise;

Unfortunately, the 2 over-expressed "genes" (outliers or extreme values) are split off from the relevant cluster and inserted as a separate main branch in the top dendrogram. Switching only the main left & right branches in the top tree would only mask this problem. The 2 pseudo-outliers are (probably) just the upper values in the larger cluster of over-expressed "genes" (all the dark genes should belong to the same cluster).

The middle sub-cluster (some 16 "genes") shows really NO activity. The main branches in the top tree should really split between this *NO*-activity cluster and the cluster showing activity (including the 2 massively over-expressed genes). The problem is present in the scaled version as well.

The hierarchical clustering algorithm fails. I have not analysed the data, but some problems may contribute to this:

- "gene expression" or "activity" may not be linear, but exponential or follow some power rule: a logarithmic transformation (or some other transformation) may have been useful;

- simple distances between clusters may be too inaccurate;

- the variance in the low-activity (middle) cluster may be very low (almost 0!), while the variance in the high-activity cluster may be much higher: the Mahalanobis distance or joining the sub-clusters based on some z/t-test taking into account the different variances may be more robust;
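
As a rough illustration of the first point in the list above (not an analysis of the data in the presentation), the sketch below runs average-linkage clustering on made-up one-dimensional values: four nearly inactive "genes", four moderately active ones and two extreme ones. The values, the choice of average linkage and the helper name average_linkage are all assumptions made for illustration; the linkage method used in the presentation is not stated here. The code is plain Python (standard library only) so that it is self-contained.

```python
import math
from itertools import combinations


def average_linkage(values, n_clusters=2):
    """Agglomerative clustering of 1-D values with average linkage.

    Returns the clusters as a sorted list of sorted index lists.
    Hypothetical helper written for this illustration only.
    """
    clusters = [[i] for i in range(len(values))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average pairwise absolute difference between the two clusters
            total = sum(abs(values[i] - values[j])
                        for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)


# Made-up toy values (assumption): indices 0-3 inactive, 4-7 active, 8-9 extreme
inactive = [0.9, 1.0, 1.1, 1.2]
active = [50.0, 60.0, 70.0, 80.0]
extreme = [900.0, 1000.0]

raw = inactive + active + extreme
logged = [math.log2(v) for v in raw]

raw_split = average_linkage(raw)     # top-level split on the raw scale
log_split = average_linkage(logged)  # top-level split after log2 transformation

print("raw :", raw_split)
print("log2:", log_split)
```

With these made-up numbers, the top-level split on the raw scale isolates the two extreme values and lumps the inactive and the moderately active values together. After the log2 transformation, the split falls between the inactive values and everything else, which is the split argued for above. This shows only that the choice of scale can move the main branch; it says nothing about which of the listed causes applies to the real data.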


These questions should be addressed by more senior statisticians.

I hope that the presentation remains on-line as is, as the clustering problem is really easy to see and to analyse. It is impossible to detect and visualise such anomalies in a heatmap with 1,000 gene-expressions or with 10,000 genes, or with 500-1000 samples. It is very obvious on this small heatmap.

I do not know if there are any robust tools to validate the generated trees. Inspecting by "eye" a dendrogram with > 1,000 genes and hundreds of samples is really futile.


Sincerely,

Leonard

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
