Dear R-Users,

Hidden Problems with Clustering Algorithms

I recently came across a presentation on hierarchical clustering. Unfortunately, it illustrates a hidden problem with clustering algorithms. The problem runs deeper than this one example, and I think it warrants closer inspection by the statistical community.

The presentation is available online. Both the scaled & non-scaled versions show the problem.

de.NBI course - Advanced analysis of quantitative proteomics data using R: 03b Clustering Part2
[Note: it's more like introductory notes to basic statistics]
https://www.youtube.com/watch?v=7e1uW_BhljA
Times:
- at 6:15 - 6:28 and 6:29 - 7:10 [2 versions, both non-scaled];
- at 5:51 - 6:10 [the scaled version];
- the same problem at 7:56.

PROBLEM

Non-Scaled Version: (e.g. the one at 6:15)
- the upper 2 rows are split into various sub-clusters;
- the top tree: a cluster is formed by the right-right sub-tree (some 17 "genes" or similar "activities" / "expressions");
- the left-most 2 "genes" are actually over-expressed "genes" and functionally belong to the previous/right sub-cluster;

Scaled-Version: (at 5:52)
- the left-most 2 "genes" are over-expressed together with the right cluster, and not otherwise;

Unfortunately, the 2 over-expressed "genes" (outliers or extreme values) are split off from the relevant cluster and placed on a separate main branch of the top dendrogram. Merely swapping the main left and right branches of the top tree would only mask this problem. The 2 pseudo-outliers are probably just the highest values in the larger cluster of over-expressed "genes" (all the dark genes should belong to the same cluster).

The middle sub-cluster (some 16 "genes") shows essentially NO activity. The main branches of the top tree should instead split between this *NO*-activity cluster and the cluster showing activity (including the 2 massively over-expressed genes). The problem is present in the scaled version as well.

The hierarchical clustering algorithm fails here. I have not analysed the data, but several issues may contribute to this:
- "gene expression" or "activity" may not be linear, but exponential, or may follow some power rule: a logarithmic transformation (or some other transformation) might have helped;
- simple distances between clusters may be too inaccurate;
- the variance in the low-activity (middle) cluster may be very low (almost 0!), while the variance in the high-activity cluster may be much higher: the Mahalanobis distance, or joining the sub-clusters based on some z/t-test that takes the different variances into account, may be more robust.
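
As a rough illustration of the first point, here is a minimal sketch in base R. The values are synthetic and were constructed only to mimic the pattern seen in the video (16 inactive, 17 active and 2 strongly over-expressed "genes"); they are NOT the presenter's data, and the helper name top_split is hypothetical. The sketch compares the 2 main branches obtained on the raw scale with those obtained after a log2(x + 1) transformation.

```r
# Synthetic expression levels (hypothetical, NOT the data from the video):
# 16 inactive "genes", 17 active ones, and 2 strongly over-expressed ones.
x <- c(seq(0, 0.3, length.out = 16),
       seq(64, 256, length.out = 17),
       1024, 1100)
names(x) <- c(paste0("inactive", 1:16),
              paste0("active", 1:17),
              paste0("extreme", 1:2))

# Complete-linkage clustering on Euclidean distances,
# cut into the 2 main branches of the tree.
top_split <- function(values) {
  cutree(hclust(dist(values), method = "complete"), k = 2)
}

groups_raw <- top_split(x)            # raw scale
groups_log <- top_split(log2(x + 1))  # log scale; the +1 avoids log2(0)

# Cross-tabulate the known gene type against the main branch it lands in.
gene_type <- sub("[0-9]+$", "", names(x))
print(table(gene_type, branch = groups_raw))
print(table(gene_type, branch = groups_log))
```

With these constructed values, the raw-scale split isolates the 2 extreme "genes" on their own main branch, while the log-scale split separates the 16 inactive "genes" from the other 19. Whether a log transformation would have the same effect on the real data is something I cannot tell without analysing it.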

These questions should be addressed by more senior statisticians.

I hope that the presentation remains on-line as is, as the clustering problem is really easy to see and to analyse. It is practically impossible to detect and visualise such anomalies in a heatmap with 1,000 or 10,000 genes, or with 500-1,000 samples. On this small heatmap it is very obvious.

I do not know if there are any robust tools to validate the generated trees. Inspecting a dendrogram with > 1,000 genes and hundreds of samples by eye is futile.
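
One partial numerical check that base R does provide is the cophenetic correlation, i.e. the correlation between the original pairwise distances and the heights at which the pairs are joined in the tree (stats::cophenetic). The sketch below computes it for three linkage methods on synthetic values (again hypothetical, not the data from the video). It only measures how faithfully a tree reproduces the chosen distances, so it should not be read as a full validation of the clusters themselves.

```r
# Synthetic expression levels (hypothetical, NOT the data from the video).
x <- c(seq(0, 0.3, length.out = 16),
       seq(64, 256, length.out = 17),
       1024, 1100)

# Euclidean distances on the log2(x + 1) scale.
d <- dist(log2(x + 1))

# Cophenetic correlation for several linkage methods:
# correlation between the input distances and the tree's merge heights.
methods <- c("single", "complete", "average")
coph_cor <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  cor(as.vector(d), as.vector(cophenetic(hc)))
})

print(round(coph_cor, 3))
```

A value close to 1 indicates that the dendrogram preserves the pairwise distances well; it says nothing about whether the distance measure or the scale was appropriate in the first place.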

Sincerely,

Leonard
