Dear Luigi,

As others have already replied, you cannot expect a clustering algorithm to produce exactly the result you expect intuitively. The results of clustering algorithms depend strongly on the parameters and, even more importantly, on the distance/similarity measure that is used. k-means, for instance, uses the Euclidean distance. As a result, it works nicely for spherical clusters of approximately the same radius. apcluster, unless you choose a different similarity measure, uses negative squared distances, which leads to very similar properties. Your data set consists of two clusters, one of which is much more spread out. That some parts of the larger cluster are assigned to the other cluster looks weird, but it is perfectly explained by the properties of the algorithms. There is plenty of literature on the properties of clustering algorithms. That's my 2 cents about this.

In your case, however, as already pointed out in Bill Dunlap's reply, scaling is the more important issue. Neither k-means nor apcluster performs any scaling of the data, and your two axes differ strongly in scale. Enter the following to see how the two clustering algorithms "see" your data (i.e. with two equally scaled axes):

    plot(z, xlim=c(0, 50), ylim=c(0, 50))

Given this, it is no longer surprising that both algorithms split the data in the way they do.
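
If you want to reproduce the effect without your data, here is a minimal sketch with synthetic data; the numbers are purely my own invention, chosen to mimic the situation described above (a compact cluster plus a widely spread one, with the two axes on very different scales):

    ## zz: synthetic stand-in for your z (my assumption, not your data):
    ## a compact cluster and a widely spread cluster on different scales
    set.seed(1)
    zz <- rbind(cbind(rnorm(50, mean=2, sd=0.3), rnorm(50, mean=5,  sd=1)),
                cbind(rnorm(50, mean=4, sd=0.5), rnorm(50, mean=30, sd=15)))
    cl <- kmeans(zz, centers=2)
    plot(zz, col=cl$cluster, asp=1) ## equal axis scaling; the split
                                    ## typically cuts the spread-out cluster

apcluster with a negDistMat(r=2) similarity behaves much the same on such data.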

Actually, if you re-scale the data, apcluster produces the result you expect:

    z2 <- scale(z)
    m <- apclusterK(negDistMat(r=2), z2, K=2, verbose=TRUE)
    plot(m, z2)
    plot(m, z) ## it even works to superimpose the clustering
               ## result on the original data
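
For what it's worth, k-means should benefit from the rescaling in the same way; a quick sketch along the same lines (untested against your actual data):

    cl2 <- kmeans(z2, centers=2) ## k-means on the scaled data
    plot(z, col=cl2$cluster)     ## colour the original data by cluster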

I hope that helps.

Best regards,
Ulrich
