[R] distance in the function kmeans

Jari Oksanen Fri, 28 May 2004 22:55:29 -0700

My thread broke because I am writing this at home, and no new messages on this subject had arrived by the time I got home. I hope this still reaches the interested parties.

There are several methods that find centroids (means) from distance data. Centroid clustering methods do so, and so does classic scaling a.k.a. metric multidimensional scaling a.k.a. principal co-ordinates analysis (in R function cmdscale the means are found in C function dblcen.c in R sources). Strictly this centroid finding only works with Euclidean distances, but these methods willingly handle any other dissimilarities (or distances). Sometimes this results in anomalies like upper levels being below lower levels in cluster diagrams or in negative eigenvalues in cmdscale. In principle, kmeans could do the same if she only wanted.
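To make the centroid-from-distances idea concrete, here is a minimal sketch in plain R of the double centring step that classical scaling relies on. The data matrix x and every variable name are made up for illustration, and this is not the code in dblcen.c; it only checks that, for Euclidean distances, the diagonal of the double-centred matrix gives each point's squared distance to the centroid without ever looking at the coordinates.

```r
# Sketch: recover squared distances to the centroid from a distance matrix alone.
set.seed(1)
x  <- matrix(rnorm(20), nrow = 10)     # hypothetical data: 10 points, 2 variables
d2 <- as.matrix(dist(x))^2             # squared Euclidean distances

# Double centring: remove row and column means of d2, then scale by -1/2
n <- nrow(d2)
J <- diag(n) - matrix(1 / n, n, n)     # centring matrix
B <- -0.5 * J %*% d2 %*% J

# diag(B) should equal each point's squared distance to the centroid
from_dist <- diag(B)
from_data <- rowSums(scale(x, center = TRUE, scale = FALSE)^2)
all.equal(from_dist, from_data, check.attributes = FALSE)
```

With a non-Euclidean dissimilarity the same arithmetic still runs, but B is no longer guaranteed to be positive semidefinite, which is where the negative eigenvalues come from.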

Is it correct to use non-Euclidean dissimilarities when Euclidean distances were assumed? In my field (ecology) we know that Euclidean distances are often poor, and some other dissimilarities have better properties, and I think it is OK to break the rules (or `violate the assumptions'). Now we don't know what kind of dissimilarities were used in the original post (I think I never saw this specified), so we don't know if they can be euclidized directly using ideas of Petzold or Simpson. They might be semimetric or other sinful dissimilarities, too. These would be bad in the sense Uwe Ligges wrote: you wouldn't get centres of Voronoi polygons in original space, not even non-overlapping polygons. Still they might work better than the original space (who wants to be in the original space when there are better spaces floating around?)

The following trick handles the problem by euclidizing the space implied by any dissimilarity measure (metric or semimetric). Here mdata is your original (rectangular) data matrix, and dis holds any dissimilarities computed from it:

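tmp <- cmdscale(dis, k=min(dim(mdata))-1, eig=TRUE)
# In current R, tmp$eig can be longer than ncol(tmp$points), so align it before filtering
eucspace <- tmp$points[, tmp$eig[seq_len(ncol(tmp$points))] > 0.01, drop=FALSE]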

The condition removes axes with negative or almost-zero eigenvalues that you will get with semimetric dissimilarities.
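As a small illustration of what the condition guards against, the sketch below builds a hypothetical three-object dissimilarity that violates the triangle inequality and looks at the eigenvalues cmdscale reports. The matrix m and the name keep are invented for this example; the eigenvalue vector is cut to the number of returned axes on the assumption that cmdscale may return more eigenvalues than columns.

```r
# Hypothetical three-object dissimilarity that breaks the triangle inequality:
# d(1,3) = 3 is larger than d(1,2) + d(2,3) = 2, so it is only semimetric.
m <- matrix(c(0, 1, 3,
              1, 0, 1,
              3, 1, 0), nrow = 3, byrow = TRUE)
dis <- as.dist(m)

# cmdscale may warn that some of the requested eigenvalues are not positive
tmp <- suppressWarnings(cmdscale(dis, k = 2, eig = TRUE))
tmp$eig                                  # includes a negative eigenvalue

# Keep only axes with clearly positive eigenvalues
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
eucspace <- tmp$points[, keep, drop = FALSE]
ncol(eucspace)                           # number of axes that survive the filter
```

The axes that are dropped carry the part of the dissimilarity that cannot be represented in a real Euclidean space, so the retained configuration is an approximation in such cases.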

Then just call kmeans with eucspace as its argument. If your dis is Euclidean, this is only a rotation, and kmeans of eucspace and of mdata should give the same clusters (given the same starting centres). For other types of dis (even a semimetric dissimilarity) this maps your dissimilarities onto a Euclidean space, which in effect is the same as performing kmeans with your original dissimilarity.
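The Euclidean case can be checked with a short sketch. Everything here is an assumption made for illustration: the simulated mdata (more columns than rows, as is common for community data, so that k = min(dim(mdata)) - 1 keeps every axis), the three shifted groups, and the choice of rows 1, 5 and 9 as starting centres so that both runs begin from the same objects.

```r
set.seed(42)
# Hypothetical data: 12 sites (rows) by 20 variables, in three shifted groups
grp   <- rep(1:3, each = 4)
mdata <- matrix(rnorm(12 * 20), nrow = 12) + 10 * grp
dis   <- dist(mdata)                     # Euclidean, so the mapping is a rotation

tmp  <- cmdscale(dis, k = min(dim(mdata)) - 1, eig = TRUE)
keep <- tmp$eig[seq_len(ncol(tmp$points))] > 0.01
eucspace <- tmp$points[, keep, drop = FALSE]

# Distances between objects are preserved by the mapping
all.equal(as.vector(dist(eucspace)), as.vector(dis))

# Same starting rows for both runs, so the comparison does not depend on random starts
start  <- c(1, 5, 9)
km_raw <- kmeans(mdata, centers = mdata[start, ])
km_euc <- kmeans(eucspace, centers = eucspace[start, , drop = FALSE])
table(raw = km_raw$cluster, euc = km_euc$cluster)
```

For a non-Euclidean dis, replace the dist(mdata) line with your own dissimilarities; the comparison with kmeans on mdata then no longer applies, since the two spaces differ.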

Cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
