Re: [R] distance in the function kmeans
n.bouget wrote:

> Hi,
> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
> Thanks in advance,
> Nicolas BOUGET

As the name says, kmeans() calculates *means* (centres) of clusters. It does not make any sense to do that on distances ...

Uwe Ligges

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] distance in the function kmeans
On Fri, 28 May 2004, Uwe Ligges wrote:

> n.bouget wrote:
>> Hi,
>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
>> Thanks in advance,
>> Nicolas BOUGET
>
> As the name says, kmeans() calculates *means* (centres) of clusters. It does not make any sense to do that on distances ...
>
> Uwe Ligges

That's not really true. The k-means target criterion has an equivalent formulation in terms of distances, and that formulation uses squared Euclidean distances. However, as far as I know, you cannot compute it directly in R for any other distance. Using pam is the thing that comes closest.

Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
[EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/
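A minimal sketch of the pam route mentioned above, for readers who want to cluster on a distance other than Euclidean. It assumes the cluster package is installed; the simulated matrixdata, the choice of Manhattan distance and k = 3 are illustrative assumptions, not part of the original question.

```r
library(cluster)  # provides pam()

set.seed(1)
# Hypothetical data: three groups of 20 points in two dimensions
matrixdata <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                    matrix(rnorm(40, mean = 4), ncol = 2),
                    matrix(rnorm(40, mean = 8), ncol = 2))

# pam() accepts a dissimilarity object, so any distance can be plugged in;
# Manhattan distance is an arbitrary choice for this sketch
d <- dist(matrixdata, method = "manhattan")
fit <- pam(d, k = 3)

fit$id.med             # row indices of the medoids, which are actual observations
table(fit$clustering)  # cluster sizes
```

The medoids are rows of the data set, which is exactly the restriction that distinguishes pam from k-means.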
Re: [R] distance in the function kmeans
> n.bouget wrote:
>> Hi,
>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...

Yes, but how can we choose the distance used to calculate the centers?

>> Thanks in advance,
>> Nicolas BOUGET
>
> As the name says, kmeans() calculates *means* (centres) of clusters. It does not make any sense to do that on distances ...
>
> Uwe Ligges
Re: [R] distance in the function kmeans
[EMAIL PROTECTED] wrote:

>> n.bouget wrote:
>>> Hi,
>>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
>
> Yes, but how can we choose the distance used to calculate the centers?

Ah, you want to use a different distance measure (e.g. euclidean, manhattan, ...) as in other cluster methods? Well, that's not possible with the kmeans() implementation. See ?kmeans, which tells you:

    The data given by x is clustered by the k-means algorithm. When this terminates, all cluster centres are at the mean of their Voronoi sets (the set of data points which are nearest to the cluster centre).

    The algorithm of Hartigan and Wong (1979) is used.

Of course, you can do some projection based on the calculation of distances, but I don't think there are functions available to do that completely automatically, and interpretation of the results won't be that easy ...

Uwe Ligges

>>> Thanks in advance,
>>> Nicolas BOUGET
>>
>> As the name says, kmeans() calculates *means* (centres) of clusters. It does not make any sense to do that on distances ...
>>
>> Uwe Ligges
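The statement quoted from ?kmeans can be checked on a small example. This is only a sketch on simulated data (the matrix x and the choice of two clusters are assumptions made for illustration): after kmeans() terminates, each reported centre should equal the coordinate-wise mean of the points assigned to that cluster.

```r
set.seed(1)
# Hypothetical data: two groups of 50 points in two dimensions
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2)

# Recompute each centre as the column means of the points in that cluster
means <- t(sapply(1:2, function(i) colMeans(x[km$cluster == i, , drop = FALSE])))

# Compare with the centres reported by kmeans(), ignoring dimnames
centres_match <- isTRUE(all.equal(unname(km$centers), unname(means)))
centres_match
```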
Re: [R] distance in the function kmeans
On Fri, 28 May 2004 09:37:35 +0200, n.bouget wrote:

> Hi,
> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
> Thanks in advance,
> Nicolas BOUGET

It might be interesting to look at this from the pam() perspective: what exactly is pam() lacking that kmeans() does for you?

Christian, are you suggesting that pam() could do the job if

1) there was a dist(., method = "a la kmeans")
2) pam() could be started from a user-specified set of medoids instead of the Kaufman-Rousseeuw-optimal ones?

Regards,
Martin Maechler
Re: [R] distance in the function kmeans
On Fri, 28 May 2004, Martin Maechler wrote:

> On Fri, 28 May 2004 09:37:35 +0200, n.bouget wrote:
>> Hi,
>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
>> Thanks in advance,
>> Nicolas BOUGET
>
> It might be interesting to look at this from the pam() perspective: what exactly is pam() lacking that kmeans() does for you?
>
> Christian, are you suggesting that pam() could do the job if
>
> 1) there was a dist(., method = "a la kmeans")
> 2) pam() could be started from a user-specified set of medoids instead of the Kaufman-Rousseeuw-optimal ones?

The k-means criterion is equivalent to: find a partition C = C_1 \cup ... \cup C_k such that

    \sum_{i=1}^k \sum_{x_j, x_l \in C_i} d(x_j, x_l) / |C_i| = \min!

where d is the squared Euclidean distance (see the Bock book). You may wonder what clustering this would lead to with another distance.

The difference from pam is that pam minimizes sums of distances to centroid objects, which have to be part of the dataset. In the formulation above, k-means does not need centroid objects; no mean objects are needed. Thus, pam with squared Euclidean distances is a kind of approximation to k-means. (In practice, both are approximations to a global optimum.)

A further version would also be possible if other distances were allowed: the pam criterion would be optimized, but the cluster centers would be allowed to lie elsewhere than on an object of the sample. Of course, pam and the original k-means are more or less easy to compute, while the suggested alternatives may be computationally complex.

Best,
Christian

> Regards,
> Martin Maechler

Christian Hennig
Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg
[EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/
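The equivalence stated above can be checked numerically. The following is a sketch on simulated data (the matrix x and k = 2 are assumptions for illustration). It counts every pair of points once, in which case the pairwise sum divided by the cluster size equals the within-cluster sum of squares reported by kmeans(); counting ordered pairs would double the value without changing which partition minimizes it.

```r
set.seed(2)
# Hypothetical data: two groups of 30 points in two dimensions
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 5)

# For each cluster: sum of squared Euclidean distances over all unordered
# pairs of members, divided by the cluster size |C_i|
pairwise <- sapply(1:2, function(i) {
  xi <- x[km$cluster == i, , drop = FALSE]
  sum(dist(xi)^2) / nrow(xi)
})

# km$withinss holds the sums of squared distances to the cluster means
criterion_match <- isTRUE(all.equal(as.numeric(pairwise), as.numeric(km$withinss)))
criterion_match
```

No cluster mean appears in the computation of pairwise, which is the point of the distance-based formulation.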
Re: [R] distance in the function kmeans
n.bouget wrote:

> Hi,
> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
> Thanks in advance,
> Nicolas BOUGET

One solution is to transform the data in such a way that the Euclidean distance between the transformed values represents some other distance between the original values. This works at least for the Mahalanobis distance, when one applies a multivariate technique to a PCA-transformed and re-scaled matrix, but I don't know whether there are transformations for other distance measures.

Thomas P.
Re: [R] distance in the function kmeans
I don't exactly understand what you are doing. Could you show me the program that you run to do that?

> n.bouget wrote:
>> Hi,
>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
>> Thanks in advance,
>> Nicolas BOUGET
>
> One solution is to transform the data in such a way that the Euclidean distance between the transformed values represents some other distance between the original values. This works at least for the Mahalanobis distance, when one applies a multivariate technique to a PCA-transformed and re-scaled matrix, but I don't know whether there are transformations for other distance measures.
>
> Thomas P.
Re: [R] distance in the function kmeans
Thomas Petzoldt wrote:

> n.bouget wrote:
>> Hi,
>> I want to know which distance is used in the function kmeans and whether we can change this distance. Indeed, in the function pam, we can pass a distance matrix as a parameter (with the line pam <- pam(dist(matrixdata), k=7)), but we can't do that in the function kmeans; we have to pass the data matrix directly ...
>> Thanks in advance,
>> Nicolas BOUGET
>
> One solution is to transform the data in such a way that the Euclidean distance between the transformed values represents some other distance between the original values. This works at least for the Mahalanobis distance, when one applies a multivariate technique to a PCA-transformed and re-scaled matrix, but I don't know whether there are transformations for other distance measures.
>
> Thomas P.

Other solutions, from an ecological paper, are:

  Chord distance
  Chi square metric
  Chi square distance
  Hellinger distance
  Distance between species profiles

All of these can be seen as Euclidean distances of some transformation of the data. The paper

  Pierre Legendre and Eugene D. Gallagher (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, Vol. 129, Issue 2, 271-280

explains the concept and how to do the transformations. An R example is given in the help file of decostand() in Jari Oksanen's vegan library for two of the transformations mentioned above.

Gav

--
Gavin Simpson                     [T] +44 (0)20 7679 5522
ENSIS Research Fellow             [F] +44 (0)20 7679 7565
ENSIS Ltd. ECRC                   [E] [EMAIL PROTECTED]
UCL Department of Geography       [W] http://www.ucl.ac.uk/~ucfagls/cv/
26 Bedford Way                    [W] http://www.ucl.ac.uk/~ucfagls/
London. WC1H 0AP.
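As an illustration of the idea above, here is a base-R sketch for the Hellinger case. The abundance table counts is simulated and hypothetical, and k = 3 is an arbitrary choice. The transformation is written out by hand as the square root of row-wise relative abundances; vegan's decostand(), mentioned above, is the packaged route and is not used here so that the sketch has no dependencies.

```r
set.seed(7)
# Hypothetical abundance table: 12 sites (rows) by 5 species (columns);
# adding 1 keeps every row sum positive
counts <- matrix(rpois(60, lambda = 4) + 1, nrow = 12)

# Hellinger transformation: square root of row-wise relative abundances
hell <- sqrt(counts / rowSums(counts))

# Hellinger distance between sites 1 and 2, computed from its definition
p1 <- counts[1, ] / sum(counts[1, ])
p2 <- counts[2, ] / sum(counts[2, ])
d12 <- sqrt(sum((sqrt(p1) - sqrt(p2))^2))

# The same value as a plain Euclidean distance on the transformed data
d12_euclid <- as.matrix(dist(hell))[1, 2]
same_distance <- isTRUE(all.equal(d12, d12_euclid))

# kmeans() on the transformed data therefore clusters with respect to
# (squared) Hellinger distance
km <- kmeans(hell, centers = 3, nstart = 10)
table(km$cluster)
```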
Re: [R] distance in the function kmeans
Gavin Simpson wrote:

> ... An R example is given in the help file of decostand() in Jari Oksanen's vegan library for two of the transformations mentioned above.

Pre-empting the usual response about proper terminology, I of course meant "package", not "library".

Gav
Re: [R] distance in the function kmeans
[EMAIL PROTECTED] wrote:

> I don't exactly understand what you are doing. Could you show me the program that you run to do that?

I did such things some time ago, so the following comes (as usual) without warranty. There are several methods, e.g. using Cholesky factorization, singular value decomposition or principal components. Given mdata as the original data matrix, it works with hclust and should be applicable to kmeans too:

    # with svd: left singular vectors of the column-centred data
    z <- svd(scale(mdata, scale = FALSE))$u
    cl <- hclust(dist(z), method = "ward")   # newer R versions call this method "ward.D"

    # with princomp: principal component scores, rescaled to unit variance
    pc <- princomp(mdata, cor = FALSE)
    pcdata <- as.data.frame(scale(pc$scores))
    cl <- hclust(dist(pcdata), method = "ward")

... but as I mentioned, this is only an example showing that methods working with the Euclidean distance can be applied to other distance measures when an appropriate transformation of the data exists and, according to Gavin, there are indeed some other possibilities.

Thomas P.
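The claim behind this recipe can be verified on simulated data. The sketch below makes several assumptions: mdata is a hypothetical matrix of correlated variables, prcomp() is swapped in for princomp() (the scores are rescaled afterwards either way), and k = 2 is arbitrary. It checks that Euclidean distances between rescaled principal component scores equal Mahalanobis distances between the original rows, and then passes the transformed data to kmeans().

```r
set.seed(42)
# Hypothetical data: 20 observations of 3 correlated variables
mixing <- matrix(c(1, 0.5, 0.2,
                   0, 1,   0.3,
                   0, 0,   1), nrow = 3)
mdata <- matrix(rnorm(60), ncol = 3) %*% mixing

# Principal component scores, rescaled to unit variance
z <- scale(prcomp(mdata)$x)
d_euclid <- as.matrix(dist(z))

# Mahalanobis distances between all pairs of original rows;
# mahalanobis() returns squared distances, hence the sqrt()
S <- cov(mdata)
d_mahal <- sqrt(sapply(seq_len(nrow(mdata)), function(j) {
  mahalanobis(mdata, center = mdata[j, ], cov = S)
}))

same_distances <- isTRUE(all.equal(unname(d_euclid), unname(d_mahal)))

# k-means on the transformed data, i.e. with respect to Mahalanobis distance
km <- kmeans(z, centers = 2, nstart = 10)
table(km$cluster)
```

Note that the covariance used here is that of the whole data set, so the result reflects one global Mahalanobis metric, not a separate one per cluster.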