Hi all,
 
I've got clusters and would like to match individual records to each
cluster based on a sum of squares deviation.  For each cluster and
individual, I've got 50 variables to use (measured in the same way).
 
Matrix 1 is individuals and is 25000x50.  Matrix 2 is the cluster
centroids and is 100x50.  The same variables are found in each matrix
in the same order.  I'd like to calculate the 'distance' from each row of
matrix 1 to each row of matrix 2 and, for every individual, get the 100
centroids (row IDs 1 to 100) ranked by distance.
 
I tried using the rdist() and dist() functions, but they return true
(Euclidean) distances, and all I want is the sum of squared deviations
across the 50 variables.  I don't know how to program that efficiently.
Because of the size of the data I'm not sure that apply would work well
here, which is why I was using a for loop.
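
To make the target quantity concrete: the sum of squared deviations is the squared Euclidean distance, so ranking centroids by either measure gives the same order. Below is a minimal sketch for a single record, using small made-up matrices. The names `x`, `cen`, `ssq`, `ranking` and `result` are illustrative assumptions, not objects from the real data.

```r
# Toy data standing in for the real matrices (names are illustrative only).
set.seed(1)
x   <- matrix(rnorm(5 * 3), nrow = 5)   # 5 "individuals", 3 variables
cen <- matrix(rnorm(4 * 3), nrow = 4)   # 4 "centroids", same 3 variables

# Sum of squared deviations of individual 1 from each centroid:
# subtract the record from every centroid row, square, sum across variables.
ssq <- rowSums(sweep(cen, 2, x[1, ])^2)

# Centroid numbers ranked from closest to farthest, with their deviations.
ranking <- order(ssq)
result  <- cbind(centroid = ranking, ssq = ssq[ranking])
```

Taking `sqrt(ssq)` would recover the Euclidean distances that dist() reports for the same rows.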
 
The (highly inefficient) code I was using is below if that helps at all.
I give you permission to laugh if you want.  I'm not remotely close to a
programmer.
 
Are there any suggestions from the general readership?  I'm using R 1.9.0
on Windows XP with 1GB of RAM.
 
Thanks for your attention,
Danny

-------------------------------------------
#Calculate Euclidean distances between two sets of matrices.
library(foreign)
library(fields)
 
#centroid is small file with 100x50
centroid <- as.data.frame(read.spss("C:\\centroid.sav"))
#in_data is 25000x50
in_data <- as.data.frame(read.spss("C:\\in_vars.sav"))
 
#loop through the in_data records, calculate distances to the 100 centroids
#sort the distances in ascending order and write out the centroid # and
#distance for all 100.
 
for(i in 1:nrow(in_data)) {
 
#first column is the centroid #.  columns 2 through 51 have data.
aa <- as.matrix(centroid[,2:51])

#first column is a unique identifier.  columns 2 through 51 have data.
bb <- as.matrix(in_data[i,2:51])
 
#merge the in_data row to the 100 centroids and calculate Euclidean distance.
cc <- rdist(rbind(bb,aa))
 
#take first row of distance matrix - this row is the distance of the
#in_data row to all 100 centroids (entries 2 through 101).
dd <- as.matrix(cc[1,2:101])
 
#sort dd on distance and attach the centroid number.
ee <-c(t(cbind(sort.list(dd), sort(dd))))
 
#write sorted distance to file
write(ee,  file="C:\\cluster_distances.txt",ncolumns=200, append=TRUE)
 
}
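
For comparison with the loop above, here is a hedged sketch of one way to get all the sums of squared deviations at once with matrix algebra, using the identity |x - c|^2 = |x|^2 + |c|^2 - 2 x.c. The helper `ssq_all`, the toy matrices and the temporary output file are assumptions made for illustration. With the real data, `x` would be `as.matrix(in_data[,2:51])` and `cen` would be `as.matrix(centroid[,2:51])`. Note that it writes squared distances, not the Euclidean distances that rdist returns.

```r
# Hypothetical helper (not from any package): squared deviations from every
# row of x to every row of cen, as an nrow(x) by nrow(cen) matrix.
ssq_all <- function(x, cen) {
  d2 <- outer(rowSums(x^2), rowSums(cen^2), "+") - 2 * (x %*% t(cen))
  pmax(d2, 0)  # clip tiny negative values caused by rounding
}

# Toy data standing in for the 25000x50 and 100x50 matrices.
set.seed(2)
x   <- matrix(rnorm(6 * 4), nrow = 6)
cen <- matrix(rnorm(3 * 4), nrow = 3)

d2 <- ssq_all(x, cen)

# For each individual (row): centroid numbers from closest to farthest,
# and the matching squared deviations in ascending order.
rank_id <- t(apply(d2, 1, order))
rank_d2 <- t(apply(d2, 1, sort))

# Interleave the columns as id1 d1 id2 d2 ..., one line per individual.
out <- matrix(0, nrow(d2), 2 * ncol(d2))
out[, seq(1, ncol(out), by = 2)] <- rank_id
out[, seq(2, ncol(out), by = 2)] <- rank_d2

# Placeholder output path; write everything in one call instead of appending.
out_file <- tempfile(fileext = ".txt")
write.table(out, file = out_file, row.names = FALSE, col.names = FALSE)
```

A 25000x100 matrix of doubles takes about 20 MB, so the full distance matrix should fit comfortably in 1GB of RAM.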


