Dear R users,

I need an efficient method to compute the correlation (or at least the Euclidean distance, if that is easier) between specific rows of a data frame (46,232 rows, 29 columns). The pairs of rows I want to compare share a common value in one of the columns. For example:

x <- data.frame(id = rep(sample(1:100000, size = 10000), 2),
                a = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE),
                b = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE),
                c = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE))
x$id <- factor(x$id)
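Since each id occurs exactly twice, one vectorised approach might be to order the rows by id, split them into two aligned matrices (first and second member of each pair), and compute all the row-wise Pearson correlations at once. An untested sketch, shown here on a small stand-in data set rather than the real one:

```r
## Sketch, assuming each id occurs exactly twice.
## Small reproducible stand-in for the real data:
set.seed(1)
x <- data.frame(id = rep(1:5, 2),
                a = c(rnorm(9), NA),
                b = rnorm(10),
                c = rnorm(10))
x$id <- factor(x$id)

xs <- x[order(x$id), ]                    # group the two rows of each pair
i1 <- seq(1, nrow(xs), by = 2)            # first member of each pair
i2 <- seq(2, nrow(xs), by = 2)            # second member of each pair
m1 <- as.matrix(xs[i1, -1])
m2 <- as.matrix(xs[i2, -1])

ok <- !is.na(m1) & !is.na(m2)             # keep only complete value pairs
m1[!ok] <- NA
m2[!ok] <- NA

d1 <- m1 - rowMeans(m1, na.rm = TRUE)     # centre each row (the vector
d2 <- m2 - rowMeans(m2, na.rm = TRUE)     # recycles down the rows)
r  <- rowSums(d1 * d2, na.rm = TRUE) /
      sqrt(rowSums(d1^2, na.rm = TRUE) * rowSums(d2^2, na.rm = TRUE))
names(r) <- as.character(xs$id[i1])       # one correlation per id
```

This replaces the per-id loop with a handful of whole-matrix operations, which should scale much better to tens of thousands of pairs.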
I want to compute the correlation (over columns a, b and c) between the two rows that share the same id. Using a for loop with dist() works, but takes a long time (more than an hour; I have 1 GB of RAM):

p <- list()
for (i in levels(x$id)) {
    p[[i]] <- dist(x[x$id == i, -1])
}

Is there a more efficient way? I thought about apply()/sapply() and friends, but I cannot see how to make them work over these pairs of rows.

The second problem is that I also need to know how many degrees of freedom (i.e. non-missing pairs of values) were used in each correlation. Is there a way to do this efficiently as well?

I hope this makes sense! Thank you all very much in advance!

Eleni

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
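P.S. For the degrees-of-freedom question: once the rows are split into two aligned matrices by id (one row per pair member, as in the example data above), counting the non-missing value pairs is a single rowSums() call. An untested sketch on toy data:

```r
## Sketch: count, for each id, how many (a, b, c) value pairs are
## complete in both rows.  Assumes each id occurs exactly twice.
set.seed(1)
x <- data.frame(id = rep(1:5, 2),
                a = c(rnorm(9), NA),     # one missing value, in id 5's pair
                b = rnorm(10),
                c = rnorm(10))
x$id <- factor(x$id)

xs <- x[order(x$id), ]
i1 <- seq(1, nrow(xs), by = 2)           # first member of each pair
i2 <- seq(2, nrow(xs), by = 2)           # second member of each pair
m1 <- as.matrix(xs[i1, -1])
m2 <- as.matrix(xs[i2, -1])

## Non-missing pairs of values per id: a pair counts only if the value
## is present in both rows.
n.pairs <- rowSums(!is.na(m1) & !is.na(m2))
names(n.pairs) <- as.character(xs$id[i1])
```

Each count is the number of observations actually entering that pair's correlation, from which the degrees of freedom follow.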