Dear R users,

I need an efficient method to compute the correlation (or at least the
Euclidean distance, if that is easier) between specific rows of a data frame
(46,232 rows, 29 columns). The pairs of rows between which I want to find the
correlation share a common value in one of the columns. For example:
 
x <- data.frame(id = rep(sample(1:100000, size = 10000), 2),
                a  = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE),
                b  = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE),
                c  = sample(c(NA, rnorm(10, 0, 1)), size = 20000, replace = TRUE))
x$id <- factor(x$id)

I want to compute the correlation between the two rows (over columns a, b, and
c) that share the same id. Using a for loop with dist() works but takes a long
time (over an hour; my machine has 1 GB of RAM):
p <- list()
for (i in levels(x$id)) {
  p[[i]] <- dist(x[x$id == i, -1])
}

Is there a more efficient way? I thought about apply()/sapply() and friends,
but I cannot see a sensible way to make them operate on pairs of rows like
this.
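
For what it is worth, the only vectorised idea I have managed to sketch is the
one below: sort the rows so the two members of each pair are adjacent, split
them into two aligned matrices, and compute all the distances in one go. It
assumes every id occurs exactly twice, and it simply drops NA pairs rather
than rescaling the way dist() does, so I am not sure it is even equivalent:

## Sort by id so the two rows of each pair sit next to each other
xs  <- x[order(x$id), ]
odd <- seq(1, nrow(xs), by = 2)      # first member of each pair
A   <- as.matrix(xs[odd,     -1])    # values of the first rows
B   <- as.matrix(xs[odd + 1, -1])    # values of the second rows

## Euclidean distance per pair; NA pairs are dropped, not rescaled as in dist()
euclid <- sqrt(rowSums((A - B)^2, na.rm = TRUE))
names(euclid) <- as.character(xs$id[odd])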
The second problem is that I also need to know how many degrees of freedom
(i.e., the number of non-missing pairs of values) were used in each
correlation. Is there an efficient way to do this as well?
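
Building on the A/B matrices from the sketch above (again assuming each id
occurs exactly twice), I imagine the pair counts and the correlations
themselves could be computed row-wise in the same vectorised fashion, though I
have not verified this against cor():

## Count the complete (non-missing) pairs used for each correlation
ok <- !is.na(A) & !is.na(B)
n  <- rowSums(ok)                     # non-missing pairs of values per id

## Mask incomplete pairs on both sides, then do a row-wise Pearson correlation
A0 <- A; A0[!ok] <- NA
B0 <- B; B0[!ok] <- NA
cA <- A0 - rowMeans(A0, na.rm = TRUE)
cB <- B0 - rowMeans(B0, na.rm = TRUE)
r  <- rowSums(cA * cB, na.rm = TRUE) /
      sqrt(rowSums(cA^2, na.rm = TRUE) * rowSums(cB^2, na.rm = TRUE))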

I hope this makes sense! Thank you all very much in advance!

Eleni
