On Tue, 30 Mar 2010, Dimitri Liakhovitski wrote:

Dear R-ers,

I have a large data frame (several thousand rows and about 2.5
thousand columns). One variable ("group") is a grouping variable with
over 30 levels, and there are a lot of NAs.
For each variable, I need to divide each value by that variable's mean
within its subgroup. I have code that does this, but it is far too slow:
it takes about 1.5 hours.
Below is a data example and my code that is too slow. Is there a
different, faster way of doing the same thing?
Thanks a lot for your advice!

Dimitri


# Building an example frame - with groups and a lot of NAs:
set.seed(1234)
frame <- data.frame(group = rep(paste("group", 1:10), 10),
                    a = rnorm(100), b = rnorm(100), c = rnorm(100), d = rnorm(100),
                    e = rnorm(100), f = rnorm(100), g = rnorm(100))


Use model.matrix and crossprod to do this in a vectorized fashion:

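mat <- as.matrix(frame[,-1])
mm <- model.matrix(~0+group,frame)         # one indicator column per group
col.grp.N <- crossprod( !is.na(mat), mm )  # non-missing counts per column and group
mat[is.na(mat)] <- 0.0                     # zeros drop out of the sums
col.grp.sum <- crossprod( mat, mm )        # sums per column and group
# Index the group means by row name ("group" + level) so this works whether
# frame$group is a factor or a character vector (the default since R 4.0.0).
mat <- mat / ( t(col.grp.sum/col.grp.N)[ paste0("group", frame$group), ] )
is.na(mat) <- is.na(frame[,-1])            # restore the NAs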


mat is now a matrix whose columns each correspond to the columns in 'frame' as you have it after do.call(...)
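
As a rough sketch of why the crossprod step gives per-group means, here is a small check on a toy data frame. The names toy, vals, ind, n, s, means, scaled and ref are made up for illustration and are not from the original post. Each column of the indicator matrix picks out one group, so a single matrix product yields all the per-group sums at once; the result is compared against ave(), which computes the group means directly.

```r
# Toy data (hypothetical): two groups, two numeric columns, some NAs.
toy <- data.frame(group = factor(c("g1", "g2", "g1", "g2", "g1", "g2")),
                  a = c(1, 2, NA, 4, 3, NA),
                  b = c(NA, 10, 20, 30, 40, NA))

vals <- as.matrix(toy[, -1])
ind  <- model.matrix(~ 0 + group, toy)   # 6 x 2 indicator matrix, one column per group

n <- crossprod(!is.na(vals), ind)        # non-missing counts, columns x groups
vals0 <- vals
vals0[is.na(vals0)] <- 0                 # zeros do not contribute to the sums
s <- crossprod(vals0, ind)               # per-group sums, columns x groups
means <- t(s / n)                        # per-group means, groups x columns

# Expand the group means back to one row per observation and divide.
# toy$group is an explicit factor here, so its integer codes index the rows.
scaled <- vals / means[as.integer(toy$group), ]

# Reference computation: ave() evaluates the function within each group.
ref <- sapply(toy[, -1], function(x)
  x / ave(x, toy$group, FUN = function(v) mean(v, na.rm = TRUE)))

print(all.equal(unname(scaled), unname(ref)))
```

The two crossprod calls replace the loop over groups and columns, which is where the speed-up on a wide data frame would be expected to come from.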


Are you sure you want to divide the values by their (possibly negative) means??
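
To illustrate the concern, a minimal example with made-up values (x and m are hypothetical names): when a group mean is negative, dividing by it flips the sign of every value and reverses their ordering.

```r
x <- c(-3, -1, 1)   # hypothetical values whose mean is negative
m <- mean(x)        # the mean is -1
scaled <- x / m     # signs flip, so the smallest value becomes the largest
print(scaled)
```

With data centred near zero, as in the rnorm() example, group means can also be very close to zero, in which case the ratios become very large.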

HTH,

Chuck



frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
      i.for.NA<-sample(1:100,60)
      frame[[i]][i.for.NA]<-NA
}
frame

### Code that does what's needed but is too slow:
Start <- Sys.time()
frame <- do.call(cbind, lapply(names.used, function(x){
  # within each group, divide column x by its group mean (ignoring NAs)
  unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x], na.rm=TRUE)))
}))
Finish <- Sys.time()
print(Finish-Start) # Takes too long

--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
