[R] Tricky vectorization problem
Hi all, I'm using the code below within a loop that I run thousands of times and even with the super-computing resources at my disposal this is just too slow. The snippet below takes about 10s on my machines, which is an order of magnitude or two slower than would be preferable; in the end I'd like to set the number of monte carlo experiments to 1e4 or even 1e5 to ensure stable results. I've tried my hand at vectorizing it myself but I'm finding it tricky. Any help would be greatly appreciated! This code attempts to find the distribution of observed correlation values between A & B when each are measured imperfectly: start = proc.time()[1] #start a timer rho = .5 #set the true correllation Ns = 100 #set the number of Subjects a.No = 1:100 #for each subject, set a number of observations in A (1:100 chosen arbitrarily for demonstration) b.No = 1:100 #for each subject, set a number of observations in B v.a = 1 #set the between Ss variance in condition A v.b = 2 #set the between Ss variance in condition B s.a.w = 1:100 #for each subject, set a standard deviation in A (1:100 chosen arbitrarily for demonstration) s.b.w = 1:100 #for each subject, set a standard deviation in B mce = 1e3 #set the number of monte carlo experiments sim.r = vector(length=mce) #set up a collector for the simulated correlations Sigma=matrix(c(v.a,sqrt(v.a*v.b)*rho,sqrt(v.a*v.b)*rho,v.b),2,2) #define a covariance matrix for use in generating the correlated data a = vector(length=Ns) #set up a collector for subject means in A b = vector(length=Ns) #set up a collector for subject means in B for(i in 1:mce){ #start a monte carlo experiment sub.means=mvrnorm(Ns,rep(0,2),Sigma) #generate correlated ideal means for for each subject in each condition for(s in 1:Ns){ #for each subject a[s] = mean(rnorm(a.No[s],sub.means[s,1],s.a.w[s])) #generate some data for A and grab the mean b[s] = mean(rnorm(b.No[s],sub.means[s,2],s.b.w[s])) #generate some data for B and grab the mean } sim.r[i] = cor(a,b) #store the observed correlation between A and B } print(proc.time()[1]-start) #show the total time this took -- Mike Lawrence Graduate Student, Department of Psychology, Dalhousie University Website: http://memetic.ca Public calendar: http://icalx.com/public/informavore/Public "The road to wisdom? Well, it's plain and simple to express: Err and err and err again, but less and less and less." - Piet Hein __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tricky vectorization problem
Seems there were some line break issues when pasting the code, trying again with a different commenting style: #start a timer start = proc.time()[1] #set the true correllation rho = .5 #set the number of Subjects Ns = 100 #for each subject, set a number of observations in A a.No = 1:100 #for each subject, set a number of observations in B b.No = 1:100 #set the between Ss variance in condition A v.a = 1 #set the between Ss variance in condition B v.b = 2 #for each subject, set a standard deviation in A s.a.w = 1:100 #for each subject, set a standard deviation in B s.b.w = 1:100 #set the number of monte carlo experiments mce = 1e3 #set up a collector for the simulated correlations sim.r = vector(length=mce) #define a covariance matrix for use in generating the correlated data Sigma=matrix(c(v.a,sqrt(v.a*v.b)*rho,sqrt(v.a*v.b)*rho,v.b),2,2) #set up a collector for subject means in A a = vector(length=Ns) #set up a collector for subject means in B b = vector(length=Ns) #start a monte carlo experiment for(i in 1:mce){ #generate correlated ideal means for for each subject in each condition sub.means=mvrnorm(Ns,rep(0,2),Sigma) #for each subject for(s in 1:Ns){ #generate some data for A and grab the mean a[s] = mean(rnorm(a.No[s],sub.means[s,1],s.a.w[s])) #generate some data for B and grab the mean b[s] = mean(rnorm(b.No[s],sub.means[s,2],s.b.w[s])) } #store the observed correlation between A and B sim.r[i] = cor(a,b) } #show the total time this took print(proc.time()[1]-start) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tricky vectorization problem
a small improvement is to avoid computing the spectral decomposition of `Sigma' (look at code of mvrnorm()) in each iteration, e.g., #set up a collector for subject means in A a <- numeric(Ns) #set up a collector for subject means in B b <- numeric(Ns) eS <- eigen(Sigma, symmetric = TRUE, EISPACK = TRUE) ev <- eS$values fact <- eS$vectors %*% diag(sqrt(pmax(ev, 0)), 2) #start a monte carlo experiment for (i in 1:mce) { #generate correlated ideal means for each subject in each condition X <- rnorm(2 * Ns) dim(X) <- c(2, Ns) sub.means <- t(fact %*% X) #for each subject for (s in 1:Ns) { #generate some data for A and grab the mean a[s] <- mean(rnorm(a.No[s], sub.means[s, 1], s.a.w[s])) #generate some data for B and grab the mean b[s] <- mean(rnorm(b.No[s], sub.means[s, 2], s.b.w[s])) } #store the observed correlation between A and B sim.r[i] <- cor(a, b) } I hope it helps. Best, Dimitris Dimitris Rizopoulos Ph.D. Student Biostatistical Centre School of Public Health Catholic University of Leuven Address: Kapucijnenvoer 35, Leuven, Belgium Tel: +32/(0)16/336899 Fax: +32/(0)16/337015 Web: http://med.kuleuven.be/biostat/ http://www.student.kuleuven.be/~m0390867/dimitris.htm Quoting Mike Lawrence <[EMAIL PROTECTED]>: > Seems there were some line break issues when pasting the code, trying > again with a different commenting style: > > #start a timer > start = proc.time()[1] > > #set the true correllation > rho = .5 > > #set the number of Subjects > Ns = 100 > > #for each subject, set a number of observations in A > a.No = 1:100 > #for each subject, set a number of observations in B > b.No = 1:100 > > #set the between Ss variance in condition A > v.a = 1 > #set the between Ss variance in condition B > v.b = 2 > > #for each subject, set a standard deviation in A > s.a.w = 1:100 > #for each subject, set a standard deviation in B > s.b.w = 1:100 > > #set the number of monte carlo experiments > mce = 1e3 > > #set up a collector for the simulated correlations > sim.r = vector(length=mce) > > #define a covariance matrix for use in generating the correlated data > Sigma=matrix(c(v.a,sqrt(v.a*v.b)*rho,sqrt(v.a*v.b)*rho,v.b),2,2) > > #set up a collector for subject means in A > a = vector(length=Ns) > #set up a collector for subject means in B > b = vector(length=Ns) > > #start a monte carlo experiment > for(i in 1:mce){ > > #generate correlated ideal means for for each subject in each condition > sub.means=mvrnorm(Ns,rep(0,2),Sigma) > > #for each subject > for(s in 1:Ns){ > > #generate some data for A and grab the mean > a[s] = mean(rnorm(a.No[s],sub.means[s,1],s.a.w[s])) > #generate some data for B and grab the mean > b[s] = mean(rnorm(b.No[s],sub.means[s,2],s.b.w[s])) > > } > > #store the observed correlation between A and B > sim.r[i] = cor(a,b) > > } > > #show the total time this took > print(proc.time()[1]-start) > > __ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tricky vectorization problem
Had a realization this morning that I think achieves ceiling performance and thought I'd share with the list. I was previously generating sample data per participant and then calculating a mean, but of course this could be simplified by simply getting a single value from the sampling distribution of the mean for that participant. This speeds things up immensely (from 10.5 seconds to .5 seconds) and should be sufficient for my needs, but I'd still welcome further improvement suggestions. #start a timer start = proc.time()[1] #set the number of monte carlo experiments mce = 1e3 #set the true correllation rho = .5 #set the number of Subjects Ns = 100 #for each subject, set a number of observations in A a.No = 1:100 #for each subject, set a number of observations in B b.No = 1:100 #set the between Ss variance in condition A v.a = 1 #set the between Ss variance in condition B v.b = 2 #for each subject, set a standard deviation in A s.a.w = 1:100 #for each subject, set a standard deviation in B s.b.w = 1:100 #set up a collector for the simulated correlations sim.r = vector(length=mce) #define a covariance matrix for use in generating the correlated data Sigma=matrix(c(v.a,sqrt(v.a*v.b)*rho,sqrt(v.a*v.b)*rho,v.b),2,2) eS <- eigen(Sigma, symmetric = TRUE, EISPACK = TRUE) ev <- eS$values fact <- eS$vectors %*% diag(sqrt(pmax(ev, 0)), 2) #set up a collector for subject means in A a = vector(length=Ns) #set up a collector for subject means in B b = vector(length=Ns) #generate correlated ideal means for for each subject in each condition X <- rnorm(2 * Ns * mce) dim(X) <- c(2, Ns * mce) sub.means <- t(fact %*% X) #generate observed means for each Subject in A a=sub.means[,1]+rnorm(Ns*mce,0,s.a.w/sqrt(a.No)) #generate observed means for each Subject in B b=sub.means[,2]+rnorm(Ns*mce,0,s.b.w/sqrt(b.No)) #Get the observed correlation for each monte carlo experiment for(i in 1:mce){ sim.r[i] = cor( a[(i*Ns-Ns+1):(i*Ns)] ,b[(i*Ns-Ns+1):(i*Ns)] ) } #show the total time this took print(proc.time()[1]-start) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tricky vectorization problem
Posting this for posterity and to demonstrate that a speed-up of 2 orders of magnitude is indeed possible! I can now run 1e5 monte carlo experiments in the time the old code could only run 1e3. The final change was to replace my use of cor() with .Internal(cor()), which avoids some checks that are unnecessary in this case: #start a timer start = proc.time()[1] #set the number of monte carlo experiments mce = 1e5 #set the true correllation rho = 1 #set the number of Subjects Ns = 100 #for each subject, set a number of observations in A a.No = 1:100 #for each subject, set a number of observations in B b.No = 1:100 #set the between Ss variance in condition A v.a = 1 #set the between Ss variance in condition B v.b = 2 #for each subject, set a standard deviation in A s.a.w = 1:100 #for each subject, set a standard deviation in B s.b.w = 1:100 #set up a collector for the simulated correlations sim.r = vector(length=mce) #define a covariance matrix for use in generating the correlated data Sigma=matrix(c(v.a,sqrt(v.a*v.b)*rho,sqrt(v.a*v.b)*rho,v.b),2,2) eS <- eigen(Sigma, symmetric = TRUE, EISPACK = TRUE) ev <- eS$values fact <- eS$vectors %*% diag(sqrt(pmax(ev, 0)), 2) #set up a collector for subject means in A a = vector(length=Ns) #set up a collector for subject means in B b = vector(length=Ns) #generate correlated ideal means for for each subject in each condition X <- rnorm(2 * Ns * mce) dim(X) <- c(2, Ns * mce) sub.means <- t(fact %*% X) a=sub.means[,1]+rnorm(Ns*mce,0,s.a.w/sqrt(a.No)) b=sub.means[,2]+rnorm(Ns*mce,0,s.b.w/sqrt(b.No)) for(i in 1:mce){ end=i*Ns sim.r[i] = .Internal( cor( a[(end-Ns+1):end] ,b[(end-Ns+1):end] ,TRUE ,FALSE ) ) } #show the total time this took print(proc.time()[1]-start) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.