Re: [R] Speed up studentized confidence intervals ?
Dear John, Dear Rui,

I really thank you a lot for your R code.

Best,
SV
Re: [R] Speed up studentized confidence intervals ?
Dear varin sacha,

You didn't correctly adapt the code to the median. The outer call to mean() in the last line shouldn't be replaced with median() -- it computes the proportion of intervals that include the population median. As well, you can't rely on the asymptotics of the bootstrap for a nonlinear statistic like the median with an n as small as 5, as your example, properly implemented (and with the code slightly cleaned up), illustrates:

> library(boot)
> set.seed(123)
> s <- rgamma(n=10, shape=2, rate=5)
> (m <- median(s))
[1] 0.3364465
> N <- 1000
> n <- 5
> set.seed(321)
> out <- replicate(N, {
+   dat <- data.frame(sample(s, size=n))
+   med <- function(d, i) {
+     median(d[i, ])
+   }
+   boot.out <- boot(data = dat, statistic = med, R = 1000)
+   boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> # coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.758

You do get the expected coverage, however, for a larger sample, here with n = 100:

> N <- 1000
> n <- 100
> set.seed(321)
> out <- replicate(N, {
+   dat <- data.frame(sample(s, size=n))
+   med <- function(d, i) {
+     median(d[i, ])
+   }
+   boot.out <- boot(data = dat, statistic = med, R = 1000)
+   boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> # coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.952

I hope this helps,
John

--
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http://socserv.mcmaster.ca/jfox/
Re: [R] Speed up studentized confidence intervals ?
On 12/29/21 11:08 AM, varin sacha via R-help wrote:

Seems to me that doing a bootstrap within a `replicate` call is not needed. (Use one or the other as a mechanism for replication.) Here's what I would consider to be a "bootstrap" operation for estimating a 95% CI on the Gamma-distributed population you created. (I used a sample size of 5 rather than 10.)

> quantile( replicate( 1000, {median(sample(s,5))}), .5+c(-0.475,0.475))
     2.5%     97.5%
0.1343071 0.6848352

This is using boot::boot to calculate medians of samples of size 5:

> med <- function( data, indices) {
+   d <- data[indices[1:5]]   # allows boot to select sample
+   return( median(d))
+ }
> res <- boot(data=s, med, 1000)
> str(res)
List of 11
 $ t0       : num 0.275
 $ t        : num [1:1000, 1] 0.501 0.152 0.222 0.11 0.444 ...
 $ R        : num 1000
 $ data     : num [1:10] 0.7304 0.4062 0.1901 0.0275 0.2748 ...
 $ seed     : int [1:626] 10403 431 -118115842 -603122380 -2026881868 758139796 1148648893 -1161368223 1814605964 -1456558535 ...
 $ statistic: function (data, indices)
  ..- attr(*, "srcref")= 'srcref' int [1:8] 1 8 4 1 8 1 1 4
  .. ..- attr(*, "srcfile")= Classes 'srcfilecopy', 'srcfile'
 $ sim      : chr "ordinary"
 $ call     : language boot(data = s, statistic = med, R = 1000)
 $ stype    : chr "i"
 $ strata   : num [1:10] 1 1 1 1 1 1 1 1 1 1
 $ weights  : num [1:10] 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04
 - attr(*, "class")= chr "boot"
 - attr(*, "boot_type")= chr "boot"
> quantile( res$t , .5+c(-0.475,0.475))
     2.5%     97.5%
0.1283309 0.6821874
Re: [R] Speed up studentized confidence intervals ?
Dear David, Dear Rui,

Many thanks for your response. It perfectly works for the mean. Now I have a problem with my R code for the median, because I always get 1 (100%) coverage probability, which is more than very strange. Indeed, considering that an interval whose lower limit is the smallest value in the sample and whose upper limit is the largest value has 1/32 + 1/32 = 1/16 probability of non-coverage, implying that the confidence of such an interval is 15/16 rather than 1 (100%), I suspect that the confidence interval I use for the median is not correctly defined for n=5 observations, and likely contains all observations in the sample? What is wrong with my R code?

library(boot)

s = rgamma(n=10, shape=2, rate=5)
median(s)

N <- 100
out <- replicate(N, {
  a <- sample(s, size=5)
  median(a)
  dat <- data.frame(a)
  med <- function(d, i) {
    temp <- d[i, ]
    median(temp)
  }
  boot.out <- boot(data = dat, statistic = med, R = 1000)
  boot.ci(boot.out, type = "bca")$bca[, 4:5]
})
# coverage probability
median(out[1, ] < median(s) & median(s) < out[2, ])
Re: [R] Speed up studentized confidence intervals ?
Hello,

The code is running very slowly because you are recreating the function in the replicate() loop, and because you are creating a data.frame, also in the loop. And because in the bootstrap statistic function med() you are computing the variance of yet another loop. This is probably statistically wrong, but like David says, without a problem description it's hard to say. Also, why compute variances if they are never used?

Here is complete code executing in much less than 2:00 hours. Note that it passes the vector a directly to med(), not a data.frame with just one column.

library(boot)

set.seed(2021)
s <- sample(178:798, 10, replace = TRUE)
mean(s)

med <- function(d, i) {
  temp <- d[i]
  f <- mean(temp)
  g <- var(temp)
  c(Mean = f, Var = g)
}

N <- 1000
out <- replicate(N, {
  a <- sample(s, size = 5)
  boot.out <- boot(data = a, statistic = med, R = 1000)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])
#[1] 0.952

Hope this helps,
Rui Barradas
Re: [R] Speed up studentized confidence intervals ?
I'm wondering if this is an X-Y problem (a request to do X when the real problem should be doing Y). You haven't explained the goals in natural or mathematical language, which leaves me wondering why you are doing either sampling or replication (much less doing both within each iteration in the function given to boot).

--
David
Sent from my iPhone
[R] Speed up studentized confidence intervals ?
Dear R-experts,

Here below is my R code. It works, but really, really slowly! I need 2 hours with my computer to finally get an answer! Is there a way to improve my R code to speed it up? At least to win 1 hour ;=)

Many thanks

library(boot)

s <- sample(178:798, 10, replace=TRUE)
mean(s)

N <- 1000
out <- replicate(N, {
  a <- sample(s, size=5)
  mean(a)
  dat <- data.frame(a)

  med <- function(d, i) {
    temp <- d[i, ]
    f <- mean(temp)
    g <- var(replicate(50, mean(sample(temp, replace=TRUE))))
    return(c(f, g))
  }

  boot.out <- boot(data = dat, statistic = med, R = 1000)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1,] < mean(s) & mean(s) < out[2,])
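[Editorial note: for readers reconstructing this thread, here is a minimal, self-contained sketch of the coverage simulation being attempted -- repeatedly draw a subsample, build a studentized bootstrap interval for the mean, and count how often it covers mean(s). Using var(xi)/length(xi) as the estimated variance of the mean, R = 999 replicates, and a small N are my assumptions for the sketch, not the original poster's choices.]

library(boot)

set.seed(1)
s <- sample(178:798, 10, replace = TRUE)

# statistic for type = "stud": the estimate and its estimated variance
mean_and_var <- function(x, i) {
  xi <- x[i]
  c(mean(xi), var(xi) / length(xi))
}

N <- 200   # simulation replicates, kept small so the sketch runs in seconds
out <- replicate(N, {
  a <- sample(s, size = 5)
  b <- boot(a, mean_and_var, R = 999)
  # the thread's $stud works via partial matching of the $student component
  boot.ci(b, type = "stud")$student[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])   # empirical coverage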
Re: [R] work on R speed?
Your question seems like an information-free zone. "Quick" is an opinion unless you set the boundaries of your question much more precisely. The Posting Guide strongly recommends providing a reproducible example of what you want to discuss. In this case I would suggest that you use the microbenchmark package to quantify "quick" or "not quick".

In my experience, the most significant factors affecting speed are algorithms and features. You may be comparing a general-purpose complete-analysis function in one environment with a specific part of that analysis in another environment.

--
Sent from my phone. Please excuse my brevity.
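[Editorial note: to make the suggestion concrete, a minimal sketch of how microbenchmark quantifies a comparison. The two expressions are placeholders, not anything from the original question.]

library(microbenchmark)

x <- rnorm(1e6)
# replace these expressions with the operations you actually want to compare
microbenchmark(
  loop       = { s <- 0; for (xi in x) s <- s + xi; s },
  vectorized = sum(x),
  times = 20
)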
[R] work on R speed?
I tested Microsoft's linear algebra etc. "racer"; it works well. Also, simmer seems to be very quick. What about other developments in making R quick?

regards,
Kai
Re: [R] Speed of RCppEigen Cholesky decomposition on sparse matrix
I believe you have the wrong list. (Read the Posting Guide... you seem to have R under control.) Try Rcpp-devel.

FWIW, you probably need to spend some time with a C++ profiler... any language can be unintentionally mis-used, and you first need to identify whether your calling code is inefficiently handling memory or invoking setup code repetitively before blaming BLAS. A reproducible example will probably help when you ask at Rcpp-devel.

--
Jeff Newmiller
Sent from my phone. Please excuse my brevity.
[R] Speed of RCppEigen Cholesky decomposition on sparse matrix
I am developing a statistical model and I have a prototype working in R code. I make extensive use of sparse matrices, so the R code is pretty fast, but I hoped that using RcppEigen to evaluate the log-likelihood function could avoid a lot of memory copying and be substantially faster. However, in a simple example I am seeing that RcppEigen is 3-5x slower than standard R code for Cholesky decomposition of a sparse matrix. This is the case on R 3.5.1 using RcppEigen_0.3.3.4.0 on both OS X and CentOS 6.9.

Since this simple operation is so much slower, it doesn't seem like using RcppEigen is worth it in this case. Is this an issue with BLAS, some libraries or compiler options, or is R code really the fastest option?

Here is my example:

library(Matrix)
library(inline)

# construct sparse matrix
#
# construct a matrix C that is N x N with S total entries
N = 1000
S = 100
i = sample(1:1000, S, replace=TRUE)
j = sample(1:1000, S, replace=TRUE)
idx = i >= j
values = runif(S, 0, .3)
X = sparseMatrix(i=i, j=j, x = values, symmetric=FALSE )

C = as(crossprod(X), "dgCMatrix")

# check sparsity fraction
S / N^2

# define RcppEigen code
CholeskyCppSparse <- '
using Rcpp::as;
using Eigen::Map;
using Eigen::SparseMatrix;
using Eigen::MappedSparseMatrix;
using Eigen::SimplicialLLT;

// get data into RcppEigen
const MappedSparseMatrix<double> Sigma(as<MappedSparseMatrix<double> >(Sigma_in));

// compute Cholesky
typedef SimplicialLLT<SparseMatrix<double> > SpChol;
const SpChol Ch(Sigma);
'

CholSparse <- cxxfunction(signature(Sigma_in = "dgCMatrix"),
                          CholeskyCppSparse, plugin = "RcppEigen")

# compare times
system.time(replicate(10, chol( C )))
# output:
#   user  system elapsed
#  0.341   0.014   0.355

system.time(replicate(10, CholSparse( C )))
# output:
#   user  system elapsed
#  1.639   0.046   1.687

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS 10.14

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices datasets utils methods base

other attached packages:
[1] inline_0.3.15 Matrix_1.2-15

loaded via a namespace (and not attached):
[1] compiler_3.5.1      RcppEigen_0.3.3.4.0 Rcpp_1.0.0
[4] grid_3.5.1          lattice_0.20-38

Changing the size of the matrix and the number of entries does not change the relative times.

Thanks,
- Gabriel
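[Editorial note, not from the thread: chol() on a sparse Matrix object is not interpreted R code -- the Matrix package appears to dispatch to compiled CHOLMOD routines for sparse Cholesky factorizations, so the benchmark above is effectively CHOLMOD vs. Eigen's SimplicialLLT, not R vs. C++. The dispatch can be inspected directly:]

library(Matrix)
# list the methods registered for the chol() generic; the sparse-matrix
# methods are backed by compiled CHOLMOD code shipped with Matrix
showMethods("chol")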
[R] speed issue in simulating a stochastic process
I wish to simulate the following stochastic process, for i = 1...N individuals and t = 1...T periods:

y_{i,t} = y_0 + lambda * Ey_{t-1} + epsilon_{i,t}

where Ey_{t-1} is the average of y over the N individuals computed at time t-1. My solution (below) works but is incredibly slow. Is there a faster but still clear and readable alternative?

Thanks a lot.
Matteo

rm(list=ls())
library(plyr)

y0 = 0
lambda = 0.1
N = 20
T = 100
m_e = 0
sd_e = 1

# construct the data frame and initialize y
D = data.frame(
  id = rep(1:N, T),
  t = rep(1:T, each = N),
  y = rep(y0, N*T)
)

# update y
for(t in 2:T){
  ybar.L1 = mean(D[D$t==t-1, "y"])
  for(i in 1:N){
    epsilon = rnorm(1, mean=m_e, sd=sd_e)
    D[D$id==i & D$t==t, ]$y = lambda*y0 + (1-lambda)*ybar.L1 + epsilon
  }
}

ybar <- ddply(D, ~t, summarise, mean=mean(y))
plot(ybar, col = "blue", type = "l")
Re: [R] speed issue in simulating a stochastic process
Matteo,

I tried your example code using R 3.1.1 on an iMac (24-inch, Early 2009), 3.06 GHz Intel Core 2 Duo, 8 GB 1333 MHz DDR3, NVIDIA GeForce GT 130 512 MB, running Mac OS X 10.10 (Yosemite). After entering your code, the elapsed time from the time I hit return to when the graphics appeared was about 2 seconds -- is this about what you are seeing?

Regards,
Tom
Re: [R] speed issue in simulating a stochastic process
Matteo,

Ah -- OK, N=20, I did not catch that. You have nested for loops, which R is known to be exceedingly slow at handling -- if you can reorganize the code to eliminate the loops, your performance will increase significantly.

Tom
Re: [R] speed issue in simulating a stochastic process
I find that representing the simulated data as a T-row by N-column matrix allows for a clearer and faster simulation function. E.g., compare the output of the following two functions, the first of which uses your code and the second a matrix representation (which I convert to a data.frame at the end so I can compare outputs easily). I timed both of them for T=10^3 times and N=50 individuals; both gave the same results and f1 was about 10,000 times faster than f0:

> set.seed(1); t0 <- system.time(s0 <- f0(N=50, T=1000))
> set.seed(1); t1 <- system.time(s1 <- f1(N=50, T=1000))
> rbind(t0, t1)
   user.self sys.self elapsed user.child sys.child
t0    436.87     0.11  438.48         NA        NA
t1      0.04     0.00    0.04         NA        NA
> all.equal(s0, s1)
[1] TRUE

The functions are:

f0 <- function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0) {
  # construct the data frame and initialize y
  D <- data.frame(
    id = rep(1:N, T),
    t = rep(1:T, each = N),
    y = rep(y0, N*T)
  )
  # update y
  for(t in 2:T){
    ybar.L1 = mean(D[D$t==t-1, "y"])
    for(i in 1:N){
      epsilon = rnorm(1, mean=m_e, sd=sd_e)
      D[D$id==i & D$t==t, ]$y = lambda*y0 + (1-lambda)*ybar.L1 + epsilon
    }
  }
  D
}

f1 <- function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0) {
  # same process simulated using a matrix representation
  # The T rows are times, the N columns are individuals
  M <- matrix(y0, nrow=T, ncol=N)
  if (T > 1) for(t in 2:T) {
    ybar.L1 <- mean(M[t-1L, ])
    epsilon <- rnorm(N, mean=m_e, sd=sd_e)
    M[t, ] <- lambda * y0 + (1-lambda)*ybar.L1 + epsilon
  }
  # convert to the data.frame representation that f0 uses
  tM <- t(M)
  data.frame(id = as.vector(row(tM)),
             t = as.vector(col(tM)),
             y = as.vector(tM))
}

Bill Dunlap
TIBCO Software
wdunlap tibco.com
Re: [R] speed issue in simulating a stochastic process
Loops are not slow, but your code did a lot of unneeded operations in each loop. E.g., you computed D$id==i & D$t==t for each row of D. That involves 2*nrow(D) equality tests for each of the nrow(D) rows, i.e., it is quadratic in N*T. Then you did a data.frame replacement operation, D[k,]$y <- newValue, where k is D$id==i & D$t==t. This extracts the k'th row of D, then extracts the 1-row 'y' column from it, replaces it with the new value, then puts that row back into D. If you must use a data.frame, the equivalent D$y[k] <- newValue is probably much faster (data.frames are lists of columns, so replacing a column is fast).

Using a matrix to organize things is less flexible, but faster because you don't have to search when you want to find the element for a given id and time -- you just do a little arithmetic to get the offset from the start of the matrix.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Nov 6, 2014 at 2:05 PM, Matteo Richiardi matteo.richia...@gmail.com wrote:

Hi William, that's super. Thanks a lot. I knew that R is slow with loops, but did not imagine so slow! By the way, what's the reason? Final question: in your code you have mean(M[t-1L,]): what is the 'L' for? I removed it and apparently the code produces the same output...

Thanks again,
Matteo
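[Editorial note: a small sketch illustrating the cost difference Bill describes. The sizes are arbitrary and the absolute timings will vary by machine.]

# Row-scan replacement vs. arithmetic indexing into a column.
N <- 20; T <- 100
D <- data.frame(id = rep(1:N, T), t = rep(1:T, each = N), y = 0)

system.time(
  for (t in 2:T) for (i in 1:N)
    D[D$id == i & D$t == t, ]$y <- 1   # scans all N*T rows on every assignment
)

D$y <- 0
system.time(
  for (t in 2:T) {
    k <- (t - 1L) * N + 1:N            # rows for time t, found by arithmetic
    D$y[k] <- 1                        # replace elements of one column
  }
)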
Re: [R] speed issue in simulating a stochastic process
<SNIP>
On Thu, Nov 6, 2014 at 2:05 PM, Matteo Richiardi matteo.richia...@gmail.com wrote:
<SNIP>
Final question: in your code you have mean(M[t-1L,]): what is the 'L' for? I removed it and apparently the code produces the same output...
<SNIP>

The constant 1L is stored as an integer; the constant 1 is stored as double precision. This sometimes makes no difference and sometimes makes a huge difference (especially in the context of numerical comparisons). If something is supposed to be an integer, it is safer to use the L form. See ?NumericConstants.

cheers,
Rolf Turner

--
Rolf Turner
Technical Editor ANZJS
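[Editorial note: a few lines making the integer/double distinction visible; my illustration, not from the thread.]

typeof(1L)         # "integer"
typeof(1)          # "double"
identical(1L, 1)   # FALSE: different storage modes
1L == 1            # TRUE: values compare equal after coercion
.Machine$integer.max + 1L   # NA, with an integer-overflow warning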
[R] R speed test - for processor and for RAM size
Hello! I am sorry if my question sounds naive; it's because I am not a computer scientist. I understand that two factors impact a PC's speed: the processor and (indirectly) the RAM size. I would like to run a speed test in R (under Windows). I found lots of different code snippets testing speed. However, I'd like to get some hints or find code that (loosely) has:

(a) Aspect A that makes the code run faster if the processing speed of the PC is higher.
(b) Aspect B that makes the code run faster if your PC's RAM is larger.

Even if the 2 Aspects are not 100% independent, it would still be OK. I am just trying to isolate the impact of those 2 things (processor and RAM) on the speed of this code. This way our IT people could change some parameter in the code that impacts Aspect A or another parameter that impacts Aspect B.

Thanks a lot for your hints!

--
Dimitri Liakhovitski
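[Editorial note: a hedged sketch of what such a two-part test could look like. The matrix size and the 'gb' setting are placeholders to be tuned by whoever runs it, and the two aspects are only loosely separated, as the poster anticipates.]

# Aspect A (processor-bound): dense matrix multiply; the data fit
# comfortably in RAM, so the time mostly tracks CPU (and BLAS) speed.
n <- 1500
A <- matrix(rnorm(n^2), n)
system.time(A %*% A)

# Aspect B (RAM-bound): allocate and touch a vector sized near physical
# RAM; elapsed time degrades sharply once the OS starts swapping.
gb <- 2                          # adjust to just below / above the RAM size
x <- numeric(gb * 1024^3 / 8)    # one double occupies 8 bytes
system.time(x[] <- 1)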
Re: [R] speed issue: gsub on large data frame
Good idea! I'm trying your approach right now, but I am wondering if using str_split (package 'stringr') or strsplit is the right way to go in terms of speed? I ran str_split over the text column of the data frame and it has been processing for 2 hours now..? I did:

splittedStrings <- str_split(dataframe$text, " ")

The $text column already contains cleaned text, so no double blanks etc. or unnecessary symbols. Just full words.
Re: [R] speed issue: gsub on large data frame
I'll answer myself: using strsplit with fixed=TRUE took about 2 minutes!
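[Editorial note: a self-contained sketch of the comparison being reported; the sample text is made up.]

x <- rep("GOOGL announced new partnership today", 1e5)

system.time(r1 <- strsplit(x, " "))                # regex engine
system.time(r2 <- strsplit(x, " ", fixed = TRUE))  # literal matching, faster
identical(r1, r2)                                  # TRUE for a plain-space split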
Re: [R] speed issue: gsub on large data frame
If you could, please identify which responder's idea you used, as well as the strsplit-related code you ended up with. That may help someone who browses the mail archives in the future.

Carl
Re: [R] speed issue: gsub on large data frame
It is not reproducible [1] because I cannot run your (representative) example. The type of regex pattern, token, and even the character of the data you are searching can affect possible optimizations.

Note that a non-memory-resident tool such as sed or perl may be an appropriate tool for a problem like this.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

--
Jeff Newmiller
Sent from my phone. Please excuse my brevity.
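[Editorial note: a sketch of the streaming approach Jeff mentions, driven from R via system(). The file name and the URL pattern are assumptions for illustration only.]

# Replace URLs with a [url] token in one pass over the file on disk,
# without loading the 4 million rows into R.
system("sed -E 's#https?://[^ ]+|www\\.[^ ]+#[url]#g' tweets.txt > tweets_clean.txt")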
Re: [R] speed issue: gsub on large data frame
How's that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row ( gsub(patternvector, "[token]", dataframe$text_column) )
4. General question: how to speed up string operations on 'large' data sets?

Please let me know what more information you need in order to reproduce this example. It's more a general type of question, while I think the description above gives you a specific picture of what I'm doing right now.

On 05.11.2013 at 06:59, Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:

> Example not reproducible. Communication fail. Please refer to Posting Guide.

Simon Pickert simon.pick...@t-online.de wrote:

> Hi R'lers,
>
> I'm running into speeding issues, performing a bunch of gsub(patternvector, "[token]", dataframe$text_column) on a data frame containing 4 million entries. (The "patternvectors" contain up to 500 elements.)
>
> Is there any better/faster way than performing like 20 gsub commands in a row?
>
> Thanks!
> Simon
Re: [R] speed issue: gsub on large data frame
What is missing is any idea of what the 'patterns' are that you are searching for. Regular expressions are very sensitive to how you specify the pattern. You indicated that you have up to 500 elements in the pattern, so what does it look like? Alternation and backtracking can be very expensive, so a lot more specificity is required. There are whole books written on how pattern matching works and what is hard and what is easy. This is true wherever regular expressions are used, not just in R.

Also, some idea of what the timing is: are you talking about 1-10-100 seconds/minutes/hours?

Sent from my iPad
Re: [R] speed issue: gsub on large data frame
But note too what the help says:

     Performance considerations:

     If you are doing a lot of regular expression matching, including
     on very long strings, you will want to consider the options used.
     Generally PCRE will be faster than the default regular expression
     engine, and 'fixed = TRUE' faster still (especially when each
     pattern is matched only a few times).

(and there is more). I don't see perl=TRUE here.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
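[Editorial note: a small sketch of the options the help page is pointing at; pattern and data are invented.]

x <- rep("GOOGL announced new partnership www.url.com today", 1e5)

system.time(gsub("www\\.[a-z.]+", "[url]", x))               # default (TRE) engine
system.time(gsub("www\\.[a-z.]+", "[url]", x, perl = TRUE))  # PCRE, often faster
# fixed = TRUE applies when the pattern is a literal string:
system.time(gsub("GOOGL", "[ticker]", x, fixed = TRUE))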
Re: [R] speed issue: gsub on large data frame
Thanks everybody! Now I understand the need for more details. The patterns for the gsubs are of different kinds. First, I have character strings I need to replace: around 5000 stock ticker symbols (e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors. Second, I have four vectors with regular expressions, all similar to this one:

replace_url <- c("https?://.*\\s|www.*\\s")

The text strings I perform the gsub commands on look like this (no string is longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

gsub(replace_url, "[url]", dataframe$text_column)
gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column)

etc., this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up [positive_percentage]'

The dataset contains 4 million entries. The code works, but I cancelled the process after 1 day (my whole system was blocked while R was running). Performing the code on a smaller chunk of data (1 million) took about 12 hrs. As far as I can say, replacing the ticker symbols takes the longest, while the regular expressions went quite fast.

Thanks!
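Since the slow part is replacing literal ticker strings, Ripley's fixed = TRUE advice applies directly. A minimal sketch (made-up data, not code from the thread; timings will vary by machine) of how the engine options affect a job like this:

x <- rep("GOOGL announced new partnership www.url.com. Stock price is up +5%", 1e5)
system.time(gsub("https?://\\S+|www\\.\\S+", "[url]", x))               # default engine
system.time(gsub("https?://\\S+|www\\.\\S+", "[url]", x, perl = TRUE))  # PCRE, usually faster
system.time(gsub("GOOGL", "[sp500_ticker]", x, fixed = TRUE))           # fixed string, no regex engine at all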
Re: [R] speed issue: gsub on large data frame
My feeling is that the *result* you want is far more easily achievable via a substitution table or a hash table. Someone better versed in those areas may want to chime in. I'm thinking more or less of splitting your character strings into vectors (separate elements at whitespace) and chunking away. Something like

charvec[charvec == dataframe$text_column[k]] <- dataframe$replace_column[k]
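A sketch of that lookup-table idea (the table and helper below are made up for illustration, not from the thread): split each string at whitespace, replace the tokens found in a named vector, and paste back together.

lookup <- setNames(rep("[sp500_ticker]", 3), c("AAPL", "EBAY", "GOOGL"))  # hypothetical table
replace_tokens <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]  # separate elements at whitespace
  hit <- words %in% names(lookup)
  words[hit] <- lookup[words[hit]]              # one vectorized lookup instead of thousands of gsub passes
  paste(words, collapse = " ")
}
replace_tokens("GOOGL announced new partnership")
# [1] "[sp500_ticker] announced new partnership"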
[R] speed issue: gsub on large data frame
Hi R'lers, I'm running into speed issues, performing a bunch of gsub(patternvector, "[token]", dataframe$text_column) calls on a data frame containing 4 million entries. (The "patternvectors" contain up to 500 elements.) Is there any better/faster way than performing like 20 gsub commands in a row? Thanks! Simon
Re: [R] speed issue: gsub on large data frame
Example not reproducible. Communication fail. Please refer to Posting Guide.
---
Jeff Newmiller, jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
Sent from my phone. Please excuse my brevity.
Re: [R] speed of makeCluster (package parallel)
Thanks Brian, I thought that forking clusters was better ... but as you mentioned, it is not available on Windows. Unfortunately, you do not always get to choose the OS used by your company! Arnaud
[R] speed of makeCluster (package parallel)
Hi all, I am quite new to the world of parallelization and I wonder if there is a way to increase the speed of creation of a parallel socket cluster. The time spent to include threads increases exponentially with the number of threads considered, and I use a computer with two 8-core CPUs, thus showing a total of 32 threads in Windows 7. Currently, I use the default parameters (type = "PSOCK"), but are there any fine-tuning parameters that I can use to take advantage of this system? Thanks in advance for your help! Arnaud

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Re: [R] speed of makeCluster (package parallel)
See library(help = "parallel")
Re: [R] speed of makeCluster (package parallel)
Thanks Simon, I already read the parallel vignette but I did not find what I wanted. Maybe you can point to a part of the document that can provide me hints! Arnaud
Re: [R] speed of makeCluster (package parallel)
First, use only the number of physical cores as the number of threads, i.e. I would not use hyper-threading. Each core has its own caches, and it is always fortunate if a process has enough memory; hyper-threads all use the same cache on the core they are running on. detectCores() gives me for example 4, but I know I have 2; I would therefore call makeCluster() with 2 nodes. mcaffinity() lets you perform a technique called process pinning (see processor affinity) and is only possible if the OS supports it. It sometimes makes sense to assign certain processes to certain CPUs so that each process has enough cache (e.g. for a 16-core machine, using 8 processes on CPUs 1, 3, 5, 7, 9, 11, 13 and 15, so that each process has the cache of two CPUs). A lot of these functions, though, are not available on Windows. First comes the problem you want to solve; then you look at how much memory each process will use and how much you have (more often the memory bandwidth is the bottleneck, not the computing power). Look at the architecture of your chips (how much L1 cache, how much L2 cache). Then you decide how many cores to use and whether it makes sense to pin processes to certain cores. There are no general recipes for parallel computing: each problem is different, and some problems are not even scalable. Simon
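A small sketch of this advice (assuming 2 physical cores, as in Simon's example; mcaffinity() is only available on some unix-alikes):

library(parallel)
ncores <- detectCores(logical = FALSE)  # physical cores only, ignoring hyper-threads
cl <- makeCluster(ncores)               # PSOCK by default
parSapply(cl, seq_len(ncores), function(i) sum(rnorm(1e6)))
stopCluster(cl)
# Where supported, mcaffinity(c(1, 3)) would pin the current process to CPUs 1 and 3.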
Re: [R] speed of makeCluster (package parallel)
On 28/10/2013 16:19, Arnaud Mosnier wrote:
> The time spent to include threads increases exponentially with the number of threads considered

It increases linearly in my tests (on a decent OS). But really, if parallel computing is worthwhile, you will be doing minutes of work on each worker process and the startup time will not be significant.

> and I use a computer with two 8-core CPUs, thus showing a total of 32 threads in Windows 7.

The first way to speed things up: use a decent OS; forking clusters is much faster.

> Currently, I use the default parameters (type = "PSOCK"), but are there any fine-tuning parameters that I can use to take advantage of this system?

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
1 South Parks Road, Oxford OX1 3TG, UK, Fax: +44 1865 272595
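A minimal sketch contrasting the two cluster types (FORK is unavailable on Windows, hence commented out; timings are machine-dependent):

library(parallel)
system.time(cl <- makeCluster(8, type = "PSOCK"))  # launches 8 fresh R worker processes
stopCluster(cl)
# On unix-alikes: system.time(cl <- makeCluster(8, type = "FORK"))  # forks; much faster startup
# stopCluster(cl)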
Re: [R] speed of makeCluster (package parallel)
Thanks a lot Simon, that's useful. I will take a look at this process-pinning thing. Arnaud
Re: [R] speed up a function
Dear Petr, Sorry for the delay. I've been out. Unfortunately, your code doesn't work either, even when using fromLast = T. Thank you for your help and your time. Santi

From: PIKAL Petr petr.pi...@precheza.cz
To: Santiago Guallar sgual...@yahoo.com
Cc: r-help r-help@r-project.org
Sent: Wednesday, July 10, 2013 8:35 AM
Subject: RE: [R] speed up a function

Hi Santiago

Keep the conversation on the list; others can have better ideas. I am still missing the reasoning. merge() seems to me the solution, but I am lost in your reasoning about what to keep and what to discard from the resulting object. After the merge I have this:

result <- structure(list(Ring = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = c("6106933", "6134701", "6140497", "6140719", "6140756",
"6140855", "6143070", "6143090", "6143093", "6175711", "6175726",
"6175730", "6175769", "6175776", "6175784", "6188609", "6188705",
"6195159", "6195171", "6198153", "6198154", "6198156", "6198157",
"6198172"), class = "factor"), jul = c(15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135),
timepos = structure(c(1307680575, 1307680740, 1307681040, 1307681340,
1307681640, 1307681940, 1307682240, 1307682540, 1307682780, 1307683080,
1307683380, 1307683680, 1307683980, 1307684280, 1307684397, 1307684424,
1307684484, 1307684490, 1307684580, 1307684880, 1307685180, 1307685243,
1307685321, 1307685336), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
act = c(3822L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 27L,
60L, 6L, 753L, NA, NA, NA, 78L, 15L, 18L), wd = c("dry", NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, "wet", "dry", "wet", "dry", NA, NA,
NA, "wet", "dry", "wet")), .Names = c("Ring", "jul", "timepos", "act",
"wd"), row.names = c(NA, -24L), class = "data.frame")

result
      Ring   jul             timepos  act  wd
1  6106933 15135 2011-06-10 04:36:15 3822 dry
2  6106933 15135 2011-06-10 04:39:00   NA  NA
3  6106933 15135 2011-06-10 04:44:00   NA  NA
4  6106933 15135 2011-06-10 04:49:00   NA  NA
5  6106933 15135 2011-06-10 04:54:00   NA  NA
6  6106933 15135 2011-06-10 04:59:00   NA  NA
7  6106933 15135 2011-06-10 05:04:00   NA  NA
8  6106933 15135 2011-06-10 05:09:00   NA  NA
9  6106933 15135 2011-06-10 05:13:00   NA  NA
10 6106933 15135 2011-06-10 05:18:00   NA  NA
11 6106933 15135 2011-06-10 05:23:00   NA  NA
12 6106933 15135 2011-06-10 05:28:00   NA  NA
13 6106933 15135 2011-06-10 05:33:00   NA  NA
14 6106933 15135 2011-06-10 05:38:00   NA  NA
15 6106933 15135 2011-06-10 05:39:57   27 wet
16 6106933 15135 2011-06-10 05:40:24   60 dry
17 6106933 15135 2011-06-10 05:41:24    6 wet
18 6106933 15135 2011-06-10 05:41:30  753 dry
19 6106933 15135 2011-06-10 05:43:00   NA  NA
20 6106933 15135 2011-06-10 05:48:00   NA  NA
21 6106933 15135 2011-06-10 05:53:00   NA  NA
22 6106933 15135 2011-06-10 05:54:03   78 wet
23 6106933 15135 2011-06-10 05:55:21   15 dry
24 6106933 15135 2011-06-10 05:55:36   18 wet

I understand you want to keep only time values from the GPS data.frame. OK, this can be done in the last step. But I am a bit lost in the logic for discarding lines 15-18.

Anyway, this can be what you want:

library(zoo)
result$wd <- na.locf(result$wd)
final <- result[is.na(result$act), ]
final
      Ring   jul             timepos act  wd
2  6106933 15135 2011-06-10 04:39:00  NA dry
3  6106933 15135 2011-06-10 04:44:00  NA dry
4  6106933 15135 2011-06-10 04:49:00  NA dry
5  6106933 15135 2011-06-10 04:54:00  NA dry
6  6106933 15135 2011-06-10 04:59:00  NA dry
7  6106933 15135 2011-06-10 05:04:00  NA dry
8  6106933 15135 2011-06-10 05:09:00  NA dry
9  6106933 15135 2011-06-10 05:13:00  NA dry
10 6106933 15135 2011-06-10 05:18:00  NA dry
11 6106933 15135 2011-06-10 05:23:00  NA dry
12 6106933 15135 2011-06-10 05:28:00  NA dry
13 6106933 15135 2011-06-10 05:33:00  NA dry
14 6106933 15135 2011-06-10 05:38:00  NA dry
19 6106933 15135 2011-06-10 05:43:00  NA dry
20 6106933 15135 2011-06-10 05:48:00  NA dry
21 6106933 15135 2011-06-10 05:53:00  NA dry

Regards, Petr

From: Santiago Guallar [mailto:sgual...@yahoo.com]
Sent: Tuesday, July 09, 2013 10:02 PM
To: PIKAL Petr
Subject: Re: [R] speed up a function

Dear Petr,

I wanted the two data sets merged in such a way that the values of the 'wd' vector (from the intervals t of 'xact') are assigned to the corresponding intervals of 'GPS'. If there is more than one value (i.e. if there is more than one interval of 'xact' for the corresponding interval of 'GPS'), then take the maximum (i.e. the value of the interval of 'xact' closest to the corresponding interval of 'GPS'). This is why the output of the particular sequence of the result I copied in the previous message contains only 'dry'.

Santi
Re: [R] speed up a function
Hm, so you probably ask to get something that you actually do not want. AFAIK what I called 'final' is the same as you asked for with your toy data, except for the column 'act', which you can easily get rid of. If what I suggested does not work with your real data, you shall prepare a better example with which my suggestion does not give the desired results. Regards, Petr

GPS <- structure(list(Ring = c(6106933L, 6106933L, 6106933L, 6106933L,
6106933L, 6106933L, 6106933L, 6106933L, 6106933L, 6106933L, 6106933L,
6106933L, 6106933L, 6106933L, 6106933L, 6106933L), jul = c(15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135), timepos = structure(c(1307680740,
1307681040, 1307681340, 1307681640, 1307681940, 1307682240, 1307682540,
1307682780, 1307683080, 1307683380, 1307683680, 1307683980, 1307684280,
1307684580, 1307684880, 1307685180), class = c("POSIXct", "POSIXt"),
tzone = "GMT")), .Names = c("Ring", "jul", "timepos"), row.names =
c("5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20"), class = "data.frame")

xact <- structure(list(Ring = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("6106933", "6134701", "6140497", "6140719", "6140756",
"6140855", "6143070", "6143090", "6143093", "6175711", "6175726",
"6175730", "6175769", "6175776", "6175784", "6188609", "6188705",
"6195159", "6195171", "6198153", "6198154", "6198156", "6198157",
"6198172"), class = "factor"), jul = c(15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135), timepos = structure(c(1307680575,
1307684397, 1307684424, 1307684484, 1307684490, 1307685243, 1307685321,
1307685336), class = c("POSIXct", "POSIXt"), tzone = "GMT"), act =
c(3822L, 27L, 60L, 6L, 753L, 78L, 15L, 18L), wd = c("dry", "wet", "dry",
"wet", "dry", "wet", "dry", "wet")), .Names = c("Ring", "jul",
"timepos", "act", "wd"), row.names = 170:177, class = "data.frame")

GPS$Ring <- factor(GPS$Ring)
result <- merge(xact, GPS, all = TRUE)
library(zoo)
result$wd <- na.locf(result$wd)
final <- result[is.na(result$act), ]
final
      Ring   jul             timepos act  wd
2  6106933 15135 2011-06-10 04:39:00  NA dry
3  6106933 15135 2011-06-10 04:44:00  NA dry
4  6106933 15135 2011-06-10 04:49:00  NA dry
5  6106933 15135 2011-06-10 04:54:00  NA dry
6  6106933 15135 2011-06-10 04:59:00  NA dry
7  6106933 15135 2011-06-10 05:04:00  NA dry
8  6106933 15135 2011-06-10 05:09:00  NA dry
9  6106933 15135 2011-06-10 05:13:00  NA dry
10 6106933 15135 2011-06-10 05:18:00  NA dry
11 6106933 15135 2011-06-10 05:23:00  NA dry
12 6106933 15135 2011-06-10 05:28:00  NA dry
13 6106933 15135 2011-06-10 05:33:00  NA dry
14 6106933 15135 2011-06-10 05:38:00  NA dry
19 6106933 15135 2011-06-10 05:43:00  NA dry
20 6106933 15135 2011-06-10 05:48:00  NA dry
21 6106933 15135 2011-06-10 05:53:00  NA dry

This is what you have asked for. Seems the same to me.
head(GPS1, 16) and desired result (added column wd)
      Ring   jul             timepos  wd
5  6106933 15135 2011-06-10 04:39:00 dry
6  6106933 15135 2011-06-10 04:44:00 dry
7  6106933 15135 2011-06-10 04:49:00 dry
8  6106933 15135 2011-06-10 04:54:00 dry
9  6106933 15135 2011-06-10 04:59:00 dry
10 6106933 15135 2011-06-10 05:04:00 dry
11 6106933 15135 2011-06-10 05:09:00 dry
12 6106933 15135 2011-06-10 05:13:00 dry
13 6106933 15135 2011-06-10 05:18:00 dry
14 6106933 15135 2011-06-10 05:23:00 dry
15 6106933 15135 2011-06-10 05:28:00 dry
16 6106933 15135 2011-06-10 05:33:00 dry
17 6106933 15135 2011-06-10 05:38:00 dry
18 6106933 15135 2011-06-10 05:43:00 dry
19 6106933 15135 2011-06-10 05:48:00 dry
20 6106933 15135 2011-06-10 05:53:00 dry

Petr Pikal
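For reference, the same last-value-carried-forward assignment can be sketched without merge(), using findInterval() on the raw timestamps (an alternative not from the thread; it assumes the GPS and xact objects above and that each GPS time should take the 'wd' of the most recent xact record):

idx <- findInterval(as.numeric(GPS$timepos), as.numeric(xact$timepos))
GPS$wd <- xact$wd[replace(idx, idx == 0, NA)]  # NA for GPS times before the first xact record
head(GPS)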
[R] Speed up or alternative to 'For' loop
I have a For loop that is quite slow and am wondering if there is a faster option:

df <- data.frame(TreeID = rep(1:500, each = 20), Age = rep(seq(1, 20, 1), 500))
df$Height <- exp(-0.1 + 0.2 * df$Age)
df$HeightGrowth <- NA  # initialize with NA
for (i in 2:nrow(df)) {
  if (df$TreeID[i] == df$TreeID[i-1]) {
    df$HeightGrowth[i] <- df$Height[i] - df$Height[i-1]
  }
}

Trevor Walker
Email: trevordaviswal...@gmail.com
Re: [R] Speed up or alternative to 'For' loop
Hello, One way to speed it up is to use a matrix instead of a data.frame. Since data.frames can hold data of all classes, access to their elements is slow. And your data is all numeric, so it can be held in a matrix. The second way below gave me a speed-up by a factor of 50.

system.time({
  for (i in 2:nrow(df)) {
    if (df$TreeID[i] == df$TreeID[i-1]) {
      df$HeightGrowth[i] <- df$Height[i] - df$Height[i-1]
    }
  }
})

system.time({
  df2 <- data.matrix(df)
  for (i in seq_len(nrow(df2))[-1]) {
    if (df2[i, "TreeID"] == df2[i - 1, "TreeID"])
      df2[i, "HeightGrowth"] <- df2[i, "Height"] - df2[i - 1, "Height"]
  }
})

all.equal(df, as.data.frame(df2))
# TRUE

Hope this helps, Rui Barradas
Re: [R] Speed up or alternative to 'For' loop
How about

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- cumsum(df$Height[in.ir])
}

Seemed fast enough to me. In R, it is generally good to look for ways to operate on entire vectors or arrays, rather than element by element within them. The cumsum() function does that in this example.

-Don
--
Don MacQueen, Lawrence Livermore National Laboratory
7000 East Ave., L-627, Livermore, CA 94550, 925-423-1062
Re: [R] Speed up or alternative to 'For' loop
On Jun 10, 2013, at 10:28 AM, Trevor Walker wrote:
> I have a For loop that is quite slow and am wondering if there is a faster option ...

Avoid tests with if(){} else {}. Use vectorized code, possibly with 'ifelse', but in this case you need a function that does calculations within groups. The ave() function with diff() will do it compactly and efficiently:

df <- data.frame(TreeID = rep(1:5, each = 4), Age = rep(seq(1, 4, 1), 5))
df$Height <- exp(-0.1 + 0.2 * df$Age)
df$HeightGrowth <- NA  # initialize with NA
df$HeightGrowth <- ave(df$Height, df$TreeID, FUN = function(vec) c(NA, diff(vec)))
df
   TreeID Age   Height HeightGrowth
1       1   1 1.105171           NA
2       1   2 1.349859    0.2446879
3       1   3 1.648721    0.2988625
4       1   4 2.013753    0.3650314
5       2   1 1.105171           NA
6       2   2 1.349859    0.2446879
7       2   3 1.648721    0.2988625
8       2   4 2.013753    0.3650314
9       3   1 1.105171           NA
10      3   2 1.349859    0.2446879
11      3   3 1.648721    0.2988625
12      3   4 2.013753    0.3650314
13      4   1 1.105171           NA
14      4   2 1.349859    0.2446879
15      4   3 1.648721    0.2988625
16      4   4 2.013753    0.3650314
17      5   1 1.105171           NA
18      5   2 1.349859    0.2446879
19      5   3 1.648721    0.2988625
20      5   4 2.013753    0.3650314

(On my machine it was over six times as fast as the if-based code from Arun.)

--
David Winsemius, Alameda, CA, USA
Re: [R] Speed up or alternative to 'For' loop
Sorry, it looks like I was hasty. Absent another dumb mistake, the following should do it. The request was for differences, i.e., the amount of growth from one period to the next, separately for each tree.

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- c(NA, diff(df$Height[in.ir]))
}

And this gives the same result as Rui Barradas' previous response.

-Don
--
Don MacQueen, Lawrence Livermore National Laboratory
7000 East Ave., L-627, Livermore, CA 94550, 925-423-1062
Re: [R] Speed up or alternative to 'For' loop
Well, speaking of hasty... This will also do it, provided that each tree's initial height is less than the previous tree's final height. In principle, not a safe assumption, but it might be OK depending on where the data came from.

df$delta <- c(NA, diff(df$Height))
df$delta[df$delta < 0] <- NA

-Don
--
Don MacQueen, Lawrence Livermore National Laboratory
7000 East Ave., L-627, Livermore, CA 94550, 925-423-1062
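A quick sketch (not from the thread) of how one might verify that assumption on the toy data before trusting the NA trick: every raw diff() at a tree boundary must be negative, so that exactly the boundary rows get masked.

b <- which(diff(df$TreeID) != 0) + 1   # first row of each new tree
all(c(NA, diff(df$Height))[b] < 0)     # TRUE for this synthetic data; worth checking on real data
# [1] TRUE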
Re: [R] Speed up or alternative to 'For' loop
Hi, Some speed comparisons:

df <- data.frame(TreeID = rep(1:6000, each = 20), Age = rep(seq(1, 20, 1), 6000))
df$Height <- exp(-0.1 + 0.2 * df$Age)
df1 <- df
df3 <- df
library(data.table)
dt1 <- data.table(df)
df$HeightGrowth <- NA

system.time({  # Rui's 2nd function
  df2 <- data.matrix(df)
  for (i in seq_len(nrow(df2))[-1]) {
    if (df2[i, "TreeID"] == df2[i - 1, "TreeID"])
      df2[i, "HeightGrowth"] <- df2[i, "Height"] - df2[i - 1, "Height"]
  }
})
#  user  system elapsed
# 1.108   0.000   1.109

system.time({for (ir in unique(df$TreeID)) {  # Don's first function
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- c(NA, diff(df$Height[in.ir]))
}})
#    user  system elapsed
# 100.004   0.704 100.903

system.time({df3$delta <- c(NA, diff(df3$Height))  ## Don's 2nd function
df3$delta[df3$delta < 0] <- NA})  # winner
#  user  system elapsed
# 0.016   0.000   0.014

system.time(df1$HeightGrowth <- ave(df1$Height, df1$TreeID, FUN = function(vec) c(NA, diff(vec))))  # David's
#  user  system elapsed
# 0.136   0.000   0.137

system.time(dt1[, HeightGrowth := c(NA, diff(Height)), by = TreeID])
#  user  system elapsed
# 0.076   0.000   0.079

identical(df1, as.data.frame(dt1))
#[1] TRUE
identical(df1, df)
#[1] TRUE
head(df1, 2)
#  TreeID Age   Height HeightGrowth
#1      1   1 1.105171           NA
#2      1   2 1.349859    0.2446879
head(df2, 2)
#     TreeID Age   Height HeightGrowth
#[1,]      1   1 1.105171           NA
#[2,]      1   2 1.349859    0.2446879

A.K.
Re: [R] speed of a vector operation question
Thank you all very much for your time and suggestions. The link to stackoverflow was very helpful. Here are some timings in case someone wants to know. (I noticed that microbenchmark results vary depending on how many functions one tries to benchmark at a time. However, the min stays about the same.)

# just to refresh, most of the code is from the stackoverflow link provided by Martin Morgan:
# http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match
library(compiler)
library(inline)
library(microbenchmark)

f0 <- function(v) length(which(v < 0))
f1 <- function(v) sum(v < 0)
f2 <- function(v) which.min(v < 0) - 1L

f3 <- function(x) {  # binary search implemented in R
  imin <- 1L
  imax <- length(x)
  while (imax >= imin) {
    imid <- as.integer(imin + (imax - imin) / 2)
    if (x[imid] >= 0)
      imax <- imid - 1L
    else
      imin <- imid + 1L
  }
  imax
}
f3.c <- cmpfun(f3)  # pre-compiled

# binary search in C
f4 <- cfunction(c(x = "numeric"), "
  int imin = 0, imax = Rf_length(x) - 1, imid;
  while (imax >= imin) {
    imid = imin + (imax - imin) / 2;
    if (REAL(x)[imid] >= 0)
      imax = imid - 1;
    else
      imin = imid + 1;
  }
  return ScalarInteger(imax + 1);
")

# this one is a separate suggestion by William Dunlap:
f5 <- function(v) { tabulate(findInterval(v, c(-Inf, 0, 1, Inf)))[1] }

vec <- c(seq(-100, -1, length.out = 1e6), rep(0, 20), seq(1, 100, length.out = 1e6))
# the identity of results was verified
microbenchmark(f1(vec), f2(vec), f3(vec), f3.c(vec), f4(vec), f5(vec))

Unit: microseconds
      expr       min         lq    median         uq       max neval
   f1(vec) 17054.233 17831.1385 18514.305 19512.4705 54603.435   100
   f2(vec) 23624.353 25026.4265 26034.785 29322.1150 60014.458   100
   f3(vec)    76.902    93.2340   111.834   116.8370   129.888   100
 f3.c(vec)    21.883    30.7530    37.757    54.1250    62.939   100
   f4(vec)     6.575    10.5885    30.389    31.9385    37.610   100
   f5(vec) 35365.088 36767.6175 38317.103 40671.2000 69209.425   100

So, I'll try to go with the inline binary search and see if I can precompile complex conditions. Thank you, again, for your help! Mikhail.

On Friday, April 26, 2013 20:52:27, Suzen, Mehmet wrote:
Hello Mikhail, I could suggest you use the ff package for fast access to large data structures:
http://cran.r-project.org/web/packages/ff/index.html
http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
Best, Mehmet
[R] speed of a vector operation question
Hello, I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are sorted (with duplicates) in the vector (v). I am obtaining the length of vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some scalar variables. What is the most efficient way to do this? I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems to me more efficient than length(which(v < c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time, Mikhail
Re: [R] speed of a vector operation question
I think the sum way is the best.
Re: [R] speed of a vector operation question
> I think the sum way is the best.

On my Linux machine running R-3.0.0 the sum way is slightly faster:

x <- rexp(1e6, 2)
system.time(for(i in 1:100) sum(x > .3 & x < .5))
#   user  system elapsed
#  4.664   0.340   5.018
system.time(for(i in 1:100) length(which(x > .3 & x < .5)))
#   user  system elapsed
#  5.017   0.160   5.186

If you are doing many of these counts on the same dataset you can save time by using functions like cut(), table(), ecdf(), and findInterval(). E.g.,

system.time(r1 <- vapply(seq(0, 1, by = 1/128)[-1], function(i) sum(x > (i - 1/128) & x <= i), FUN.VALUE = 0L))
#   user  system elapsed
#  5.332   0.568   5.909
system.time(r2 <- table(cut(x, seq(0, 1, by = 1/128))))
#   user  system elapsed
#  0.500   0.008   0.511
all.equal(as.vector(r1), as.vector(r2))
# [1] TRUE

You should do the timings yourself, as the relative speeds will depend on the version or dialect of the R interpreter and how it was compiled. E.g., with the current development version of 'TIBCO Enterprise Runtime for R' (aka 'TERR') on this same 8-core Linux box the sum way is considerably faster than the length(which()) way:

x <- rexp(1e6, 2)
system.time(for(i in 1:100) sum(x > .3 & x < .5))
#  user  system elapsed
#  1.87    0.03    0.48
system.time(for(i in 1:100) length(which(x > .3 & x < .5)))
#  user  system elapsed
#  3.21    0.04    0.83
system.time(r1 <- vapply(seq(0, 1, by = 1/128)[-1], function(i) sum(x > (i - 1/128) & x <= i), FUN.VALUE = 0L))
#  user  system elapsed
#  2.19    0.04    0.56
system.time(r2 <- table(cut(x, seq(0, 1, by = 1/128))))
#  user  system elapsed
#  0.27    0.01    0.13
all.equal(as.vector(r1), as.vector(r2))
# [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
Re: [R] speed of a vector operation question
A very similar question was asked on StackOverflow (by Mikhail? and then I guess the answers there were somehow not satisfactory...)

http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match

where it turns out that a binary search (implemented in R) on the sorted vector is much faster than sum(), etc., I guess because it's log N without copying. The more complicated condition x > .3 & x < .5 could be satisfied with multiple calls to the search.

Martin
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N., PO Box 19024, Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793
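Martin's last point can be sketched with findInterval(), which is itself a binary search: on a sorted vector, a compound condition reduces to two O(log n) lookups (a sketch under the assumption that ties at the cutpoints do not matter, as with continuous data):

xs <- sort(rexp(1e6, 2))
n_at_most <- function(s, cut) findInterval(cut, s)  # how many s[i] <= cut
n_at_most(xs, .5) - n_at_most(xs, .3)               # count of .3 < x <= .5 via two searches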
Re: [R] speed of a vector operation question
R's findInterval can also take advantage of a sorted x vector. E.g., in R-3.0.0 on the same 8-core Linux box:

x <- rexp(1e6, 2)
system.time(for(i in 1:100) tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2])
#  user  system elapsed
# 2.444   0.000   2.446
xs <- sort(x)
system.time(for(i in 1:100) tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2])
#  user  system elapsed
# 1.472   0.000   1.475
tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2]
# [1] 180636
sum(xs > .3 & xs <= .5)
# [1] 180636

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
This seems to me more efficient than length(which(v < c)), but, please, correct me if I'm wrong. So, is there anything faster than what I already use? I'm running R 2.14.2 on Linux kernel 3.4.34. I appreciate your time,

Mikhail
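To make the range-count idioms from this thread concrete, here is a small self-contained sketch; the data and cut points are made up, and the three counts agree up to how values exactly on a boundary are binned:

x <- rexp(1e6, 2)                                   # made-up data
sum(x > .3 & x <= .5)                               # the simple sum way
length(which(x > .3 & x <= .5))                     # the length(which()) way
tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2]  # one pass over fixed cuts;
                                                    # bin 2 is [.3, .5)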
[R] speed up merge
Hello,

I have a nasty loop that I have to do 11877 times. The only thing that really slows it down is this merge:

xx1 = merge(dt, ua_rd, by.x = 1, by.y = 'rt_date', all.x = T)

Any ideas on how to speed it up? The output can't change materially (it works), but I'd like it to go faster. I'm looking at getting around the loop (not shown), but I'm trying to speed up the merge first. I'll post regarding the loop if nothing comes of this post. Here is some information on what type of stuff is going into the merge:

class(ua_rd)
[1] matrix
dim(ua_rd)
[1] 20  2
head(ua_rd)
           AName           rt_date
2007-03-31 14066.580078125 2007-04-26
2007-06-30 14717           2007-07-19
2007-09-30 15528           2007-10-25
2007-12-31 17609           2008-01-24
2008-03-31 17168           2008-04-24
2008-06-30 17681           2008-07-17
class(dt)
[1] character
length(dt)
[1] 1799
dt[1:10]
[1] 2007-03-31 2007-04-01 2007-04-02 2007-04-03 2007-04-04 2007-04-05 2007-04-06 2007-04-07
[9] 2007-04-08 2007-04-09

thanks, Ben
Re: [R] speed up merge
Hi Ben,

It seems you merge a matrix and a vector. As far as I understand, the first thing merge does is convert these to data.frames. Is it possible to make the preceding steps give data frames?

Regards, Kees
Re: [R] speed up merge
On Fri, Mar 02, 2012 at 03:24:20AM -0700, Ben quant wrote: "Hello, I have a nasty loop that I have to do 11877 times."

Are you completely sure about that? I often find myself avoiding loops-by-row by constructing vectors of which rows fulfil a condition, and then creating new vectors out of that vector. If you elaborate on the problem, perhaps we could find a way to avoid the loops altogether? Mostly as a note to self, I wrote http://code.cjb.net/vectors-instead-of-loop.html; it might be understood by others too, but I'm not sure.

-- Hans Ekbrand (http://sociologi.cjb.net) h...@sociologi.cjb.net
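A tiny sketch of the pattern Hans describes, with a made-up data frame and condition: build a logical vector marking the rows that fulfil the condition, then compute on that subset in one vectorized step instead of looping row by row.

dat  <- data.frame(a = rnorm(1e5), b = rnorm(1e5))  # made-up data
keep <- dat$a > 0 & dat$b < 1      # logical vector: rows fulfilling the condition
out  <- dat$a[keep] + dat$b[keep]  # one vectorized operation instead of a loop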
Re: [R] speed up merge
I'm not sure. I'm still looking into it. It's pretty involved, so I asked the simplest question first (the merge question). I'll reply back with a mock-up/sample that is testable under a more appropriate subject line. Probably this weekend.

Regards, Ben
Re: [R] speed up merge
One way to speed up the merge is not to use merge. You can use 'match' to find matching indices and then subset manually. Does this do what you want:

ua <- read.table(text = 'AName rt_date
2007-03-31 14066.580078125 2007-04-01
2007-06-30 14717 2007-04-03
2007-09-30 15528 2007-10-25
2007-12-31 17609 2008-04-06
2008-03-31 17168 2008-04-24
2008-06-30 17681 2008-04-09', header = TRUE, as.is = TRUE)
dt <- c("2007-03-31", "2007-04-01", "2007-04-02", "2007-04-03", "2007-04-04",
        "2007-04-05", "2007-04-06", "2007-04-07", "2007-04-08", "2007-04-09")
# find matching values in ua
indx <- match(dt, ua$rt_date)
# create new result matrix
xx1 <- cbind(dt, ua[indx, ])
rownames(xx1) <- NULL  # delete funny names
xx1
           dt    AName    rt_date
1  2007-03-31       NA         NA
2  2007-04-01 14066.58 2007-04-01
3  2007-04-02       NA         NA
4  2007-04-03 14717.00 2007-04-03
5  2007-04-04       NA         NA
6  2007-04-05       NA         NA
7  2007-04-06       NA         NA
8  2007-04-07       NA         NA
9  2007-04-08       NA         NA
10 2007-04-09       NA         NA

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
Re: [R] speed up merge
I'll have to give this a try this weekend. Thank you!

ben
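For later readers, a self-contained timing sketch comparing the two approaches; the sizes mirror the thread (1799 character dates against a 20-row table), the data are made up, and 1000 repetitions stand in for Ben's 11877 loop iterations:

dt <- format(seq(as.Date("2007-03-31"), by = "day", length.out = 1799))
ua_rd <- data.frame(AName = rnorm(20), rt_date = sample(dt, 20),
                    stringsAsFactors = FALSE)
# left join via merge(), as in the original loop
system.time(for (i in 1:1000)
  xx1 <- merge(data.frame(dt), ua_rd, by.x = 1, by.y = "rt_date", all.x = TRUE))
# left join via match(), as Jim suggests
system.time(for (i in 1:1000)
  xx2 <- cbind(dt, ua_rd[match(dt, ua_rd$rt_date), ]))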
Re: [R] speed up this algorithm (apply-function / 4D array)
here's another one - which is easier to generalize:

x <- array(rnorm(50 * 50 * 50 * 91, 0, 2), dim = c(50, 50, 50, 91))
y <- x[, , , 1:90]  # decide yourself what to do with slice 91, but
                    # 91 is not divisible by 3
system.time({
  dim(y) <- c(50, 50, 50, 3, 90 %/% 3)
  y <- aperm(y, c(4, 1:3, 5))
  v2 <- colMeans(y)
})
   User      System verstrichen
   0.32        0.08        0.40

(my computer is a bit slower than Bill's:)

system.time(v1 <- f1(x))
   User      System verstrichen
  0.360       0.030       0.396

Claudia
--
Claudia Beleites
Spectroscopy/Imaging
Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany
email: claudia.belei...@ipht-jena.de
phone: +49 3641 206-133
fax: +49 2641 206-399
[R] speed up this algorithm (apply-function / 4D array)
Hi,

I have this sample-code (see below) and I was wondering whether it is possible to speed things up. What this code does is the following: x is a 4D array (you can imagine it as x, y, z-coordinates and a time-coordinate). So x contains 50x50x50 data-arrays for 91 time-points. Now I want to reduce the 91 time-points. I want to merge three consecutive time points to one time-point by calculating the mean of these three time-points for every x, y, z coordinate. The reduce-sequence defines which time-points should get merged. And the apply-function in the for-loop calculates the mean of the three 3D-arrays and puts them into a new 4D array (data_reduced). The problem is that even in this example it takes really long. I thought apply would already vectorize, rather than loop over every coordinate. But for my actual data-set it takes a really long time... So I would be really grateful for any suggestions how to speed this up.

x <- array(rnorm(50 * 50 * 50 * 90, 0, 2), dim = c(50, 50, 50, 91))

data_reduced <- array(0, dim = c(50, 50, 50, 90/3))

reduce <- seq(1, 90, 3)

for (i in 1:length(reduce)) {
  data_reduced[, , , i] <- apply(x[, , , reduce[i]:(reduce[i]+3)], 1:3, mean)
}
Re: [R] speed up this algorithm (apply-function / 4D array)
I corrected your code a bit and put it into a function, f0, to make testing easier. I also made a small dataset to make testing easier. Then I made a new function f1 which does what f0 does in a vectorized manner:

x <- array(rnorm(50 * 50 * 50 * 91, 0, 2), dim = c(50, 50, 50, 91))
xsmall <- array(log(seq_len(2 * 2 * 2 * 91)), dim = c(2, 2, 2, 91))

f0 <- function(x) {
  data_reduced <- array(0, dim = c(dim(x)[1:3], trunc(dim(x)[4]/3)))
  reduce <- seq(1, dim(x)[4]-1, by = 3)
  for (i in 1:length(reduce)) {
    data_reduced[, , , i] <- apply(x[, , , reduce[i]:(reduce[i]+2)], 1:3, mean)
  }
  data_reduced
}

f1 <- function(x) {
  reduce <- seq(1, dim(x)[4]-1, by = 3)
  data_reduced <- (x[, , , reduce] + x[, , , reduce+1] + x[, , , reduce+2]) / 3
  data_reduced
}

The results were:

system.time(v1 <- f1(x))
   user  system elapsed
  0.280   0.040   0.323
system.time(v0 <- f0(x))
   user  system elapsed
 73.760   0.060  73.867
all.equal(v0, v1)
[1] TRUE

"I thought apply would already vectorize, rather than loop over every coordinate."

No, you have that backwards. Use *apply functions when you cannot figure out how to vectorize.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
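For completeness, the reshape idea from this thread can be wrapped into a reusable function. This is a sketch (the name f2 and its default block size are made up) that drops any time points beyond the last full block, as both replies do:

f2 <- function(x, k = 3) {
  nt <- dim(x)[4] %/% k * k              # time points filling complete blocks
  y  <- x[, , , seq_len(nt)]
  dim(y) <- c(dim(x)[1:3], k, nt %/% k)  # split time into blocks of length k
  y  <- aperm(y, c(4, 1:3, 5))           # move the within-block index up front
  colMeans(y)                            # average over each block
}
# v3 <- f2(x); all.equal(v1, v3) should be TRUE for the 91-slice example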
Re: [R] Speed Advice for R --- avoid data frames
On occasion, as pointed out in an earlier posting, it is efficient to convert to a matrix and when finished convert back to a data frame. The Hmisc package's asNumericMatrix and matrix2dataFrame functions assist by converting character variables to factors if needed, and by holding on to original attributes of variables in the data frame such as levels, then restoring the attributes.

Frank
Re: [R] Speed Advice for R --- avoid data frames
I think you should start to look at the mechanisms to construct data.frames (such as data.frame) and learn that data.frames are special lists. Then you may want to look at the differences between the .Primitive("[") and .Primitive("[<-") used for vectors (including vectors with dim attributes such as matrices) and the corresponding methods for data.frames: "[<-.data.frame" and "[.data.frame". After that, I doubt you want to improve further on. Note also that data.frames can be pretty large and you really do not want to store a matrix of pointers as large as the data.frame. People working with large data.frames won't be happy with such a suggestion. If you want to follow up, I'd suggest to move the thread to R-devel where it seems to be more appropriate.

Best, Uwe
[R] Speed Advice for R --- avoid data frames
This email is intended for R users that are not that familiar with R internals and are searching google about how to speed up R.

Despite common misperception, R is not slow when it comes to iterative access. R is fast when it comes to matrices. R is very slow when it comes to iterative access into data frames. Such access occurs when a user uses data$varname[index], which is a very common operation. To illustrate, run the following program:

R <- 1000; C <- 1000

example <- function(m) {
  cat("rows: ")
  cat(system.time(for (r in 1:R) m[r, 20] <- sqrt(abs(m[r, 20])) + rnorm(1)), "\n")
  cat("columns: ")
  cat(system.time(for (c in 1:C) m[20, c] <- sqrt(abs(m[20, c])) + rnorm(1)), "\n")
  if (is.data.frame(m)) {
    cat("df: columns as names: ")
    cat(system.time(for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1)), "\n")
  }
}

cat("\n Now as matrix\n")
example(matrix(rnorm(C*R), nrow = R))
cat("\n Now as data frame\n")
example(as.data.frame(matrix(rnorm(C*R), nrow = R)))

The following are the reported timings under R 2.12.0 on a Mac Pro 3,1 with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column access, and 300 times slower than matrix row access. R's data frame operational speed is an amazing 40 data accesses per second. I have not seen access numbers this low for decades.

How to avoid it? Not easy. One way is to create multiple matrices and group them as an object. Of course, this loses a lot of features of R. Another way is to copy all data used in calculations out of the data frame into a matrix, do the operations, and then copy them back. Not ideal, either.

In my opinion, this is an R design flaw. Data frames are the fundamental unit of much statistical analysis, and should be fast. I think R lacks any indexing into data frames. Turning on indexing of data frames should at least be an optional feature.

I hope this message post helps others.

/iaw

Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/
Re: [R] Speed Advice for R --- avoid data frames
Some comments:

The comparison matrix rows vs. matrix columns is incorrect: note that R has lazy evaluation, hence you construct your matrix in the timing for the rows and it is already constructed in the timing for the columns, hence you want to use:

M <- matrix(rnorm(C*R), nrow = R)
D <- as.data.frame(matrix(rnorm(C*R), nrow = R))
example(M)
example(D)

Further on, you are correct with your statement that data.frame indexing is much slower, but if you can store your data in matrix form, just go on as it is. I doubt anybody is really going to make the index operation you cited within a loop. Then, with a data.frame, I can live with many vectorized replacements again:

system.time(D[, 20] <- sqrt(abs(D[, 20])) + rnorm(1000))
   user  system elapsed
   0.01    0.00    0.01
system.time(D[20, ] <- sqrt(abs(D[20, ])) + rnorm(1000))
   user  system elapsed
   0.51    0.00    0.52

OK, it would be nice to do that faster, but this is not easy. I think R Core is happy to see contributions to make it faster without breaking existing features.

Best wishes, Uwe
Re: [R] Speed Advice for R --- avoid data frames
hi uwe---thanks for the clarification. of course, my example should always be done in vectorized form. I only used it to show how iterative access compares in the simplest possible fashion. 100 accesses per second is REALLY slow, though.

I don't know R internals and the learning curve would be steep. moreover, there is no guarantee that changes I would make would be accepted. so, I cannot do this. however, for an R expert, this should not be too difficult. conceptually, if data frame element access primitives are create/write/read/destroy in the code, then it's truly trivial. just add a matrix (dim the same as the data frame) of byte pointers to point at the storage upon creation/change time. this would be quick-and-dirty. for curiosity, do you know which source file has the data frame internals? maybe I will get tempted anyway if it is simple enough.

(a more efficient but more involved way to do this would be to store a data frame internally always as a matrix of data pointers, but this would probably require more surgery.)

It is also not as important for me as it is for others...to give a good impression to those that are not aware of the tradeoffs---which is most people considering to adopt R.

/iaw

Ivo Welch (ivo.we...@gmail.com)
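As a footnote to this thread, a minimal sketch of the copy-out/copy-back workaround mentioned above, with made-up sizes: move the numeric columns into a matrix once, do the iterative work there, and write the result back in one step.

D <- as.data.frame(matrix(rnorm(1000 * 10), nrow = 1000))  # made-up data
m <- as.matrix(D)                  # one copy out of the data frame
for (r in seq_len(nrow(m)))        # iterative access is cheap on a matrix
  m[r, 5] <- sqrt(abs(m[r, 5])) + rnorm(1)
D[] <- m                           # one copy back; keeps names and class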
[R] Speed up an R code
Hello,

Are there some basic things one can do to speed up R code? I am new to R and currently going through the following situation.

I have run an R code on two different machines. I have R 2.12 installed on both. Desktop 1 is slightly older and has a dual core processor with 4 gigs of RAM. Desktop 2 is a newer one and has a Xeon processor W3505 with 12 gigs of RAM. Both run on Windows 7. I don't really see any significant speed up in the newer computer (Desktop 2). In the older one the program took around 5 hrs 15 mins and in the newer one it took almost 4 hrs 30 mins.

In the newer desktop, R gives me the following:

memory.limit()
[1] 1024
memory.size()
[1] 20.03

Is something hampering me here? Do I need to increase the limit and size? Can this change be made permanent? Or am I looking at the wrong place? I have never seen my R programs using much CPU or RAM when they run.

If this is not something inherent to R, then I guess I need to write more efficient code. Suggestions/solutions are welcome.

Thanks, -Debs
Re: [R] Speed up an R code
This is a very open-ended question that depends very heavily on what you are trying to do and how you are doing it. Oftentimes, the bottleneck operations that limit speed the most are not necessarily sped up by adding RAM. They also often require special setup to run multiple operations/iterations in parallel. Try some of the options at the High Performance Computing task view for specifics.

http://cran.cnr.berkeley.edu/web/views/

HTH, Jon

--
===
Jon Daily
Technician
===
#!/usr/bin/env outside
# It's great, trust me.
Re: [R] Speed up an R code
Take a small subset of your program that would run through the critical sections and use ?Rprof to see where some of the hot spots are. How do you know it is not using the CPU? Are you using perfmon to look at what is being used? Are you paging? If you are not paging, and not doing a lot of I/O, then you should tie up one CPU 100% if you are CPU bound.

You probably need to put some output in your program to mark its progress. At a minimum, do the following:

cat('I am here', proc.time(), '\n')

By changing the initial string, you can see where you are, and this also reports the user CPU, system CPU and elapsed time. This should be a good indication of where time is being spent. So there are a number of things you can do to instrument your code. If I had a program that was running for hours, I would definitely have something that tells me where I am and how much time is being taken. If you have some large loop, then you could put out this information every n'th time through; the tag on the message would indicate this. There is also the progress bar that I use a lot to see if I am making progress.

After you have instrumented your code and have used Rprof, you might have some data that people would help you with. If you are using data frames a lot, remember that indexing into them can be costly. Converting them to matrices, if appropriate, can give a big speedup. Rprof will show you this.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
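A minimal sketch of the instrumentation Jim describes, wrapping Rprof() around a stand-in workload and marking progress every 100th iteration; the file name and loop body are made up:

Rprof("profile.out")                       # start profiling to a file
for (i in 1:1000) {
  x <- solve(crossprod(matrix(rnorm(100), 10, 10)) + diag(10))  # stand-in work
  if (i %% 100 == 0)
    cat("I am here: iteration", i, proc.time(), "\n")
}
Rprof(NULL)                                # stop profiling
summaryRprof("profile.out")$by.self        # where the time went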
Re: [R] Speed up code with for() loop
Run it in several blocks of matrices with appropriate dimensions? This allows easy parallelization as well.

Uwe Ligges
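A sketch of Uwe's suggestion with shrunken, made-up sizes: draw and reduce the bootstrap index matrix in blocks of columns, so only one block is in memory at a time while the column means accumulate.

DCF <- rexp(1000)                   # stand-in for DCF_korrigiert
S <- length(DCF); Subset <- 1000; D <- 10000; block <- 1000
Delta_ln <- numeric(0)
for (b in seq_len(D / block)) {     # assumes D %% block == 0
  Select <- matrix(sample(S, block * Subset, replace = TRUE),
                   nrow = Subset, ncol = block)
  m <- matrix(DCF[Select], nrow = Subset, ncol = block)
  Delta_ln <- c(Delta_ln, log(colMeans(m, na.rm = TRUE) / (1/0.10)))
}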
Re: [R] Speed up plotting to MSWindows graphics window
If you are plotting that many data points, you might want to look at 'hexbin' as a way of aggregating the values to a different presentation. It is especially nice if you are doing a scatter plot with a lot of data points and trying to make sense out of it.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
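To make Jim's pointer concrete, a short sketch with made-up data (hexbin is a CRAN package, so it needs installing first):

library(hexbin)
x <- rnorm(1e6); y <- x + rnorm(1e6)   # made-up scatter data
plot(hexbin(x, y, xbins = 50))         # aggregates points into hexagonal bins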
Re: [R] Speed up code with for() loop
Barth sent me a very good code and I modified it a bit. Have a look:

Error <- rnorm(1000, mean = 0, sd = 0.05)
estimate <- (log(1 + 0.10) + Error)
DCF_korrigiert <- 1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2)) *
                            sqrt(2*pi/(0.05^2)) *
                            (1 - pnorm(0, (-estimate)/(0.05^2),
                                       sqrt(1/(0.05^2)))))) - 1)
DCF_verzerrt <- (1/(exp(estimate) - 1))

S <- 1000        # total sample size
D <- 1           # number of subsamples
Subset <- 1      # number in each subsample
Select <- matrix(sample(S, D*Subset, replace = TRUE), nrow = Subset, ncol = D)
DCF_korrigiert_select <- matrix(DCF_korrigiert[Select], nrow = Subset, ncol = D)
Delta_ln <- (log(colMeans(DCF_korrigiert_select, na.rm = T)/(1/0.10)))

The only problem I discovered is that R cannot handle more than 2,147,483,647 integers, thus the cells in the matrix are bounded by this condition. (R shows the max by typing: .Machine$integer.max.) And if you want to save the workspace, the file with 10,000 times 10,000 becomes around 2 GB, compared to the original of just 300 MB. So I cannot perform my previous bootstrap with 1,000,000 times 100,000. But nevertheless 10,000 times 10,000 seems to be sufficient; I have to say it's amazing how fast the idea works.

Has anybody a suggestion how to make it work for the 1,000,000 times 100,000 bootstrap???
[R] Speed up code with for() loop
Hello everybody,

I'm wondering whether it might be possible to speed up the following code:

Error <- rnorm(1000, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- 1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2)) *
                            sqrt(2*pi/(0.05^2)) *
                            (1 - pnorm(0, (-estimate)/(0.05^2),
                                       sqrt(1/(0.05^2)))))) - 1)

D <- 10
Delta_ln <- rep(0, D)
for (i in 1:D)
  Delta_ln[i] <- (log(mean(sample(DCF_korrigiert, 100, replace = TRUE))/(1/0.10)))

The calculation of the for-loop takes several hours even on a very quick machine (4 GHz, 8 GB RAM, Windows 2008 Server 64-bit). Has anybody an idea how to improve the for-line?

Thanks for helping me.
Hans
Re: [R] Speed up code with for() loop
Hans,

You could parallelize it with the multicore package. The only other thing I can think of is to use calls to .Internal(). But be vigilant, as this might not be good advice. ?.Internal warns that only true R wizards should even consider using the function. First, an example with .Internal() calls; later, multicore. For me, the following reduces elapsed time by about 9% on Windows 7 and by about 20% on today's new Ubuntu Natty.

## Set number of replicates
n <- 1

## Your example
set.seed(1)
time.one <- Sys.time()
Error <- rnorm(n, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- 1/(exp(1/(exp(0.5 * (-estimate)^2 / (0.05^2)) *
                            sqrt(2 * pi / (0.05^2)) *
                            (1 - pnorm(0, ((-estimate) / (0.05^2)),
                                       sqrt(1 / (0.05^2)))))) - 1)
D <- n
Delta_ln <- rep(0, D)
for (i in 1:D)
  Delta_ln[i] <- (log(mean(sample(DCF_korrigiert, D, replace = TRUE))/(1/0.10)))
time.one <- Sys.time() - time.one

## A few modifications with .Internal()
set.seed(1)
time.two <- Sys.time()
Error <- rnorm(n, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- 1/(exp(1/(exp(0.5 * (-estimate)^2 / (0.05^2)) *
                            sqrt(2 * pi / (0.05^2)) *
                            (1 - pnorm(0, ((-estimate) / (0.05^2)),
                                       sqrt(1 / (0.05^2)))))) - 1)
D <- n
Delta_ln2 <- numeric(length = D)
Delta_ln2 <- vapply(Delta_ln2, function(x) {
  log(.Internal(mean(DCF_korrigiert[.Internal(
    sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1)
time.two <- Sys.time() - time.two

## Compare
all.equal(Delta_ln, Delta_ln2)
time.one
time.two
as.numeric(time.two) / as.numeric(time.one)

Then you could parallelize it with multicore's parallel() function:

## Try multicore
require(multicore)
set.seed(1)
time.three <- Sys.time()
Error <- rnorm(n, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- 1/(exp(1/(exp(0.5 * (-estimate)^2 / (0.05^2)) *
                            sqrt(2 * pi / (0.05^2)) *
                            (1 - pnorm(0, ((-estimate) / (0.05^2)),
                                       sqrt(1 / (0.05^2)))))) - 1)
D <- n/2
Delta_ln3 <- numeric(length = D)
Delta_ln3.1 <- parallel(vapply(Delta_ln3, function(x) {
  log(.Internal(mean(DCF_korrigiert[.Internal(
    sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1), mc.set.seed = T)
Delta_ln3.2 <- parallel(vapply(Delta_ln3, function(x) {
  log(.Internal(mean(DCF_korrigiert[.Internal(
    sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1), mc.set.seed = T)
results <- collect(list(Delta_ln3.1, Delta_ln3.2))
names(results) <- NULL
Delta_ln3 <- do.call(append, results)
time.three <- Sys.time() - time.three

## Compare
# Results won't be equal due to the different way
# parallel() handles set.seed() randomization
all.equal(Delta_ln, Delta_ln3)
time.one
time.two
time.three
as.numeric(time.three) / as.numeric(time.one)

Combining parallel() with the .Internal() calls reduces the elapsed time by about 70% on Ubuntu Natty. multicore is not available for Windows, or at least not easily available for Windows. But maybe the true R wizards have better ideas.

Jeremy
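A note for later readers: the multicore package was subsequently folded into base R's parallel package (R >= 2.14), so on Unix-alikes the same idea can be sketched with mclapply; the sizes and core count here are illustrative only.

library(parallel)                       # ships with R >= 2.14
DCF_korrigiert <- rexp(1000)            # stand-in for the corrected DCF values
D <- 10000
Delta_ln <- unlist(mclapply(seq_len(D), function(i)
  log(mean(sample(DCF_korrigiert, 100, replace = TRUE)) / (1/0.10)),
  mc.cores = 2))                        # forks; on Windows use parLapply instead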
[R] Speed up plotting to MSWindows graphics window
Hello,

I am working on a project analysing the performance of motor vehicles through messages logged over a CAN bus. I am using R 2.12 on Windows XP and 7.

I am currently plotting the data in R, overlaying 5 or more plots of data logged at 1 kHz (using plot.ts() and par(new = TRUE)). The aim is to be able to pan, zoom in and out, and get values from the plotted graph using a custom Qt interface that is used as a front end to R.exe (all this works). The plot is drawn by R directly to the Windows graphics device.

The data is imported from a .csv file (typically around 100 MB) to a matrix: (timestamp, message ID, byte0, byte1, ..., byte7). I then separate this matrix into several by message ID (dimensions are in the order of 8 cols, 10^6 rows).

The panning is done by redrawing the plots, shifted by a small amount, so as to view a window of data from a second to a minute long that can travel the length of the logged data.

My problem is that the redrawing of the plots whilst panning is too slow when dealing with this much data, i.e. I can see the last graphs being drawn to the screen in the half-second following the view change. I need a fluid change from one view to the next.

My question is this: are there ways to speed up the plotting on the MSWindows display?
- By reducing plotted point densities to *sensible* values?
- Using something other than plot.ts() - is the lattice package faster? I don't need publication quality plots, they can be rougher...

I have tried:
- Using matrices instead of dataframes - (works for calculations but not enough for plots)
- increasing the max usable memory (max-mem-size) - (no change)
- increasing the size of the pointer protection stack (max-ppsize) - (no change)
- deleting the unnecessary leftover matrices - (no change)
- I can't use lines() instead of plot() because of the very different scales (rpm in the thousands, flags -1 to 3)

I am going to do some resampling of the logged data to reduce the vector sizes (removal of *less* important data and use of window.ts()). But I am currently running out of ideas... So if somebody could point out something, I would be grateful.

Thanks,
Jonathan Gabris
Re: [R] Speed up plotting to MSWindows graphics window
Jonathan Gabris wrote:
> My problem is that the redrawing of the plots whilst panning is too slow when dealing with this much data. [...] Are there ways to speed up the plotting on the MSWindows display?

I don't think there are any ways to plot in the standard device that are significantly faster than what you are doing if you want to see the updates. (I think it would be substantially faster if you hid the graphics window during the updates, but that won't suit you.)

I'd suggest plotting a subset of the data during the updates, then plot the full dataset when it stops moving. For example, only plot a few hundred points, evenly spaced through the time series.

Duncan Murdoch
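A minimal sketch of this subset-while-panning idea, with hypothetical vectors tm (timestamps) and y (one logged signal):

## coarse preview: ~500 evenly spaced points while the view is moving
idx <- unique(round(seq(1, length(tm), length.out = 500)))
plot(tm[idx], y[idx], type = "l")

## ...once panning stops, redraw at full resolution
plot(tm, y, type = "l")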
Re: [R] Speed up plotting to MSWindows graphics window
> Are there ways to speed up the plotting on the MSWindows display? By reducing plotted point densities to *sensible* values?

Well, hard to know, but it would help to know where all the time is going. Usually people start complaining when VM thrashing is common, but if you are CPU limited you could try restricting the range of data you want to plot rather than relying on plot() to just clip the largely irrelevant points when you are zoomed in. It should not be too expensive to find the limits either incrementally or with binary search on an ordered time series. Presumably subsetting is fast using foo[a:b,].

One thing you may want to try for a change of scale is wavelet or multi-resolution analysis. You can make a tree (increasing memory usage, but even VM here may not be a big penalty if coherence is high) and display the resolution appropriate for the current scale.
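A sketch of the binary-search subsetting suggested here: for a matrix m whose first column holds sorted timestamps, findInterval() locates the window limits in O(log n), so only the visible rows reach plot(). The names m, a and b and the boundary handling are illustrative:

## a and b are the current window limits on the time axis
lo <- max(1, findInterval(a, m[, 1]))  # roughly the last row with timestamp <= a
hi <- findInterval(b, m[, 1])          # last row with timestamp <= b
win <- m[lo:hi, , drop = FALSE]
plot(win[, 1], win[, 2], type = "l")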
Re: [R] Speed up plotting to MSWindows graphics window
On 27.04.2011 12:56, Duncan Murdoch wrote:
> I'd suggest plotting a subset of the data during the updates, then plot the full dataset when it stops moving. For example, only plot a few hundred points, evenly spaced through the time series.

... and it highly depends on the data what can be improved. Example: for signals essentially consisting of sine functions (i.e. harmonic signals), I am using a little dirty trick in the tuneR package, but that makes the assumption of having a high-frequency sample of a harmonic signal without too much noise.

Uwe Ligges
Re: [R] Speed up plotting to MSWindows graphics window
On 27/04/2011 13:18, Mike Marchywka wrote:
> It should not be too expensive to find the limits either incrementally or with binary search on an ordered time series. Presumably subsetting is fast using foo[a:b,]. [...] One thing you may want to try for a change of scale is wavelet or multi-resolution analysis.

I forgot to add, for plotting I use a command similar to:

plot.ts(timestampVector, dataVector, xlim = c(a, b))

a and b are timestamps from timestampVector.

Is the xlim parameter sufficient for limiting the scope of the plots? Or should I subset the time series each time I do a plot?

The multi-resolution analysis looks interesting. I shall spend some time finding out how to use the wavelets package.

Cheers!
Re: [R] Speed up plotting to MSWindows graphics window
> Is the xlim parameter sufficient for limiting the scope of the plots? Or should I subset the time series each time I do a plot?

Well, maybe the time-series code knows the data to be ordered, I never use that, but in general it has to go check each point and clip the out-of-range ones. It could, I suppose, binary search for start/end points, but I don't know. Based on what you said, it sounds like it does not.
Re: [R] Speed up sum of outer products?
Hi Stefan,

That's really interesting - I never thought of trying to benchmark Linux-64 against OSX (a friend who works on large databases says OSX performs better than Linux in his work!). Thanks for posting your comparison, and your hints :)

i) I guess you have a very fast CPU (Core i7 or so, I guess?)
- only a quad core i5, but I'm trying to get access to a quad core i7; might make a difference for openCL code?

ii) a very poor BLAS implementation
- I installed the latest ATLAS package for Ubuntu 10.04 LTS, which gives a x6 speed up?? I'm tempted to try recompiling R-2.12.2 linked to the MKL (which I guess the vecLib BLAS library uses?), but it seems a tricky thing to do?? To be honest I'm not sure how this new ATLAS library works, i.e. is it sequential or multithreaded?

iii) and a desktop graphics card
- installed a GTX570 today, which has 480 CUDA cores; my previous card had 16 cores and half the bandwidth.

The results of a setup with the new ATLAS library and GTX570 are a pleasant improvement :)

   user  system elapsed    -- for loop, single thread
 29.790   7.400  37.243
   user  system elapsed    -- new ATLAS, t(X)%*%X
  1.480   0.000   1.479
   user  system elapsed    -- new ATLAS, crossprod(X)
  0.740   0.000   0.739
   user  system elapsed    -- new GPU, gputools::gpuCrossprod(X)
  0.190   0.040   0.228

I would be really interested to find out what the results would be on an OSX machine with a fancy GPU. I read that a 2x512-core card is going to be released by Nvidia in the next couple of weeks, and CUDA 4.0 is due for public release in a few months. So maybe you want to keep CUDA on your radar?

I managed to write my first R function/package using CUDA code at the weekend. It's a fairly simple but tedious process once you have some CUDA code which compiles and all you want to do is to port it to R (in the Unix case at least). For example, you can write a simple C wrapper along the lines of the rinterface.c code in gputools. Then modify the Makefile.in and configure.ac files in this package as required, and you should be set to configure, make and install into R.

I'm working on non-parametric regression and optimization at the moment, and the speed up using CUDA has been worth the effort :)

All the best,
Ajay
Re: [R] Speed up sum of outer products?
Hi Dennis,

Sorry for the delayed reply, and thanks for the article. I dug into it and found that if you have a GPU, the CUBLAS library beats the BLAS/ATLAS implementation in the Matrix package for 'large' problems. Here's what I mean:

its = 2500
dim = 1750
X = matrix(rnorm(its*dim), its, dim)
system.time({C = matrix(0, dim, dim); for (i in 1:its) C = C + (X[i,] %o% X[i,])})  # single thread - break up calculation
system.time({C1 = t(X) %*% X})    # single thread - BLAS matrix mult
system.time({C2 = crossprod(X)})  # single thread - BLAS matrix mult
library(gputools)
system.time({C3 = gpuCrossprod(X, X)})  # multithread - CUBLAS cublasSgemm function
print(all.equal(C, C1, C2, C3))

   user  system elapsed
 27.210   6.680  33.342
   user  system elapsed
  6.260   0.000   5.982
   user  system elapsed
  4.340   0.000   4.284
   user  system elapsed
   1.49    0.00    1.48
[1] TRUE

The last line shows a x3 speed up, using my dated graphics card, which has 16 cores, compared to my CPU, which is a quad core. I should be able to try this out on a 512-core card in the next few days, and will post the result.

All the best,
Aj
Re: [R] Speed up sum of outer products?
Hi Ajay,

Thanks for this comparison, which prodded me to give CUDA another try on my now somewhat aging MacBook Pro.

> Hi Dennis, sorry for the delayed reply and thanks for the article. I dug into it and found that if you have a GPU, the CUBLAS library beats the BLAS/ATLAS implementation in the Matrix package for 'large' problems.

I guess you have a very fast CPU (Core i7 or so, I guess?), a very poor BLAS implementation and a desktop graphics card?

>    user  system elapsed    -- for loop, single thread
>  27.210   6.680  33.342
>    user  system elapsed    -- BLAS mat mult
>   6.260   0.000   5.982
>    user  system elapsed    -- BLAS crossprod
>   4.340   0.000   4.284
>    user  system elapsed    -- CUDA gpuCrossprod
>    1.49    0.00    1.48

Just to put these numbers in perspective, here are my results for a MacBook Pro running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia GeForce 8600M GT with 512 MB RAM -- I suppose it's the M that breaks my performance here).

    user  system elapsed    -- for loop, single thread
 141.034  35.299 153.783
    user  system elapsed    -- BLAS mat mult
   2.791   0.025   1.805
    user  system elapsed    -- BLAS crossprod
   1.419   0.039   0.863
    user  system elapsed    -- CUDA gpuCrossprod
   1.431   0.119   1.718

As you can see, my CPU/RAM is about 5x slower than your machine, CUDA is slightly slower (my card has 32 cores, but may have lower memory bandwidth and/or clock rate if yours is a desktop card), but vecLib BLAS beats CUDA by a factor of 2.

Kudos to the gputools developers: despite what the README says, the package compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release 3.2. Thanks for this convenient package!

Best regards,
Stefan Evert
[ stefan.ev...@uos.de | http://purl.org/stefan.evert ]
[R] Speed up sum of outer products?
Hi, I'm new to R and stats, and I'm trying to speed up the following sum:

for (i in 1:n){
  C = C + (X[i,] %o% X[i,])  # the sum of outer products - this is very slow according to Rprof()
}

where X is a data matrix (nrows=1000 x ncols=50), and n=1000. The sum has to be calculated over 10,000 times for different X. I think it is similar to estimating a covariance matrix for demeaned data X. I tried using cov, but got different answers, and it wasn't much quicker.

Any help gratefully appreciated,
Re: [R] Speed up sum of outer products?
What you're doing is breaking up the calculation of X'X into n steps. I'm not sure what you mean by very slow:

X = matrix(rnorm(1000*50), 1000, 50)
n = 1000
system.time({C = matrix(0, 50, 50); for (i in 1:n) C = C + (X[i,] %o% X[i,])})
   user  system elapsed
  0.096   0.008   0.104

Of course, you could just do the calculation directly:

system.time({C1 = t(X) %*% X})
   user  system elapsed
  0.008   0.000   0.007
all.equal(C, C1)
[1] TRUE

 - Phil Spector
   Statistical Computing Facility
   Department of Statistics
   UC Berkeley
   spec...@stat.berkeley.edu
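For reference, the identity behind this shortcut, with $x_i$ denoting the $i$-th row of $X$ written as a column vector:

\sum_{i=1}^{n} x_i x_i^{\top} = X^{\top} X

so the n-step accumulation and the single matrix product compute the same 50 x 50 matrix.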
Re: [R] Speed up sum of outer products?
Isn't the following the canonical (R-ish) way of doing this:

X = matrix(rnorm(1000*50), 1000, 50)
system.time({C1 = t(X) %*% X})  # Phil's example
C2 <- crossprod(X)              # use crossprod instead
all.equal(C1, C2)
[1] TRUE
Re: [R] Speed up sum of outer products?
Hey, thanks a lot guys!!! That really speeds things up!!! I didn't know %*% and crossprod could operate on matrices. I think you've saved me hours in calculation time. Thanks again.

system.time({C = matrix(0, 50, 50); for (i in 1:n) C = C + (X[i,] %o% X[i,])})
   user  system elapsed
   0.45    0.00    0.90
system.time({C1 = t(X) %*% X})
   user  system elapsed
   0.02    0.00    0.05
system.time({C2 = crossprod(X)})
   user  system elapsed
   0.02    0.00    0.02
Re: [R] Speed up sum of outer products?
...and this is where we cue the informative article on least squares calculations in R by Doug Bates:

http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf

HTH,
Dennis
[R] speed up process
Dear users,

I have a double for loop that does exactly what I want, but is quite slow. It is not so slow with this simplified example, but with my real data it is. Can anyone help me improve it?

The data and the code for foo_reg() are available at the end of the email; I preferred going directly into the problematic part. Here is the code (I tried to simplify it, but I cannot do it too much or else it wouldn't represent my problem). It might also look too complex for what it is intended to do, but my colleagues who are also supposed to use it don't know much about R. So I wrote it so that they don't have to modify the critical parts to run the script for their needs.

#column indexes for function
ind.xvar <- 2
seq.yvar <- 3:4

#position vector for legend(), stupid positioning but it doesn't matter here
mypos <- c("topleft", "topright", "bottomleft")

#run the function for columns 3-4 as y (seq.yvar) with column 2 as x (ind.xvar) for all 3 datasets (mydata_list)
par(mfrow=c(2,1))
for (i in seq_along(seq.yvar)){
  k <- seq.yvar[i]
  plot(mydata1[[k]] ~ mydata1[[ind.xvar]], type="p", xlab=names(mydata1)[ind.xvar], ylab=names(mydata1)[k])
  for (j in seq_along(mydata_list)){
    foo_reg(dat=mydata_list[[j]], xvar=ind.xvar, yvar=k, mycol=j, pos=mypos[j], name.dat=names(mydata_list)[j])
  }
}

I tried with lapply() or mapply() but couldn't manage to pass the arguments for names() and col= correctly, e.g. for the 2nd loop:

lapply(mydata_list, FUN=function(x){foo_reg(dat=x, xvar=ind.xvar, yvar=k, col1=1:3, pos=mypos[1:3], name.dat=names(x)[1:3])})

mapply(FUN=function(x) {foo_reg(dat=x, name.dat=names(x)[1:3])}, mydata_list, col1=1:3, pos=mypos, MoreArgs=list(xvar=ind.xvar, yvar=k))

Thanks in advance for any hints.

Ivan

#create data (it looks horrible with these datasets but it doesn't matter here)
mydata1 <- structure(list(species = structure(1:8, .Label = c("alsen", "gogor", "loalb", "mafas", "pacyn", "patro", "poabe", "thgel"), class = "factor"), fruit = c(0.52, 0.45, 0.43, 0.82, 0.35, 0.9, 0.68, 0), Asfc = c(207.463765, 138.5533755, 70.4391735, 160.9742745, 41.455809, 119.155109, 26.241441, 148.337377), Tfv = c(47068.1437773483, 43743.8087431582, 40323.5209129239, 23420.9455581495, 29382.6947428651, 50460.2202192311, 21810.1456510625, 41747.6053810881)), .Names = c("species", "fruit", "Asfc", "Tfv"), row.names = c(NA, 8L), class = "data.frame")
mydata2 <- mydata1[!(mydata1$species %in% c("thgel", "alsen")), ]
mydata3 <- mydata1[!(mydata1$species %in% c("thgel", "alsen", "poabe")), ]
mydata_list <- list(mydata1=mydata1, mydata2=mydata2, mydata3=mydata3)

#function for regression
library(WRS)
foo_reg <- function(dat, xvar, yvar, mycol, pos, name.dat){
  tsts <- tstsreg(dat[[xvar]], dat[[yvar]])
  tsts_inter <- signif(tsts$coef[1], digits=3)
  tsts_slope <- signif(tsts$coef[2], digits=3)
  abline(tsts$coef, lty=1, col=mycol)
  legend(x=pos, legend=c(paste("TSTS ", name.dat, ": Y=", tsts_inter, "+", tsts_slope, "X", sep="")), lty=1, col=mycol)
}

--
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de
**
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php
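For the second loop, a sketch of how mapply() could be handed the per-dataset arguments directly: the list, colours, positions and names are vectorized over, while the fixed arguments go in MoreArgs (k being the outer-loop index as before). This matches foo_reg()'s signature as defined in the message above, but is otherwise untested:

mapply(foo_reg,
       dat      = mydata_list,
       mycol    = seq_along(mydata_list),
       pos      = mypos,
       name.dat = names(mydata_list),
       MoreArgs = list(xvar = ind.xvar, yvar = k))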
Re: [R] speed up process
Simply avoiding the for loops by using lapply (I may have missed a bracket here or there, because I did this without opening R)... Haven't checked the speed up, though.

lapply(seq.yvar, function(k){
  plot(mydata1[[k]] ~ mydata1[[ind.xvar]], type="p", xlab=names(mydata1)[ind.xvar], ylab=names(mydata1)[k])
  lapply(seq_along(mydata_list), function(j){
    foo_reg(dat=mydata_list[[j]], xvar=ind.xvar, yvar=k, mycol=j, pos=mypos[j], name.dat=names(mydata_list)[j])
    return(NULL)
  })
  invisible(NULL)
})

HTH,

Nick Sabbe
--
ping: nick.sa...@ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36

-- Do Not Disapprove
Re: [R] speed up process
Thanks Nick for your quick answer. It does work (no missed bracket!) but unfortunately doesn't really speed up anything: with my real data, it takes 82.78 seconds with the double lapply() instead of 83.59 s with the double loop (a difference of about 0.8 s). It looks like my double loop was not that bad.

Does anyone know another, faster way to do this?

Thanks again in advance,
Ivan
Re: [R] speed up process
Use Rprof to find where time is being spent - probably in 'plot', which might imply it is not the 'for' loop and therefore beyond your control.

Sent from my iPad
Re: [R] speed up process
Dear Jim,

I've tried to use Rprof() as you advised me, but I don't understand how it works. I've done this:

Rprof(for (i in seq_along(seq.yvar)){
  all_my_commands
})
summaryRprof()

But I got this error:

Error in summaryRprof() : no lines found in 'Rprof.out'

I couldn't really understand from the help page what I should do. In any case, it is clear that the function tstsreg() is what takes the most computing time. But I wanted to optimize the rest of the code to gain as much speed as possible.

Ivan
for the 2nd loop: lapply(mydata_list, FUN=function(x){foo_reg(dat=x, xvar=ind.xvar, yvar=k, col1=1:3, pos=mypos[1:3], name.dat=names(x)[1:3])}) mapply(FUN=function(x) {foo_reg(dat=x, name.dat=names(x)[1:3])}, mydata_list, col1=1:3, pos=mypos, MoreArgs=list(xvar=ind.xvar, yvar=k)) Thanks in advance for any hints. Ivan #create data (it looks horrible with these datasets but it doesn't matter here) mydata1- structure(list(species = structure(1:8, .Label = c(alsen, gogor, loalb, mafas, pacyn, patro, poabe, thgel), class = factor), fruit = c(0.52, 0.45, 0.43, 0.82, 0.35, 0.9, 0.68, 0), Asfc = c(207.463765, 138.5533755, 70.4391735, 160.9742745, 41.455809, 119.155109, 26.241441, 148.337377), Tfv = c(47068.1437773483, 43743.8087431582, 40323.5209129239, 23420.9455581495, 29382.6947428651, 50460.2202192311, 21810.1456510625, 41747.6053810881)), .Names = c(species, fruit, Asfc, Tfv), row.names = c(NA, 8L), class = data.frame) mydata2- mydata1[!(mydata1$species %in% c(thgel,alsen)),] mydata3- mydata1[!(mydata1$species %in% c(thgel,alsen,poabe)),] mydata_list- list(mydata1=mydata1, mydata2=mydata2, mydata3=mydata3) #function for regression library(WRS) foo_reg- function(dat, xvar, yvar, mycol, pos, name.dat){ tsts- tstsreg(dat[[xvar]], dat[[yvar]]) tsts_inter- signif(tsts$coef[1], digits=3) tsts_slope- signif(tsts$coef[2], digits=3) abline(tsts$coef, lty=1, col=mycol) legend(x=pos, legend=c(paste(TSTS ,name.dat,: Y=,tsts_inter,+,tsts_slope,X,sep=)), lty=1, col
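[Editor's note: the mapply() attempt above fails for two identifiable reasons: name.dat=names(x)[1:3] looks up the column names of a single data frame rather than the names of mydata_list, and foo_reg() has no col1 argument (it is mycol). A version passing the per-dataset arguments as parallel vectors might look like the following sketch. It reuses the objects from the post and would be called in place of the inner for loop, with k set and a plot already open; it is not benchmarked.]

# mapply() recycles its vector arguments in parallel with mydata_list, so
# each foo_reg() call receives the matching dataset, colour, position and name.
mapply(foo_reg,
       dat      = mydata_list,
       mycol    = seq_along(mydata_list),
       pos      = mypos[seq_along(mydata_list)],
       name.dat = names(mydata_list),
       MoreArgs = list(xvar = ind.xvar, yvar = k))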
Re: [R] speed up process
You invoke Rprof(), run your code, and then terminate it:

Rprof()
... code you want to profile ...
Rprof(NULL)
# generate output
summaryRprof()

Example:

Rprof()
for (i in 1:1e6) sin(i) + cos(i) + sqrt(i)
Rprof(NULL)
summaryRprof()

$by.self
     self.time self.pct total.time total.pct
sin       0.24    30.77       0.24     30.77
sqrt      0.22    28.21       0.22     28.21
cos       0.16    20.51       0.16     20.51
+         0.14    17.95       0.14     17.95
:         0.02     2.56       0.02      2.56

$by.total
     total.time total.pct self.time self.pct
sin        0.24     30.77      0.24    30.77
sqrt       0.22     28.21      0.22    28.21
cos        0.16     20.51      0.16    20.51
+          0.14     17.95      0.14    17.95
:          0.02      2.56      0.02     2.56

$sample.interval
[1] 0.02

$sampling.time
[1] 0.78

On Fri, Feb 25, 2011 at 6:57 AM, Ivan Calandra <ivan.calan...@uni-hamburg.de> wrote:
Dear Jim, I've tried to use Rprof() as you advised me, but I don't understand how it works. [...]
Re: [R] speed up process
Ha... it was way too simple! I thought it would be like system.time()... my bad. Thanks for the tip!

As we thought, foo_reg() takes most of the computing time, and I cannot improve that. Any ideas of how to improve the rest?

Thanks again for your help,
Ivan

On 2/25/2011 14:29, jim holtman wrote:
You invoke Rprof(), run your code, and then terminate it. [...]
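[Editor's note: one direction not tried in the thread, as a speculative sketch only: since tstsreg() dominates the run time and the graphics calls must stay serial, the expensive fits can be precomputed in parallel and only drawn afterwards. This assumes the multicore package of the time (whose mclapply() later moved into base R's parallel package), a Unix-like OS for forking, and the objects from the original post; it is not benchmarked.]

library(multicore)  # from R 2.14 on, library(parallel) provides the same mclapply()

# Enumerate every (dataset, y-variable) pair so one mclapply() call
# replaces both nested loops for the costly tstsreg() fits.
grid <- expand.grid(j = seq_along(mydata_list), k = seq.yvar)
fits <- mclapply(seq_len(nrow(grid)), function(r) {
  d <- mydata_list[[grid$j[r]]]
  tstsreg(d[[ind.xvar]], d[[grid$k[r]]])
})

# Drawing stays serial: graphics devices cannot be shared across forked workers.
par(mfrow = c(2, 1))
for (r in seq_len(nrow(grid))) {
  j <- grid$j[r]; k <- grid$k[r]
  if (j == 1)   # the first dataset of each y-variable opens a new panel
    plot(mydata1[[k]] ~ mydata1[[ind.xvar]], type = "p",
         xlab = names(mydata1)[ind.xvar], ylab = names(mydata1)[k])
  abline(fits[[r]]$coef, lty = 1, col = j)   # legend() calls omitted for brevity
}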
Re: [R] speed up the code
Yes: remove the call to intersect() and rely on the result of match() to tell you whether there is an overlap. If there are any matches, all(is.na(index)) will be FALSE. Read the help for match:

?match

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Hui Du
Sent: Wednesday, February 16, 2011 6:29 PM
To: r-help@r-project.org
Subject: [R] speed up the code

Hi All, The following is a snippet of my code. It works fine but it is very slow. Is it possible to speed it up by using a different data structure or a better solution? For 4 runs, it takes 8 minutes now. [...]
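[Editor's note: concretely, that advice might look like the following sketch (not benchmarked against the original). Because s_k and s_hat contain no duplicated values here, match(s_hat, s_k) both detects and locates the common elements in one pass, making intersect() redundant.]

fun_activation <- function(s_k, s_hat, x_k, s_hat_len) {
  index <- match(s_hat, s_k)     # NA wherever an element of s_hat is absent from s_k
  hits  <- index[!is.na(index)]  # positions in s_k of the common elements
  if (length(hits) == 0L) return(0)
  round(sum(x_k[hits]) * length(hits) / (s_hat_len * length(s_k)), 3)
}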
[R] speed up the code
Hi All,

The following is a snippet of my code. It works fine, but it is very slow. Is it possible to speed it up by using a different data structure or a better solution? For 4 runs, it takes 8 minutes now. Thanks a lot.

fun_activation = function(s_k, s_hat, x_k, s_hat_len) {
  common = intersect(s_k, s_hat);
  if (length(common) != 0) {
    index = match(common, s_k);
    round(sum(x_k[index]) * length(common) / (s_hat_len * length(s_k)), 3);
  } else {
    0;
  }
}

fun_x = function(a) {
  round(runif(length(a), 0, 1), 2);
}

symbol_len = 50;
PHI_set = 1:symbol_len;
# Note: M (the matrix dimension) was not defined in the snippet as posted.
S = matrix(replicate(M * M, sort(sample(PHI_set, sample(symbol_len, 1)))), M, M);
X = matrix(mapply(fun_x, S), M, M);
S_hat = c(28, 34, 35);
S_hat_len = length(S_hat);
S_hat_matrix = matrix(list(S_hat), M, M);

system.time(
  for (I in 1:4) {
    A = matrix(mapply(fun_activation, S, S_hat_matrix, X, S_hat_len), M, M);
  }
)

HXD
Re: [R] speed up subsetting with certain conditions
On 1/12/11 6:44 PM, Duke wrote:
Thanks so much for your suggestion Martin. I had Bioconductor installed but I honestly do not know all its applications. Anyway, I am testing GenomicRanges with my data now. I will report back when I get the result.

I got the results: my code took ~580 min (~10 hrs) to finish, whereas using GenomicRanges as Martin suggested took only 22 min (about 26 times faster!). Thanks so much for this improvement, Martin.

D.
[R] speed up subsetting with certain conditions
Hi folks,

I am working on a project that requires subsetting a "found" file based on some "known" file. The known file contains several lines like below:

chr1  3237546  3237547  rs52310428  0  +
chr1  3237549  3237550  rs52097582  0  +
chr2  4513326  4513327  rs29769280  0  +
chr2  4513337  4513338  rs33286009  0  +

where the first column can be chr2, chr1, chr12, etc. The second and third columns are numbers (coordinates). The found file contains lines like:

chr1  3213435  G  C
chr1  3237547  T  C
chr1  3237549  G  T
chr2  4513326  A  G
chr2  4513337  C  G

where the first column, again, can be chr1, chr2, chr12, etc., and the second is a number. What I have to do is separate the found file into two files: one (foundY) containing the lines whose first column matches and whose second column lies within columns 2 and 3 of some line of the known file, and one (foundN) containing the lines that do not meet that condition. For the two examples above, foundN would be the first line, and foundY the next 4 lines. The algorithm I came up with is:

* get the unique items in the first column of the found file (chr1, chr2, chr12, chr13, etc.)
* for each unique item, take the subsets of the known and found files that have the same first column, then scan each item in the known subset to see whether any line meets the condition

The code is like below:

########## CODE START ##########
# import known and found files to data frames
known <- read.table("known.txt", sep="\t", header=FALSE)
found <- read.table("found.txt", sep="\t", header=FALSE, fill=TRUE)

# get the unique items in the first column of the found file
found.Chr <- as.character(found[!duplicated(found[[1]]), 1])

# create two empty result data frames
foundN <- found[0, ]
foundY <- found[0, ]

# scan for each of the unique items
for (iChr in found.Chr) {
  # subsets of known and found with this specific item
  found.iChr <- found[found[[1]] == iChr, ]
  known.iChr <- known[known[[1]] == iChr, ]
  # scan through all found subset items
  if (nrow(known.iChr) > 0) {
    for (i in 1:nrow(found.iChr)) {
      if (nrow(known.iChr[known.iChr[[3]] >= found.iChr[i, 2] &
                          known.iChr[[2]] <= found.iChr[i, 2], ]) == 0) {
        foundN <- rbind(foundN, found.iChr[i, ])
      } else {
        foundY <- rbind(foundY, found.iChr[i, ])
      }
    }
  }
}
########## CODE END ##########

The code works well, but I have tested it only with small known and found files. With larger files (the known file can contain ~15 million lines, the found file ~15k lines), it takes hours to run. I want to speed up the process, and I believe there must be a better algorithm to do this in R. My questions are:

* does anybody have a better algorithm, or comments or suggestions?
* I read (via Google) that matrices are faster than data frames. Can I use matrices in this case? (Are matrices for numbers only?)
* I read (via Google) that I should avoid rbind and preallocate the data frame for faster speed. How would I do that in this case?

Thank you very much in advance,
Bests,
D.
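[Editor's note: on the rbind()/preallocation question, which the thread leaves open: a plain base-R sketch (untested at the 15-million-line scale) that fills one preallocated logical vector and splits found a single time at the end, instead of growing foundY/foundN row by row. Note that, unlike the posted loop, it also routes rows whose chromosome is absent from the known file into foundN rather than dropping them.]

# One preallocated logical flag per row of 'found'; no rbind() in the loop.
inRange <- logical(nrow(found))

for (iChr in unique(found[[1]])) {
  f.idx <- which(found[[1]] == iChr)
  known.iChr <- known[known[[1]] == iChr, ]
  if (nrow(known.iChr) > 0) {
    # For each found position on this chromosome, test whether it falls
    # inside any [start, end] interval of the known subset.
    inRange[f.idx] <- vapply(found[f.idx, 2],
                             function(p) any(known.iChr[[2]] <= p & p <= known.iChr[[3]]),
                             logical(1))
  }
}

foundY <- found[inRange, ]   # built once, with single subset operations
foundN <- found[!inRange, ]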
Re: [R] speed up subsetting with certain conditions
On 1/12/2011 2:52 PM, Duke wrote:
Hi folks, I am working on a project that requires subsetting a "found" file based on some "known" file. [...] Does anybody have a better algorithm, or comments or suggestions?

The Bioconductor project has many tools for dealing with sequence-related data. With the data

k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))

f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))

one might use the GenomicRanges package as

library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast. One could write foundY with as.data.frame(fgr[idx]) (maybe with a little editing), but likely one would want to stay in R / Bioconductor and do something more interesting... See http://bioconductor.org/install/index.html

Martin
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024
Seattle, WA 98109
Re: [R] speed up subsetting with certain conditions
On 1/12/11 6:12 PM, Martin Morgan wrote:
The Bioconductor project has many tools for dealing with sequence-related data. [...] This will be fast.

Thanks so much for your suggestion Martin. I had Bioconductor installed, but I honestly do not know all its applications. Anyway, I am testing GenomicRanges with my data now. I will report back when I get the result.

One could write foundY with as.data.frame(fgr[idx]) (maybe with a little editing), but likely one would want to stay in R / Bioconductor and do something more interesting...

I suppose foundN <- as.data.frame(fgr[!idx]) and foundY <- as.data.frame(fgr[idx]), as you suggested, but I don't really understand your last comment :)

Thanks,
D.
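[Editor's note: for completeness, writing the two subsets back out to disk could look like the sketch below; the output file names are assumed for illustration. Note that as.data.frame() on a GRanges yields seqnames/start/end/width/strand plus the metadata columns, so the layout differs slightly from the original found file.]

# Split on the overlap flag and write tab-separated files.
foundY <- as.data.frame(fgr[idx])
foundN <- as.data.frame(fgr[!idx])
write.table(foundY, "foundY.txt", sep="\t", quote=FALSE, row.names=FALSE)
write.table(foundN, "foundN.txt", sep="\t", quote=FALSE, row.names=FALSE)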