Re: [R] Speed up studentized confidence intervals ?

2022-01-03 Thread varin sacha via R-help
Dear John, Dear Rui,

Thank you very much for your R code.

Best,
SV




On Thursday, 30 December 2021 at 05:25:11 UTC+1, Fox, John wrote:





Dear varin sacha,

You didn't correctly adapt the code to the median. The outer call to mean() in 
the last line shouldn't be replaced with median() -- it computes the proportion 
of intervals that include the population median.

As well, you can't rely on the asymptotics of the bootstrap for a nonlinear 
statistic like the median with an n as small as 5, as your example, properly 
implemented (and with the code slightly cleaned up), illustrates:

> library(boot)
> set.seed(123)
> s <- rgamma(n=10, shape=2, rate=5)
> (m <- median(s))
[1] 0.3364465
> N <- 1000
> n <- 5
> set.seed(321)
> out <- replicate(N, {
+  dat <- data.frame(sample(s, size=n))
+  med <- function(d, i) {
+    median(d[i, ])
+  }
+  boot.out <- boot(data = dat, statistic = med, R = 1)
+  boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> #coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.758


You do get the expected coverage, however, for a larger sample, here with n = 
100:

> N <- 1000
> n <- 100
> set.seed(321)
> out <- replicate(N, {
+  dat <- data.frame(sample(s, size=n))
+  med <- function(d, i) {
+    median(d[i, ])
+  }
+  boot.out <- boot(data = dat, statistic = med, R = 1)
+  boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> #coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.952

I hope this helps,
John

-- 
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http://socserv.mcmaster.ca/jfox/




On 2021-12-29, 2:09 PM, "R-help on behalf of varin sacha via R-help" 
 wrote:

    Dear David,
    Dear Rui,

    Many thanks for your response. It works perfectly for the mean. Now I have
a problem with my R code for the median, because I always get a coverage
probability of 1 (100%), which is more than strange. Indeed, an interval whose
lower limit is the smallest value in the sample and whose upper limit is the
largest value has a 1/32 + 1/32 = 1/16 probability of non-coverage, so the
confidence of such an interval is 15/16 rather than 1 (100%). I therefore
suspect that the confidence interval I use for the median is not correctly
defined for n=5 observations and likely contains all observations in the
sample. What is wrong with my R code?

    
    library(boot)

    s=rgamma(n=10,shape=2,rate=5)
    median(s)

    N <- 100
    out <- replicate(N, {
    a<- sample(s,size=5)
    median(a) 

    dat<-data.frame(a)
    med<-function(d,i) {
    temp<-d[i,]
    median(temp)
    }

      boot.out <- boot(data = dat, statistic = med, R = 1)
      boot.ci(boot.out, type = "bca")$bca[, 4:5]
    })

    #coverage probability
    median(out[1, ] < median(s) & median(s) < out[2, ])
    




    On Thursday, 23 December 2021 at 14:10:36 UTC+1, Rui Barradas wrote:





    Hello,

    The code is running very slowly because you are recreating the function 
    in the replicate() loop and because you are creating a data.frame also 
    in the loop.

    And because in the bootstrap statistic function med() you are computing 
    the variance of yet another loop. This is probably statistically wrong 
    but like David says, without a problem description it's hard to say.

    Also, why compute variances if they are never used?

    Here is complete code executing in much less than 2:00 hours. Note that 
    it passes the vector a directly to med(), not a df with just one column.


    library(boot)

    set.seed(2021)
    s <- sample(178:798, 10, replace = TRUE)
    mean(s)

    med <- function(d, i) {
      temp <- d[i]
      f <- mean(temp)
      g <- var(temp)
      c(Mean = f, Var = g)
    }

    N <- 1000
    out <- replicate(N, {
      a <- sample(s, size = 5)
      boot.out <- boot(data = a, statistic = med, R = 1)
      boot.ci(boot.out, type = "stud")$stud[, 4:5]
    })
    mean(out[1, ] < mean(s) & mean(s) < out[2, ])
    #[1] 0.952



    Hope this helps,

    Rui Barradas

    On 19/12/21 at 11:45, varin sacha via R-help wrote:
    > Dear R-experts,
    > 
    > Here below my R code working but really really slowly ! I need 2 hours 
with my computer to finally get an answer ! Is there a way to improve my R code 
to speed it up ? At least to win 1 hour ;=)
    > 
    > Many thanks
    > 
    > 
    > library(boot)
    > 
    > s<- sample(178:798, 10, replace=TRUE)
    > mean(s)
    > 
    > N <- 1000
    > out <- replicate(N, {
    > a<- sample(s,size=5)
    > mean(a)
    > dat<-data.frame(a)
    > 
    > med<-function(d,i) {
    > temp<-d[i,]
    > f<-mean(temp)
    > g<-var(replicate(50,mean(sample(temp,replace=T))))
    > return(c(f,g))
    > 
    > }
    > 
    >    boot.out <- boot(data = dat, statistic = med, R = 1)
    >    boot.ci(boot.out, type = "stud")$stud[, 4:5]
    > })
    > mean(out[1,] < mean(s) & mean(s) < out[2,])

Re: [R] Speed up studentized confidence intervals ?

2021-12-29 Thread Fox, John
Dear varin sacha,

You didn't correctly adapt the code to the median. The outer call to mean() in 
the last line shouldn't be replaced with median() -- it computes the proportion 
of intervals that include the population median.

As well, you can't rely on the asymptotics of the bootstrap for a nonlinear 
statistic like the median with an n as small as 5, as your example, properly 
implemented (and with the code slightly cleaned up), illustrates:

> library(boot)
> set.seed(123)
> s <- rgamma(n=10, shape=2, rate=5)
> (m <- median(s))
[1] 0.3364465
> N <- 1000
> n <- 5
> set.seed(321)
> out <- replicate(N, {
+   dat <- data.frame(sample(s, size=n))
+   med <- function(d, i) {
+ median(d[i, ])
+   }
+   boot.out <- boot(data = dat, statistic = med, R = 1)
+   boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> #coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.758


You do get the expected coverage, however, for a larger sample, here with n = 
100:

> N <- 1000
> n <- 100
> set.seed(321)
> out <- replicate(N, {
+   dat <- data.frame(sample(s, size=n))
+   med <- function(d, i) {
+ median(d[i, ])
+   }
+   boot.out <- boot(data = dat, statistic = med, R = 1)
+   boot.ci(boot.out, type = "bca")$bca[, 4:5]
+ })
> #coverage probability
> mean(out[1, ] < m & m < out[2, ])
[1] 0.952

I hope this helps,
 John

-- 
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http://socserv.mcmaster.ca/jfox/
 
 


On 2021-12-29, 2:09 PM, "R-help on behalf of varin sacha via R-help" 
 wrote:

Dear David,
Dear Rui,

Many thanks for your response. It works perfectly for the mean. Now I have
a problem with my R code for the median, because I always get a coverage
probability of 1 (100%), which is more than strange. Indeed, an interval whose
lower limit is the smallest value in the sample and whose upper limit is the
largest value has a 1/32 + 1/32 = 1/16 probability of non-coverage, so the
confidence of such an interval is 15/16 rather than 1 (100%). I therefore
suspect that the confidence interval I use for the median is not correctly
defined for n=5 observations and likely contains all observations in the
sample. What is wrong with my R code?


library(boot)

s=rgamma(n=10,shape=2,rate=5)
median(s)

N <- 100
out <- replicate(N, {
a<- sample(s,size=5)
median(a) 

dat<-data.frame(a)
med<-function(d,i) {
temp<-d[i,]
median(temp)
}

  boot.out <- boot(data = dat, statistic = med, R = 1)
  boot.ci(boot.out, type = "bca")$bca[, 4:5]
})

#coverage probability
median(out[1, ] < median(s) & median(s) < out[2, ])





On Thursday, 23 December 2021 at 14:10:36 UTC+1, Rui Barradas wrote:





Hello,

The code is running very slowly because you are recreating the function 
in the replicate() loop and because you are creating a data.frame also 
in the loop.

And because in the bootstrap statistic function med() you are computing 
the variance of yet another loop. This is probably statistically wrong 
but like David says, without a problem description it's hard to say.

Also, why compute variances if they are never used?

Here is complete code executing in much less than 2:00 hours. Note that 
it passes the vector a directly to med(), not a df with just one column.


library(boot)

set.seed(2021)
s <- sample(178:798, 10, replace = TRUE)
mean(s)

med <- function(d, i) {
  temp <- d[i]
  f <- mean(temp)
  g <- var(temp)
  c(Mean = f, Var = g)
}

N <- 1000
out <- replicate(N, {
  a <- sample(s, size = 5)
  boot.out <- boot(data = a, statistic = med, R = 1)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])
#[1] 0.952



Hope this helps,

Rui Barradas

On 19/12/21 at 11:45, varin sacha via R-help wrote:
> Dear R-experts,
> 
> Here below my R code working but really really slowly ! I need 2 hours 
with my computer to finally get an answer ! Is there a way to improve my R code 
to speed it up ? At least to win 1 hour ;=)
> 
> Many thanks
> 
> 
> library(boot)
> 
> s<- sample(178:798, 10, replace=TRUE)
> mean(s)
> 
> N <- 1000
> out <- replicate(N, {
> a<- sample(s,size=5)
> mean(a)
> dat<-data.frame(a)
> 
> med<-function(d,i) {
> temp<-d[i,]
> f<-mean(temp)
> g<-var(replicate(50,mean(sample(temp,replace=T))))
> return(c(f,g))
> 
> }
> 
>boot.out <- boot(data = dat, statistic = med, R = 1)
>boot.ci(boot.out, type = "stud")$stud[, 4:5]
> })
> mean(out[1,] < mean(s) & mean(s) < out[2,])
> 

Re: [R] Speed up studentized confidence intervals ?

2021-12-29 Thread David Winsemius



On 12/29/21 11:08 AM, varin sacha via R-help wrote:

Dear David,
Dear Rui,

Many thanks for your response. It works perfectly for the mean. Now I have
a problem with my R code for the median, because I always get a coverage
probability of 1 (100%), which is more than strange. Indeed, an interval whose
lower limit is the smallest value in the sample and whose upper limit is the
largest value has a 1/32 + 1/32 = 1/16 probability of non-coverage, so the
confidence of such an interval is 15/16 rather than 1 (100%). I therefore
suspect that the confidence interval I use for the median is not correctly
defined for n=5 observations and likely contains all observations in the
sample. What is wrong with my R code?



Seems to me that doing  a bootstrap within a `replicate` call is not 
needed. (Use one or the other as a mechanism for replication.


Here's what I would consider to be a "bootstrap" operation for 
estimating a 95% CI on the Gamma distributed population you created:


Used a sample size of 1 rather than 10


> quantile( replicate( 1000, {median(sample(s,5))}) , .5+c(-0.475,0.475))
 2.5% 97.5%
0.1343071 0.6848352

This is using boot::boot to calculate medians of samples of size 5

> med <- function( data, indices) {
+ d <- data[indices[1:5]] # allows boot to select sample
+ return( median(d))
+ }
> res <- boot(data=s, med, 1000)

> str(res)
List of 11
 $ t0   : num 0.275
 $ t    : num [1:1000, 1] 0.501 0.152 0.222 0.11 0.444 ...
 $ R    : num 1000
 $ data : num [1:1] 0.7304 0.4062 0.1901 0.0275 0.2748 ...
 $ seed : int [1:626] 10403 431 -118115842 -603122380 -2026881868 
758139796 1148648893 -1161368223 1814605964 -1456558535 ...

 $ statistic:function (data, indices)
  ..- attr(*, "srcref")= 'srcref' int [1:8] 1 8 4 1 8 1 1 4
  .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' 


 $ sim  : chr "ordinary"
 $ call : language boot(data = s, statistic = med, R = 1000)
 $ stype    : chr "i"
 $ strata   : num [1:1] 1 1 1 1 1 1 1 1 1 1 ...
 $ weights  : num [1:1] 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 1e-04 
1e-04 1e-04 1e-04 ...

 - attr(*, "class")= chr "boot"
 - attr(*, "boot_type")= chr "boot"

> quantile( res$t , .5+c(-0.475,0.475))
 2.5% 97.5%
0.1283309 0.6821874






library(boot)

s=rgamma(n=10,shape=2,rate=5)
median(s)

N <- 100
out <- replicate(N, {
a<- sample(s,size=5)
median(a)

dat<-data.frame(a)
med<-function(d,i) {
temp<-d[i,]
median(temp)
}

   boot.out <- boot(data = dat, statistic = med, R = 1)
   boot.ci(boot.out, type = "bca")$bca[, 4:5]
})

#coverage probability
median(out[1, ] < median(s) & median(s) < out[2, ])





On Thursday, 23 December 2021 at 14:10:36 UTC+1, Rui Barradas wrote:





Hello,

The code is running very slowly because you are recreating the function
in the replicate() loop and because you are creating a data.frame also
in the loop.

And because in the bootstrap statistic function med() you are computing
the variance of yet another loop. This is probably statistically wrong
but like David says, without a problem description it's hard to say.

Also, why compute variances if they are never used?

Here is complete code executing in much less than 2:00 hours. Note that
it passes the vector a directly to med(), not a df with just one column.


library(boot)

set.seed(2021)
s <- sample(178:798, 10, replace = TRUE)
mean(s)

med <- function(d, i) {
   temp <- d[i]
   f <- mean(temp)
   g <- var(temp)
   c(Mean = f, Var = g)
}

N <- 1000
out <- replicate(N, {
   a <- sample(s, size = 5)
   boot.out <- boot(data = a, statistic = med, R = 1)
   boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])
#[1] 0.952



Hope this helps,

Rui Barradas

On 19/12/21 at 11:45, varin sacha via R-help wrote:

Dear R-experts,

Here below my R code working but really really slowly ! I need 2 hours with my 
computer to finally get an answer ! Is there a way to improve my R code to 
speed it up ? At least to win 1 hour ;=)

Many thanks


library(boot)

s<- sample(178:798, 10, replace=TRUE)
mean(s)

N <- 1000
out <- replicate(N, {
a<- sample(s,size=5)
mean(a)
dat<-data.frame(a)

med<-function(d,i) {
temp<-d[i,]
f<-mean(temp)
g<-var(replicate(50,mean(sample(temp,replace=T))))
return(c(f,g))

}

     boot.out <- boot(data = dat, statistic = med, R = 1)
     boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1,] < mean(s) & mean(s) < out[2,])


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Speed up studentized confidence intervals ?

2021-12-29 Thread varin sacha via R-help
Dear David,
Dear Rui,

Many thanks for your response. It works perfectly for the mean. Now I have
a problem with my R code for the median, because I always get a coverage
probability of 1 (100%), which is more than strange. Indeed, an interval whose
lower limit is the smallest value in the sample and whose upper limit is the
largest value has a 1/32 + 1/32 = 1/16 probability of non-coverage, so the
confidence of such an interval is 15/16 rather than 1 (100%). I therefore
suspect that the confidence interval I use for the median is not correctly
defined for n=5 observations and likely contains all observations in the
sample. What is wrong with my R code?


library(boot)

s=rgamma(n=10,shape=2,rate=5)
median(s)

N <- 100
out <- replicate(N, {
a<- sample(s,size=5)
median(a) 

dat<-data.frame(a)
med<-function(d,i) {
temp<-d[i,]
median(temp)
}

  boot.out <- boot(data = dat, statistic = med, R = 1)
  boot.ci(boot.out, type = "bca")$bca[, 4:5]
})

#coverage probability
median(out[1, ] < median(s) & median(s) < out[2, ])





On Thursday, 23 December 2021 at 14:10:36 UTC+1, Rui Barradas wrote:





Hello,

The code is running very slowly because you are recreating the function 
in the replicate() loop and because you are creating a data.frame also 
in the loop.

And because in the bootstrap statistic function med() you are computing 
the variance of yet another loop. This is probably statistically wrong 
but like David says, without a problem description it's hard to say.

Also, why compute variances if they are never used?

Here is complete code executing in much less than 2:00 hours. Note that 
it passes the vector a directly to med(), not a df with just one column.


library(boot)

set.seed(2021)
s <- sample(178:798, 10, replace = TRUE)
mean(s)

med <- function(d, i) {
  temp <- d[i]
  f <- mean(temp)
  g <- var(temp)
  c(Mean = f, Var = g)
}

N <- 1000
out <- replicate(N, {
  a <- sample(s, size = 5)
  boot.out <- boot(data = a, statistic = med, R = 1)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])
#[1] 0.952



Hope this helps,

Rui Barradas

On 19/12/21 at 11:45, varin sacha via R-help wrote:
> Dear R-experts,
> 
> Here below my R code working but really really slowly ! I need 2 hours with 
> my computer to finally get an answer ! Is there a way to improve my R code to 
> speed it up ? At least to win 1 hour ;=)
> 
> Many thanks
> 
> 
> library(boot)
> 
> s<- sample(178:798, 10, replace=TRUE)
> mean(s)
> 
> N <- 1000
> out <- replicate(N, {
> a<- sample(s,size=5)
> mean(a)
> dat<-data.frame(a)
> 
> med<-function(d,i) {
> temp<-d[i,]
> f<-mean(temp)
> g<-var(replicate(50,mean(sample(temp,replace=T))))
> return(c(f,g))
> 
> }
> 
>    boot.out <- boot(data = dat, statistic = med, R = 1)
>    boot.ci(boot.out, type = "stud")$stud[, 4:5]
> })
> mean(out[1,] < mean(s) & mean(s) < out[2,])
> 
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up studentized confidence intervals ?

2021-12-23 Thread Rui Barradas

Hello,

The code is running very slowly because you are recreating the function 
in the replicate() loop and because you are creating a data.frame also 
in the loop.


And because in the bootstrap statistic function med() you are computing 
the variance of yet another loop. This is probably statistically wrong 
but like David says, without a problem description it's hard to say.


Also, why compute variances if they are never used?

Here is complete code executing in much less than 2:00 hours. Note that 
it passes the vector a directly to med(), not a df with just one column.



library(boot)

set.seed(2021)
s <- sample(178:798, 10, replace = TRUE)
mean(s)

med <- function(d, i) {
  temp <- d[i]
  f <- mean(temp)
  g <- var(temp)
  c(Mean = f, Var = g)
}

N <- 1000
out <- replicate(N, {
  a <- sample(s, size = 5)
  boot.out <- boot(data = a, statistic = med, R = 1)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1, ] < mean(s) & mean(s) < out[2, ])
#[1] 0.952



Hope this helps,

Rui Barradas

On 19/12/21 at 11:45, varin sacha via R-help wrote:

Dear R-experts,

Here below my R code working but really really slowly ! I need 2 hours with my 
computer to finally get an answer ! Is there a way to improve my R code to 
speed it up ? At least to win 1 hour ;=)

Many thanks


library(boot)

s<- sample(178:798, 10, replace=TRUE)
mean(s)

N <- 1000
out <- replicate(N, {
a<- sample(s,size=5)
mean(a)
dat<-data.frame(a)

med<-function(d,i) {
temp<-d[i,]
f<-mean(temp)
g<-var(replicate(50,mean(sample(temp,replace=T))))
return(c(f,g))

}

   boot.out <- boot(data = dat, statistic = med, R = 1)
   boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1,] < mean(s) & mean(s) < out[2,])


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up studentized confidence intervals ?

2021-12-22 Thread David Winsemius
I’m wondering if this is an X-Y problem. (A request to do X when the real 
problem should be doing Y. ) You haven’t explained the goals in natural or 
mathematical language which is leaving me to wonder why you are doing either 
sampling or replication (much less doing both within each iteration in the the 
function given to boot. )

— 
David

Sent from my iPhone

> On Dec 19, 2021, at 3:50 AM, varin sacha via R-help  
> wrote:
> 
> Dear R-experts,
> 
> Here below my R code working but really really slowly ! I need 2 hours with 
> my computer to finally get an answer ! Is there a way to improve my R code to 
> speed it up ? At least to win 1 hour ;=)
> 
> Many thanks
> 
> 
> library(boot)
> 
> s<- sample(178:798, 10, replace=TRUE)
> mean(s)
> 
> N <- 1000
> out <- replicate(N, {
> a<- sample(s,size=5)
> mean(a)
> dat<-data.frame(a)
> 
> med<-function(d,i) {
> temp<-d[i,]
> f<-mean(temp)
> g<-var(replicate(50,mean(sample(temp,replace=T))))
> return(c(f,g))
> 
> }
> 
>   boot.out <- boot(data = dat, statistic = med, R = 1)
>   boot.ci(boot.out, type = "stud")$stud[, 4:5]
> })
> mean(out[1,] < mean(s) & mean(s) < out[2,]) 
> 
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speed up studentized confidence intervals ?

2021-12-19 Thread varin sacha via R-help
Dear R-experts,

Here below my R code working but really really slowly ! I need 2 hours with my 
computer to finally get an answer ! Is there a way to improve my R code to 
speed it up ? At least to win 1 hour ;=)

Many thanks


library(boot)

s<- sample(178:798, 10, replace=TRUE)
mean(s)

N <- 1000
out <- replicate(N, {
a<- sample(s,size=5)
mean(a)
dat<-data.frame(a)

med<-function(d,i) {
temp<-d[i,]
f<-mean(temp)
g<-var(replicate(50,mean(sample(temp,replace=T))))
return(c(f,g))

}

  boot.out <- boot(data = dat, statistic = med, R = 1)
  boot.ci(boot.out, type = "stud")$stud[, 4:5]
})
mean(out[1,] < mean(s) & mean(s) < out[2,]) 


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] work on R speed?

2019-06-13 Thread Jeff Newmiller
Your question seems like an information-free zone. "Quick" is an opinion unless 
you set the boundaries of your question much more precisely. The Posting Guide 
strongly recommends providing a reproducible example of what you want to 
discuss. In this case I would suggest that you use the microbenchmark package 
to quantify "quick" or "not quick".

In my experience, the most significant factors affecting speed are algorithms 
and features. You may be comparing a general purpose complete analysis function 
in one environment with a specific part of that analysis in another environment.
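
For instance, a minimal microbenchmark might look like the sketch below (the
operation being timed is just a placeholder for whatever you actually care about):

library(microbenchmark)

x <- rnorm(1e6)

# Two implementations of the same quantity, timed head to head;
# swap these for the real operations you want to compare.
microbenchmark(
  loop       = {s <- 0; for (v in x) s <- s + v},
  vectorised = sum(x),
  times = 50
)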

On June 12, 2019 7:36:13 AM PDT, "Kai Lähteenmäki" 
 wrote:
>I tested Microsoft's linear algebra etc "racer", works well
>Also simmer seems to be very quick.
>How about other developments in making R quicker?
>reg Kai
>
>   [[alternative HTML version deleted]]
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] work on R speed?

2019-06-13 Thread Kai Lähteenmäki
I tested Microsoft's linear algebra etc "racer", works well
Also simmer seems to be very quick.
How about other developments in making R quicker?
reg Kai

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed of RCppEigen Cholesky decomposition on sparse matrix

2018-11-21 Thread Jeff Newmiller
I believe you have the wrong list. (Read the Posting Guide... you seem to have 
R under control.)  Try Rcpp-devel.

FWIW You probably need to spend some time with a C++ profiler... any language 
can be unintentionally mis-used, and you first need to identify whether your 
calling code is inefficiently handling memory or invoking setup code 
repetitively before blaming BLAS. A reproducible example will probably help 
when you ask at Rcpp-devel.

On November 21, 2018 10:34:33 AM PST, "Hoffman, Gabriel" 
 wrote:
>I am developing a statistical model and I have a prototype working in R
>code.  I make extensive use of sparse matrices, so the R code is pretty
>fast, but hoped that using RCppEigen to evaluate the log-likelihood
>function could avoid a lot of memory copying and be substantially
>faster.  However, in a simple  example I am seeing that RCppEigen is
>3-5x slower than standard R code for cholesky decomposition of a sparse
>matrix.  This is the case on R 3.5.1 using RcppEigen_0.3.3.4.0 on both
>OS X and CentOS 6.9.
>
>Since this simple operation is so much slower it doesn't seem like
>using RCppEigen is worth it in this case.  Is this an issue with BLAS,
>some libraries or compiler options, or is R code really the fastest
>option?
>
>Here is my example:
>
>library(Matrix)
>library(inline)
>
># construct sparse matrix
>#
>
># construct a matrix C that is N x X with S total entries
>N = 1
>S = 100
>i = sample(1:1000, S, replace=TRUE)
>j = sample(1:1000, S, replace=TRUE)
>idx = i >= j
>values = runif(S, 0, .3)
>X = sparseMatrix(i=i, j=j, x = values, symmetric=FALSE )
>
>C = as(crossprod(X), "dgCMatrix")
>
># check sparsity fraction
>S / N^2
>
># define RCppEigen code
>CholeskyCppSparse<-'
>using Rcpp::as;
>using Eigen::Map;
>using Eigen::SparseMatrix;
>using Eigen::MappedSparseMatrix;
>using Eigen::SimplicialLLT;
>
>// get data into RcppEigen
>const MappedSparseMatrix<double> Sigma(as<MappedSparseMatrix<double> >(Sigma_in));
>
>// compute Cholesky
>typedef SimplicialLLT<SparseMatrix<double> > SpChol;
>const SpChol Ch(Sigma);
>'
>
>CholSparse <- cxxfunction(signature(Sigma_in = "dgCMatrix"),
>CholeskyCppSparse, plugin = "RcppEigen")
>
># compare times
>system.time(replicate(10, chol( C )))
># output:
>#   user  system elapsed
>#  0.341   0.014   0.355
>
>system.time(replicate(10, CholSparse( C )))
># output:
>#   user  system elapsed
># 1.639   0.046   1.687
>
>> sessionInfo()
>R version 3.5.1 (2018-07-02)
>Platform: x86_64-apple-darwin15.6.0 (64-bit)
>Running under: macOS  10.14
>
>Matrix products: default
>BLAS:
>/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
>LAPACK:
>/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
>
>locale:
>[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
>attached base packages:
>[1] stats graphics  grDevices datasets  utils methods   base
>
>other attached packages:
>[1] inline_0.3.15 Matrix_1.2-15
>
>loaded via a namespace (and not attached):
>[1] compiler_3.5.1  RcppEigen_0.3.3.4.0 Rcpp_1.0.0
>[4] grid_3.5.1  lattice_0.20-38
>
>Changing the size of the matrix and the number of entries does not
>change the relative times
>
>Thanks,
>- Gabriel
>
>
>
>
>   [[alternative HTML version deleted]]

-- 
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speed of RCppEigen Cholesky decomposition on sparse matrix

2018-11-21 Thread Hoffman, Gabriel
I am developing a statistical model and I have a prototype working in R code.  
I make extensive use of sparse matrices, so the R code is pretty fast, but 
hoped that using RCppEigen to evaluate the log-likelihood function could avoid 
a lot of memory copying and be substantially faster.  However, in a simple  
example I am seeing that RCppEigen is 3-5x slower than standard R code for 
cholesky decomposition of a sparse matrix.  This is the case on R 3.5.1 using 
RcppEigen_0.3.3.4.0 on both OS X and CentOS 6.9.

Since this simple operation is so much slower it doesn't seem like using 
RCppEigen is worth it in this case.  Is this an issue with BLAS, some libraries 
or compiler options, or is R code really the fastest option?

Here is my example:

library(Matrix)
library(inline)

# construct sparse matrix
#

# construct a matrix C that is N x X with S total entries
N = 1
S = 100
i = sample(1:1000, S, replace=TRUE)
j = sample(1:1000, S, replace=TRUE)
idx = i >= j
values = runif(S, 0, .3)
X = sparseMatrix(i=i, j=j, x = values, symmetric=FALSE )

C = as(crossprod(X), "dgCMatrix")

# check sparsity fraction
S / N^2

# define RCppEigen code
CholeskyCppSparse<-'
using Rcpp::as;
using Eigen::Map;
using Eigen::SparseMatrix;
using Eigen::MappedSparseMatrix;
using Eigen::SimplicialLLT;

// get data into RcppEigen
const MappedSparseMatrix<double> Sigma(as<MappedSparseMatrix<double> >(Sigma_in));

// compute Cholesky
typedef SimplicialLLT<SparseMatrix<double> > SpChol;
const SpChol Ch(Sigma);
'

CholSparse <- cxxfunction(signature(Sigma_in = "dgCMatrix"), CholeskyCppSparse, 
plugin = "RcppEigen")

# compare times
system.time(replicate(10, chol( C )))
# output:
#   user  system elapsed
#  0.341   0.014   0.355

system.time(replicate(10, CholSparse( C )))
# output:
#   user  system elapsed
# 1.639   0.046   1.687

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14

Matrix products: default
BLAS: 
/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: 
/Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices datasets  utils methods   base

other attached packages:
[1] inline_0.3.15 Matrix_1.2-15

loaded via a namespace (and not attached):
[1] compiler_3.5.1  RcppEigen_0.3.3.4.0 Rcpp_1.0.0
[4] grid_3.5.1  lattice_0.20-38

Changing the size of the matrix and the number of entries does not change the 
relative times

Thanks,
- Gabriel




[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed issue in simulating a stochastic process

2014-11-06 Thread Matteo Richiardi
I wish to simulate the following stochastic process, for i = 1...N
individuals and t=1...T periods:

y_{i,t} = y_0 + lambda Ey_{t-1} + epsilon_{i,t}

where Ey_{t-1} is the average of y over the N individuals computed at time
t-1.

My solution (below) works but is incredibly slow. Is there a faster but
still clear and readable alternative?

Thanks a lot. Matteo

rm(list=ls())
library(plyr)
y0 = 0
lambda = 0.1
N = 20
T = 100
m_e = 0
sd_e = 1

# construct the data frame and initialize y
D = data.frame(
  id = rep(1:N,T),
  t = rep(1:T, each = N),
  y = rep(y0,N*T)
)

# update y
for(t in 2:T){
  ybar.L1 = mean(D[D$t==t-1, "y"])
  for(i in 1:N){
    epsilon = rnorm(1,mean=m_e,sd=sd_e)
    D[D$id==i & D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
  }
}

ybar <- ddply(D, ~t, summarise, mean = mean(y))

plot(ybar, col = "blue", type = "l")

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue in simulating a stochastic process

2014-11-06 Thread Thomas Adams
Matteo,

I tried your example code using R 3.1.1 on an iMac (24-inch, Early 2009), 3.06
GHz Intel Core 2 Duo, 8 GB 1333 MHz DDR3, NVIDIA GeForce GT 130 512 MB
running Mac OS X 10.10 (Yosemite).

After entering your code, the elapsed time from the time I hit return to
when the graphics appeared was about 2 seconds — is this about what you are
seeing?

Regards,
Tom



On Thu, Nov 6, 2014 at 7:47 AM, Matteo Richiardi matteo.richia...@gmail.com
 wrote:

 I wish to simulate the following stochastic process, for i = 1...N
 individuals and t=1...T periods:

 y_{i,t} = y_0 + lambda Ey_{t-1} + epsilon_{i,t}

 where Ey_{t-1} is the average of y over the N individuals computed at time
 t-1.

 My solution (below) works but is incredibly slow. Is there a faster but
 still clear and readable alternative?

 Thanks a lot. Matteo

 rm(list=ls())
 library(plyr)
 y0 = 0
 lambda = 0.1
 N = 20
 T = 100
 m_e = 0
 sd_e = 1

 # construct the data frame and initialize y
 D = data.frame(
   id = rep(1:N,T),
   t = rep(1:T, each = N),
   y = rep(y0,N*T)
 )

 # update y
 for(t in 2:T){
   ybar.L1 = mean(D[D$t==t-1,y])
   for(i in 1:N){
 epsilon = rnorm(1,mean=m_e,sd=sd_e)
 D[D$id==i  D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
   }
 }

 ybar - ddply(D,~t,summarise,mean=mean(y))

 plot(ybar, col = blue, type = l)

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue in simulating a stochastic process

2014-11-06 Thread Thomas Adams
Matteo,

Ah — OK, N=20, I did not catch that. You have nested for loops, which R is
known to be exceedingly slow at handling — if you can reorganize the code
to eliminate the loops, your performance will increase significantly.
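
For instance, a rough sketch of that idea, reusing the names from your code and
vectorising only the inner loop over i:

y0 <- 0; lambda <- 0.1; N <- 20; T <- 100; m_e <- 0; sd_e <- 1
D <- data.frame(id = rep(1:N, T), t = rep(1:T, each = N), y = y0)

for (t in 2:T) {
  ybar.L1 <- mean(D$y[D$t == t - 1])           # lagged cross-sectional mean
  epsilon <- rnorm(N, mean = m_e, sd = sd_e)   # draw all N disturbances at once
  D$y[D$t == t] <- lambda*y0 + (1 - lambda)*ybar.L1 + epsilon
}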

Tom

On Thu, Nov 6, 2014 at 7:47 AM, Matteo Richiardi matteo.richia...@gmail.com
 wrote:

 I wish to simulate the following stochastic process, for i = 1...N
 individuals and t=1...T periods:

 y_{i,t} = y_0 + lambda Ey_{t-1} + epsilon_{i,t}

 where Ey_{t-1} is the average of y over the N individuals computed at time
 t-1.

 My solution (below) works but is incredibly slow. Is there a faster but
 still clear and readable alternative?

 Thanks a lot. Matteo

 rm(list=ls())
 library(plyr)
 y0 = 0
 lambda = 0.1
 N = 20
 T = 100
 m_e = 0
 sd_e = 1

 # construct the data frame and initialize y
 D = data.frame(
   id = rep(1:N,T),
   t = rep(1:T, each = N),
   y = rep(y0,N*T)
 )

 # update y
 for(t in 2:T){
   ybar.L1 = mean(D[D$t==t-1,y])
   for(i in 1:N){
 epsilon = rnorm(1,mean=m_e,sd=sd_e)
 D[D$id==i  D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
   }
 }

 ybar - ddply(D,~t,summarise,mean=mean(y))

 plot(ybar, col = blue, type = l)

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue in simulating a stochastic process

2014-11-06 Thread William Dunlap
I find that representing the simulated data as a T row by N column matrix
allows for a clearer and faster simulation function.  E.g., compare the
output of the following two functions, the first of which uses your code
and the second a matrix representation (which I convert to a data.frame at
the end so I can compare outputs easily).  I timed both of them for T=10^3
times and N=50 individuals; both gave the same results and f1 was roughly
10,000 times faster than f0:
   set.seed(1); t0 <- system.time(s0 <- f0(N=50,T=1000))
   set.seed(1); t1 <- system.time(s1 <- f1(N=50,T=1000))
   rbind(t0, t1)
      user.self sys.self elapsed user.child sys.child
   t0    436.87     0.11  438.48         NA        NA
   t1      0.04     0.00    0.04         NA        NA
   all.equal(s0, s1)
   [1] TRUE

The functions are:

f0 <- function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0)
{
  # construct the data frame and initialize y
  D <- data.frame(
    id = rep(1:N,T),
    t = rep(1:T, each = N),
    y = rep(y0,N*T)
  )

  # update y
  for(t in 2:T){
    ybar.L1 = mean(D[D$t==t-1, "y"])
    for(i in 1:N){
      epsilon = rnorm(1,mean=m_e,sd=sd_e)
      D[D$id==i & D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
    }
  }
  D
}

f1 <- function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0)
{
  # same process simulated using a matrix representation
  #   The T rows are times, the N columns are individuals
  M <- matrix(y0, nrow=T, ncol=N)
  if (T > 1) for(t in 2:T) {
    ybar.L1 <- mean(M[t-1L,])
    epsilon <- rnorm(N, mean=m_e, sd=sd_e)
    M[t,] <- lambda * y0 + (1-lambda)*ybar.L1 + epsilon
  }
  # convert to the data.frame representation that f0 uses
  tM <- t(M)
  data.frame(id = as.vector(row(tM)), t = as.vector(col(tM)), y = as.vector(tM))
}



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Nov 6, 2014 at 6:47 AM, Matteo Richiardi matteo.richia...@gmail.com
 wrote:

 I wish to simulate the following stochastic process, for i = 1...N
 individuals and t=1...T periods:

 y_{i,t} = y_0 + lambda Ey_{t-1} + epsilon_{i,t}

 where Ey_{t-1} is the average of y over the N individuals computed at time
 t-1.

 My solution (below) works but is incredibly slow. Is there a faster but
 still clear and readable alternative?

 Thanks a lot. Matteo

 rm(list=ls())
 library(plyr)
 y0 = 0
 lambda = 0.1
 N = 20
 T = 100
 m_e = 0
 sd_e = 1

 # construct the data frame and initialize y
 D = data.frame(
   id = rep(1:N,T),
   t = rep(1:T, each = N),
   y = rep(y0,N*T)
 )

 # update y
 for(t in 2:T){
   ybar.L1 = mean(D[D$t==t-1,y])
   for(i in 1:N){
 epsilon = rnorm(1,mean=m_e,sd=sd_e)
 D[D$id==i  D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
   }
 }

 ybar - ddply(D,~t,summarise,mean=mean(y))

 plot(ybar, col = blue, type = l)

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue in simulating a stochastic process

2014-11-06 Thread William Dunlap
Loops are not slow, but your code did a lot of unneeded operations in each
loop.
E.g., you computed
    D$id==i & D$t==t
for each row of D.  That involves 2*nrow(D) equality tests for each of the
nrow(D) rows, i.e., it is quadratic in N*T.

Then you did a data.frame replacement operation
    D[k,]$y <- newValue
where k is D$id==i & D$t==t.  This extracts the k'th row of D, then extracts
the 1-row 'y' column from it, replaces it with the new value, then puts that
row back into D.  If you must use a data.frame, the equivalent
    D$y[k] <- newValue
is probably much faster (data.frames are lists of columns, so replacing a
column is fast).

Using a matrix to organize things is less flexible, but faster because you
don't have to search
when you want to find the element for a given id and time - you just do a
little arithmetic to
get the offset from the start of the matrix.
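
A small sketch of the data.frame point, with arbitrary sizes (the names id, t
and y mirror the original code; timings will vary by machine):

D <- data.frame(id = rep(1:100, times = 100), t = rep(1:100, each = 100), y = 0)

system.time(
  for (tt in 1:100) for (i in 1:100) {
    k <- D$id == i & D$t == tt   # scans every row on every iteration
    D[k, ]$y <- 1                # extracts the row, replaces, puts it back
  }
)

system.time(
  for (tt in 1:100) for (i in 1:100) {
    k <- D$id == i & D$t == tt
    D$y[k] <- 1                  # replaces directly inside the y column
  }
)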


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Nov 6, 2014 at 2:05 PM, Matteo Richiardi matteo.richia...@gmail.com
 wrote:

 Hi William,
 that's super. Thanks a lot. I knew that R is slow with loops, but did not
 imagine so slow! B.t.w., what's the reason?
 Final question: in your code you have mean(M[t-1L,]): what is the 'L'
 for? I removed it and apparently the code produces the same output...

 Thanks again,
 Matteo

 On 6 November 2014 18:46, William Dunlap wdun...@tibco.com wrote:

 I find that representing the simulated data as a T row by N column matrix
 allows for a clearer and faster simulation function.  E.g., compare the
 output of the following two functions, the first of which uses your code
 and the second a matrix representation (which I convert to a data.frame at
 the end so I can compare outputs easily).  I timed both of them for T=10^3
 times and N=50 individuals; both gave the same results and f1 was 1
 times faster than f0:
set.seed(1); t0 - system.time(s0 - f0(N=50,T=1000))
set.seed(1); t1 - system.time(s1 - f1(N=50,T=1000))
rbind(t0, t1)
  user.self sys.self elapsed user.child sys.child
   t0436.87 0.11  438.48 NANA
   t1  0.04 0.000.04 NANA
all.equal(s0, s1)
   [1] TRUE

 The functions are:

 f0 - function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0)
 {
   # construct the data frame and initialize y
   D - data.frame(
 id = rep(1:N,T),
 t = rep(1:T, each = N),
 y = rep(y0,N*T)
   )

   # update y
   for(t in 2:T){
 ybar.L1 = mean(D[D$t==t-1,y])
 for(i in 1:N){
   epsilon = rnorm(1,mean=m_e,sd=sd_e)
   D[D$id==i  D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
 }
   }
   D
 }

 f1 - function(N = 20, T = 100, lambda = 0.1, m_e = 0, sd_e = 1, y0 = 0)
 {
   # same process simulated using a matrix representation
   #   The T rows are times, the N columns are individuals
   M - matrix(y0, nrow=T, ncol=N)
   if (T  1) for(t in 2:T) {
 ybar.L1 - mean(M[t-1L,])
 epsilon - rnorm(N, mean=m_e, sd=sd_e)
 M[t,] - lambda * y0 + (1-lambda)*ybar.L1 + epsilon
   }
   # convert to the data.frame representation that f0 uses
   tM - t(M)
   data.frame(id = as.vector(row(tM)), t = as.vector(col(tM)), y =
 as.vector(tM))
 }



 Bill Dunlap
 TIBCO Software
 wdunlap tibco.com

 On Thu, Nov 6, 2014 at 6:47 AM, Matteo Richiardi 
 matteo.richia...@gmail.com wrote:

 I wish to simulate the following stochastic process, for i = 1...N
 individuals and t=1...T periods:

 y_{i,t} = y_0 + lambda Ey_{t-1} + epsilon_{i,t}

 where Ey_{t-1} is the average of y over the N individuals computed at
 time
 t-1.

 My solution (below) works but is incredibly slow. Is there a faster but
 still clear and readable alternative?

 Thanks a lot. Matteo

 rm(list=ls())
 library(plyr)
 y0 = 0
 lambda = 0.1
 N = 20
 T = 100
 m_e = 0
 sd_e = 1

 # construct the data frame and initialize y
 D = data.frame(
   id = rep(1:N,T),
   t = rep(1:T, each = N),
   y = rep(y0,N*T)
 )

 # update y
 for(t in 2:T){
   ybar.L1 = mean(D[D$t==t-1,y])
   for(i in 1:N){
 epsilon = rnorm(1,mean=m_e,sd=sd_e)
 D[D$id==i  D$t==t,]$y = lambda*y0+(1-lambda)*ybar.L1+epsilon
   }
 }

 ybar - ddply(D,~t,summarise,mean=mean(y))

 plot(ybar, col = blue, type = l)

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.





[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue in simulating a stochastic process

2014-11-06 Thread Rolf Turner


SNIP


On Thu, Nov 6, 2014 at 2:05 PM, Matteo Richiardi matteo.richia...@gmail.com

wrote:


SNIP


Final question: in your code you have mean(M[t-1L,]): what is the 'L'
for? I removed it at apparently the code produces the same output...


SNIP

The constant 1L is stored as an integer; the constant 1 is stored as 
double precision.  This sometimes makes no difference and sometimes 
makes a huge difference (especially in the context of numerical 
comparisons).  If something is supposed to be an integer it is safer to 
use the L form.


See ?NumericConstants.
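
A quick illustration:

typeof(1L)        # "integer"
typeof(1)         # "double"
identical(1L, 1)  # FALSE: different storage modes
1L == 1           # TRUE:  '==' compares values after coercion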

cheers,

Rolf Turner

--
Rolf Turner
Technical Editor ANZJS

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R speed test - for processor and for RAM size

2014-04-10 Thread Dimitri Liakhovitski
Hello!

I am sorry if my question sounds naive; it's because I am not a computer
scientist.
I understand that two factors impact a PC's speed: the processor and
(indirectly) the RAM size.

I would like to run a speed test in R (under Windows). I found lots of
different code snippets testing the speed.
However, I'd like to get some hints or find a code that (loosely) has:

(a) Aspect A that makes the code run faster if the processing speed of the
PC is higher.
(b) Aspect B that makes the code run faster if your PC's RAM is larger

Even if the 2 Aspects are not 100% independent, it would still be OK. I am
just trying to isolate the impact of those 2 things (processor and RAM) on
speed of this code.
This way our IT people could change some parameter in the code that impacts
Aspect A or another parameter that impacts Aspect B.
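
Something roughly like this sketch is what I mean, where the two sizes are the
parameters to change:

# Aspect A: mostly processor-bound -- dense matrix multiplication.
n_cpu <- 1500                        # increase to stress the CPU more
A <- matrix(rnorm(n_cpu^2), n_cpu)
print(system.time(A %*% A))

# Aspect B: mostly memory-bound -- allocating and copying a large vector.
n_mem <- 2e7                         # increase to stress RAM more
x <- numeric(n_mem)
print(system.time(y <- x + 1))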

Thanks a lot for your hints!


-- 
Dimitri Liakhovitski

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread SPi
Good idea! 

I'm trying your approach right now, but I am wondering if using str_split
(package: 'stringr') or strsplit is the right way to go in terms of speed? I
ran str_split over the text column of the data frame and it's processing for
2 hours now..? 

I did: 
splittedStrings <- str_split(dataframe$text, " ")

The $text column already contains cleaned text, so no double blanks etc or
unnecessary symbols. Just full words.





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread SPi
I'll answer myself:
using strsplit with fixed=TRUE took about 2 minutes!
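I.e., roughly this call (a sketch, on the same dataframe$text column as before):

# literal single-space split; fixed = TRUE bypasses the regex engine
splittedStrings <- strsplit(dataframe$text, " ", fixed = TRUE)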




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-06 Thread Carl Witthoft
If you could, please identify which responder's idea you used, as well as the
strsplit-related code you ended up with.
That may help someone who browses the mail archives in the future.

Carl


SPi wrote
 I'll answer myself:
 using strsplit with fixed=true took like 2minutes!






__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Jeff Newmiller
It is not reproducible [1] because I cannot run your (representative) example. 
The type of regex pattern,  token, and even the character of the data you are 
searching can affect possible optimizations. Note that a non-memory-resident 
tool such as sed or perl may be an appropriate tool for a problem like this.

[1] 
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
---
Jeff NewmillerThe .   .  Go Live...
DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
  Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
/Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
--- 
Sent from my phone. Please excuse my brevity.

Simon Pickert simon.pick...@t-online.de wrote:
How’s that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame= 4million observations
3. A bunch of gsubs in a row (  gsub(patternvector,
"[token]", dataframe$text_column)  )
4. General question: How to speed up string operations on ‘large' data
sets?


Please let me know what more information you need in order to reproduce
this example? 
It’s more a general type of question, while I think the description
above gives you a specific picture of what I’m doing right now.






General question: 
Am 05.11.2013 um 06:59 schrieb Jeff Newmiller
jdnew...@dcn.davis.ca.us:

 Example not reproducible. Communication fail. Please refer to Posting
Guide.

---
 Jeff NewmillerThe .   .  Go
Live...
 DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live
Go...
  Live:   OO#.. Dead: OO#.. 
Playing
 Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
 /Software/Embedded Controllers)   .OO#.   .OO#. 
rocks...1k

---

 Sent from my phone. Please excuse my brevity.
 
 Simon Pickert simon.pick...@t-online.de wrote:
 Hi R’lers,
 
 I’m running into speeding issues, performing a bunch of 
 
 „gsub(patternvector, [token],dataframe$text_column)
 
 on a data frame containing 4millionentries.
 
 (The “patternvectors“ contain up to 500 elements) 
 
 Is there any better/faster way than performing like 20 gsub commands
in
 a row?
 
 
 Thanks!
 Simon
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Simon Pickert
How’s that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame= 4million observations
3. A bunch of gsubs in a row (  gsub(patternvector,
"[token]", dataframe$text_column)  )
4. General question: How to speed up string operations on ‘large' data sets?


Please let me know what more information you need in order to reproduce this 
example? 
It’s more a general type of question, while I think the description above gives 
you a specific picture of what I’m doing right now.






General question: 
Am 05.11.2013 um 06:59 schrieb Jeff Newmiller jdnew...@dcn.davis.ca.us:

 Example not reproducible. Communication fail. Please refer to Posting Guide.
 ---
 Jeff NewmillerThe .   .  Go Live...
 DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
  Live:   OO#.. Dead: OO#..  Playing
 Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
 /Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
 --- 
 Sent from my phone. Please excuse my brevity.
 
 Simon Pickert simon.pick...@t-online.de wrote:
 Hi R’lers,
 
 I’m running into speeding issues, performing a bunch of 
 
 „gsub(patternvector, [token],dataframe$text_column)
 
 on a data frame containing 4millionentries.
 
 (The “patternvectors“ contain up to 500 elements) 
 
 Is there any better/faster way than performing like 20 gsub commands in
 a row?
 
 
 Thanks!
 Simon
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Jim Holtman
what is missing is any idea of what the 'patterns' are that you are searching 
for.  Regular expressions are very sensitive to how you specify the pattern.  
you indicated that you have up to 500 elements in the pattern, so what does it 
look like?  alternation and backtracking can be very expensive.  so a lot more 
specificity is required.  there are whole books written on how pattern matching 
works and what is hard and what is easy.  this is true for wherever regular 
expressions are used, not just in R.  also some idea of what the timing is; are 
you talking about 1-10-100 seconds/minutes/hours.
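
As a made-up illustration of how much the pattern specification can matter
(none of this is your real data; the tokens and sizes are invented):

set.seed(1)
tokens <- sprintf("TK%03d", 1:500)                       # 500 literal 'ticker' patterns
txt <- paste("text about", sample(tokens, 2e4, TRUE), "and more text")

# (a) one 500-branch alternation handed to the regex engine
pat <- paste(tokens, collapse = "|")
t_alt <- system.time(r1 <- gsub(pat, "[token]", txt))

# (b) the same replacements done as 500 literal (fixed = TRUE) passes
t_fix <- system.time({
  r2 <- txt
  for (tk in tokens) r2 <- gsub(tk, "[token]", r2, fixed = TRUE)
})

rbind(alternation = t_alt, fixed_loop = t_fix)
identical(r1, r2)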

Sent from my iPad

On Nov 5, 2013, at 3:13, Simon Pickert simon.pick...@t-online.de wrote:

 How’s that not reproducible?
 
 1. Data frame, one column with text strings
 2. Size of data frame= 4million observations
 3. A bunch of gsubs in a row (  gsub(patternvector, 
 “[token]“,dataframe$text_column)  )
 4. General question: How to speed up string operations on ‘large' data sets?
 
 
 Please let me know what more information you need in order to reproduce this 
 example? 
 It’s more a general type of question, while I think the description above 
 gives you a specific picture of what I’m doing right now.
 
 
 
 
 
 
 General question: 
 Am 05.11.2013 um 06:59 schrieb Jeff Newmiller jdnew...@dcn.davis.ca.us:
 
 Example not reproducible. Communication fail. Please refer to Posting Guide.
 ---
 Jeff NewmillerThe .   .  Go Live...
 DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
 Live:   OO#.. Dead: OO#..  Playing
 Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
 /Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
 --- 
 Sent from my phone. Please excuse my brevity.
 
 Simon Pickert simon.pick...@t-online.de wrote:
 Hi R’lers,
 
 I’m running into speeding issues, performing a bunch of 
 
 „gsub(patternvector, [token],dataframe$text_column)
 
 on a data frame containing 4millionentries.
 
 (The “patternvectors“ contain up to 500 elements) 
 
 Is there any better/faster way than performing like 20 gsub commands in
 a row?
 
 
 Thanks!
 Simon
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Prof Brian Ripley

But note too what the help says:

Performance considerations:

 If you are doing a lot of regular expression matching, including
 on very long strings, you will want to consider the options used.
 Generally PCRE will be faster than the default regular expression
 engine, and ‘fixed = TRUE’ faster still (especially when each
 pattern is matched only a few times).

(and there is more).  I don't see perl=TRUE here.
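
E.g., the options are just arguments to gsub(), so they are easy to try on a
small sample first; the pattern and text below are toy placeholders, not the
poster's real data:

x <- rep("GOOGL announced new partnership www.url.com stock price is up ", 1e4)

system.time(gsub("https?://.*\\s|www.*\\s", "[url] ", x))               # default engine
system.time(gsub("https?://.*\\s|www.*\\s", "[url] ", x, perl = TRUE))  # PCRE
system.time(gsub("GOOGL", "[ticker]", x, fixed = TRUE))                 # literal match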

On 05/11/2013 09:06, Jim Holtman wrote:

what is missing is any idea of what the 'patterns' are that you are searching 
for.  Regular expressions are very sensitive to how you specify the pattern.  
You indicated that you have up to 500 elements in the pattern, so what does it 
look like?  Alternation and backtracking can be very expensive, so a lot more 
specificity is required.  There are whole books written on how pattern matching 
works and what is hard and what is easy; this is true wherever regular 
expressions are used, not just in R.  Also give some idea of the timing: are 
you talking about 1, 10 or 100 seconds, minutes, or hours?

Sent from my iPad

On Nov 5, 2013, at 3:13, Simon Pickert simon.pick...@t-online.de wrote:


How’s that not reproducible?

1. Data frame, one column with text strings
2. Size of data frame = 4 million observations
3. A bunch of gsubs in a row (  gsub(patternvector, "[token]", dataframe$text_column)  )
4. General question: how to speed up string operations on 'large' data sets?


Please let me know what more information you need in order to reproduce this 
example.
It's more a general type of question, although I think the description above gives 
you a specific picture of what I'm doing right now.






General question:
Am 05.11.2013 um 06:59 schrieb Jeff Newmiller jdnew...@dcn.davis.ca.us:


Example not reproducible. Communication fail. Please refer to Posting Guide.
---
Jeff NewmillerThe .   .  Go Live...
DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
 Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
/Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
---
Sent from my phone. Please excuse my brevity.

Simon Pickert simon.pick...@t-online.de wrote:

Hi R’lers,

I’m running into speeding issues, performing a bunch of

„gsub(patternvector, [token],dataframe$text_column)

on a data frame containing 4millionentries.

(The “patternvectors“ contain up to 500 elements)

Is there any better/faster way than performing like 20 gsub commands in
a row?


Thanks!
Simon

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Simon Pickert
Thanks everybody! Now I understand the need for more details:

The patterns for the gsubs are of different kinds. First, I have character 
strings I need to replace: around 5000 stock ticker symbols 
(e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors. 
Second, I have four vectors with regular expressions, all similar to this one: 
replace_url <- c("https?://.*\\s|www.*\\s") 

The text strings I perform the gsub commands on look like this (no string is 
longer than 200 characters):

'GOOGL announced new partnership www.url.com. Stock price is up +5%'

After performing several gsubs in a row, like

gsub(replace_url, "[url]", dataframe$text_column) 
gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column) 
etc. 

this string will look like this:

'[sp500_ticker] announced new partnership [url]. Stock price is up 
[positive_percentage]'


The dataset contains 4 million entries. The code works, but I cancelled the 
process after 1 day (my whole system was blocked while R was running). 
Performing the code on a smaller chunk of data (1 million) took about 12 hrs. 
As far as I can tell, replacing the ticker symbols takes the longest, while the 
regular expressions went quite fast.

Thanks!
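
For illustration, one way the ticker replacement might be collapsed into a single
pass -- a sketch only, assuming the ~5000 symbols can be combined into one
character vector (tickers, dataframe and text_column are placeholder names):

tickers <- c("AAPL", "EBAY", "GOOGL")   # stands in for the full symbol list
ticker_pattern <- paste0("\\b(", paste(tickers, collapse = "|"), ")\\b")
dataframe$text_column <- gsub(ticker_pattern, "[sp500_ticker]",
                              dataframe$text_column, perl = TRUE)

Whether one large alternation beats ten smaller ones would have to be timed; the
point is to make a single pass over the 4 million strings rather than one pass
per ticker vector.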



Am 05.11.2013 um 11:31 schrieb Prof Brian Ripley rip...@stats.ox.ac.uk:

 But note too what the help says:
 
 Performance considerations:
 
 If you are doing a lot of regular expression matching, including
 on very long strings, you will want to consider the options used.
 Generally PCRE will be faster than the default regular expression
 engine, and ‘fixed = TRUE’ faster still (especially when each
 pattern is matched only a few times).
 
 (and there is more).  I don't see perl=TRUE here.
 
 On 05/11/2013 09:06, Jim Holtman wrote:
 what is missing is any idea of what the 'patterns' are that you are 
 searching for.  Regular expressions are very sensitive to how you specify 
 the pattern.  You indicated that you have up to 500 elements in the pattern, 
 so what does it look like?  Alternation and backtracking can be very 
 expensive, so a lot more specificity is required.  There are whole books 
 written on how pattern matching works and what is hard and what is easy; 
 this is true wherever regular expressions are used, not just in R.  Also 
 give some idea of the timing: are you talking about 1, 10 or 100 
 seconds, minutes, or hours?
 
 Sent from my iPad
 
 On Nov 5, 2013, at 3:13, Simon Pickert simon.pick...@t-online.de wrote:
 
 How’s that not reproducible?
 
 1. Data frame, one column with text strings
 2. Size of data frame = 4 million observations
 3. A bunch of gsubs in a row (  gsub(patternvector, "[token]", dataframe$text_column)  )
 4. General question: how to speed up string operations on 'large' data sets?
 
 
 Please let me know what more information you need in order to reproduce 
 this example.
 It's more a general type of question, although I think the description above 
 gives you a specific picture of what I'm doing right now.
 
 
 
 
 
 
 General question:
 Am 05.11.2013 um 06:59 schrieb Jeff Newmiller jdnew...@dcn.davis.ca.us:
 
 Example not reproducible. Communication fail. Please refer to Posting 
 Guide.
 ---
 Jeff NewmillerThe .   .  Go Live...
 DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
 Live:   OO#.. Dead: OO#..  Playing
 Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
 /Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
 ---
 Sent from my phone. Please excuse my brevity.
 
 Simon Pickert simon.pick...@t-online.de wrote:
 Hi R’lers,
 
 I’m running into speeding issues, performing a bunch of
 
 „gsub(patternvector, [token],dataframe$text_column)
 
 on a data frame containing 4millionentries.
 
 (The “patternvectors“ contain up to 500 elements)
 
 Is there any better/faster way than performing like 20 gsub commands in
 a row?
 
 
 Thanks!
 Simon
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible 

Re: [R] speed issue: gsub on large data frame

2013-11-05 Thread Carl Witthoft
My feeling is that the **result** you want is far more easily achievable via
a substitution table or a hash table.  Someone better versed in those areas
may want to chime in.  I'm thinking more or less of splitting your character
strings into vectors (separate elements at whitespace) and chunking away.

Something like  charvec[charvec == dataframe$text_column[k]] <-
dataframe$replace_column[k]
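
For illustration, a sketch of that lookup idea using %in% (i.e. match()) on
whitespace-split tokens -- all object names are invented, and it assumes the
tickers appear as standalone, unpunctuated words:

tokens <- strsplit(dataframe$text_column, " ", fixed = TRUE)   # split once
dataframe$text_column <- vapply(tokens, function(tok) {
    tok[tok %in% tickers] <- "[sp500_ticker]"   # exact hashed lookup, no regex
    paste(tok, collapse = " ")
}, character(1))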




Simon Pickert wrote
 Thanks everybody! Now I understand the need for more details:
 
 The patterns for the gsubs are of different kinds. First, I have character
 strings I need to replace: around 5000 stock ticker
 symbols (e.g. c('AAPL', 'EBAY', ...)) distributed across 10 vectors. 
 Second, I have four vectors with regular expressions, all similar to this
 one: replace_url <- c("https?://.*\\s|www.*\\s") 
 
 The text strings I perform the gsub commands on, look like this (no string
 is longer than 200 characters):
 
 'GOOGL announced new partnership www.url.com. Stock price is up +5%‘
 
 After performing several gsubs in a row, like
 
 gsub(replace_url, "[url]", dataframe$text_column) 
 gsub(replace_ticker_sp500, "[sp500_ticker]", dataframe$text_column) 
 etc. 
 
 this string will look like this:
 
 '[sp500_ticker] announced new partnership [url]. Stock price is up
 [positive_percentage]‘





--
View this message in context: 
http://r.789695.n4.nabble.com/speed-issue-gsub-on-large-data-frame-tp4679747p4679769.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed issue: gsub on large data frame

2013-11-04 Thread Simon Pickert
Hi R’lers,

I’m running into speeding issues, performing a bunch of 

„gsub(patternvector, [token],dataframe$text_column)

on a data frame containing 4millionentries.

(The “patternvectors“ contain up to 500 elements) 

Is there any better/faster way than performing like 20 gsub commands in a row?


Thanks!
Simon

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed issue: gsub on large data frame

2013-11-04 Thread Jeff Newmiller
Example not reproducible. Communication fail. Please refer to Posting Guide.
---
Jeff NewmillerThe .   .  Go Live...
DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live Go...
  Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
/Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
--- 
Sent from my phone. Please excuse my brevity.

Simon Pickert simon.pick...@t-online.de wrote:
Hi R’lers,

I’m running into speeding issues, performing a bunch of 

„gsub(patternvector, [token],dataframe$text_column)

on a data frame containing 4millionentries.

(The “patternvectors“ contain up to 500 elements) 

Is there any better/faster way than performing like 20 gsub commands in
a row?


Thanks!
Simon

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-29 Thread Arnaud Mosnier
Thanks Brian, I thought that forking clusters was better ... but as you
mentioned, it is not available on Windows.
Unfortunately, you do not always get to choose the OS used by your company!

Arnaud



Date: Mon, 28 Oct 2013 17:59:10 +
From: Prof Brian Ripley rip...@stats.ox.ac.uk
To: r-help@r-project.org
Subject: Re: [R] speed of makeCluster (package parallel)
Message-ID: 526ea5ee.9060...@stats.ox.ac.uk
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 28/10/2013 16:19, Arnaud Mosnier wrote:
 Hi all,

 I am quite new in the world of parallelization and I wonder if there is a
 way to increase the speed of creation of a parallel socket cluster. The
 time spend to include threads increase exponentially with the number of

It increases linearly in my tests (on a decent OS).  But really if
parallel computing is worthwhile you will be doing minutes of work on
each worker process and the startup time will not be significant.

 thread considered and I use of computer with two 8 cores CPU and thus
 showing a total of 32 threads in windows 7.

The first way to speed things up: use a decent OS:  forking clusters is
much faster.

 Currently, I use the default parameters (type = PSOCK), but is there any
 fine tuning parameters that I can use to take advantage of this system ?

 Thanks in advance for your help !

 Arnaud

 R version 3.0.1 (2013-05-16)
 Platform: x86_64-w64-mingw32/x64 (64-bit)

   [[alternative HTML version deleted]]


--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed of makeCluster (package parallel)

2013-10-28 Thread Arnaud Mosnier
Hi all,

I am quite new to the world of parallelization and I wonder if there is a
way to increase the speed of creating a parallel socket cluster. The
time spent adding workers increases exponentially with the number of
threads considered, and I use a computer with two 8-core CPUs, thus
showing a total of 32 threads in Windows 7.
Currently I use the default parameters (type = "PSOCK"), but are there any
fine-tuning parameters that I can use to take advantage of this system?

Thanks in advance for your help !

Arnaud

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-28 Thread Simon Zehnder
See library(help = "parallel")


On 28 Oct 2013, at 17:19, Arnaud Mosnier a.mosn...@gmail.com wrote:

 Hi all,
 
 I am quite new in the world of parallelization and I wonder if there is a
 way to increase the speed of creation of a parallel socket cluster. The
 time spend to include threads increase exponentially with the number of
 thread considered and I use of computer with two 8 cores CPU and thus
 showing a total of 32 threads in windows 7.
 Currently, I use the default parameters (type = PSOCK), but is there any
 fine tuning parameters that I can use to take advantage of this system ?
 
 Thanks in advance for your help !
 
 Arnaud
 
 R version 3.0.1 (2013-05-16)
 Platform: x86_64-w64-mingw32/x64 (64-bit)
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-28 Thread Arnaud Mosnier
Thanks Simon,

I already read the parallel vignette but I did not find what I wanted.
Maybe you can be more specific about the part of the document that can provide
me with hints!

Arnaud


2013/10/28 Simon Zehnder szehn...@uni-bonn.de

 See library(help = "parallel")


 On 28 Oct 2013, at 17:19, Arnaud Mosnier a.mosn...@gmail.com wrote:

  Hi all,
 
  I am quite new in the world of parallelization and I wonder if there is a
  way to increase the speed of creation of a parallel socket cluster. The
  time spend to include threads increase exponentially with the number of
  thread considered and I use of computer with two 8 cores CPU and thus
  showing a total of 32 threads in windows 7.
  Currently, I use the default parameters (type = PSOCK), but is there
 any
  fine tuning parameters that I can use to take advantage of this system ?
 
  Thanks in advance for your help !
 
  Arnaud
 
  R version 3.0.1 (2013-05-16)
  Platform: x86_64-w64-mingw32/x64 (64-bit)
 
[[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-28 Thread Simon Zehnder
First,

use only the number of physical cores as the number of threads - i.e. I would not use 
hyper-threading, etc. Each core has its own caches and it is always fortunate if a 
process has enough memory; hyper-threads all share the cache on the core 
they are running on. detectCores() gives me for example 4 - but I know I have 2 
physical cores, so I would call makeCluster() with 2 nodes. mcaffinity() lets you 
perform a technique called process pinning (see process affinity) and is only 
possible if the OS supports it. It sometimes makes sense to assign certain 
processes to certain CPUs such that each process has enough cache memory 
(e.g. for a 16-core machine, using 8 processes on CPUs 1, 3, 5, 7, 9, 11, 13 and 
15, so each process has the cache of two CPUs). 
A lot of these functions, though, are not available on Windows. 

First comes the problem you want to solve; then you look at how much memory 
will be used in a process and how much you have (more often the memory 
bandwidth is the bottleneck, not the computing power). Look at the 
architecture of your chips (how much L1 cache, how much L2 cache). Then you 
decide how many cores to use and whether it makes sense to pin processes to 
certain cores. 

There are no general recipes for parallel computing - each problem is 
different. Some problems are not even scalable. 

Simon
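
For illustration, a minimal sketch of that advice (the worker count is an
assumption -- the poster's machine reports 32 logical threads on 16 physical
cores, so half of detectCores() is used here):

library(parallel)
nthreads <- detectCores()                         # logical threads, may include hyper-threads
cl <- makeCluster(max(1, nthreads %/% 2), type = "PSOCK")
parSapply(cl, 1:8, function(i) sum(rnorm(1e6)))   # toy workload
stopCluster(cl)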


On 28 Oct 2013, at 17:51, Arnaud Mosnier a.mosn...@gmail.com wrote:

 Thanks Simon,
 
 I already read the parallel vignette but I did not found what I wanted.
 May be you can be more specific on a part of the document that can provide me 
 hints !
 
 Arnaud
 
 
 2013/10/28 Simon Zehnder szehn...@uni-bonn.de
 See library(help = "parallel")
 
 
 On 28 Oct 2013, at 17:19, Arnaud Mosnier a.mosn...@gmail.com wrote:
 
  Hi all,
 
  I am quite new in the world of parallelization and I wonder if there is a
  way to increase the speed of creation of a parallel socket cluster. The
  time spend to include threads increase exponentially with the number of
  thread considered and I use of computer with two 8 cores CPU and thus
  showing a total of 32 threads in windows 7.
  Currently, I use the default parameters (type = PSOCK), but is there any
  fine tuning parameters that I can use to take advantage of this system ?
 
  Thanks in advance for your help !
 
  Arnaud
 
  R version 3.0.1 (2013-05-16)
  Platform: x86_64-w64-mingw32/x64 (64-bit)
 
[[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-28 Thread Prof Brian Ripley

On 28/10/2013 16:19, Arnaud Mosnier wrote:

Hi all,

I am quite new in the world of parallelization and I wonder if there is a
way to increase the speed of creation of a parallel socket cluster. The
time spend to include threads increase exponentially with the number of


It increases linearly in my tests (on a decent OS).  But really if 
parallel computing is worthwhile you will be doing minutes of work on 
each worker process and the startup time will not be significant.



thread considered and I use of computer with two 8 cores CPU and thus
showing a total of 32 threads in windows 7.


The first way to speed things up: use a decent OS:  forking clusters is 
much faster.
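
For illustration, a sketch comparing the startup cost of the two cluster types
(FORK is only available on Unix-alikes; 8 workers is an arbitrary choice):

library(parallel)
system.time(cl1 <- makeCluster(8, type = "PSOCK"))  # socket cluster, works everywhere
stopCluster(cl1)
system.time(cl2 <- makeCluster(8, type = "FORK"))   # forked workers, not on Windows
stopCluster(cl2)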



Currently, I use the default parameters (type = PSOCK), but is there any
fine tuning parameters that I can use to take advantage of this system ?

Thanks in advance for your help !

Arnaud

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[[alternative HTML version deleted]]



--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UKFax:  +44 1865 272595

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of makeCluster (package parallel)

2013-10-28 Thread Arnaud Mosnier
Thanks a lot Simon, that's useful.
I will take a look at this process-pinning things.

Arnaud


2013/10/28 Simon Zehnder szehn...@uni-bonn.de

 First,

 use only the number of cores as a number of thread - i.e. I would not use
 hyper threading, etc.. Each core has its own caches and it is always
 fortunate if a process has enough memory; hyper threads use all the same
 cache on the core they are running on. detectCores() gives me for example 4
 - I know I have 2. I would therefore call makeCluster() with nnode = 2.
 mcaffinity() lets you perform a technique called process-pinning (see
 process affinity) and is only possible if the OS supports it. It makes
 sometimes sense to assign certain processes to certain CPUs such that each
 process has enough memory in caches (e.g. for a 16 Core machine using 8
 processes on CPUs 1, 3, 5, 7, 9, 11, 13 and 15; so each process has the
 cache of two CPUs).
 A lot of functions though are not available for Windows.

 At first it comes always the problem you want to solve and then you look
 how much memory will be used in a process and how much you have (more often
 the memory bandwidth is the bottleneck and not the computing power). Look
 at the architecture of your chips (how much L1 Cache, how much L2 cache).
 Then you decide how many cores to use and if it makes sense to pin
 processes to certain cores.

 There are no general recipes for parallel computing - each problem is
 different. Some problems are even not scalable.

 Simon


 On 28 Oct 2013, at 17:51, Arnaud Mosnier a.mosn...@gmail.com wrote:

  Thanks Simon,
 
  I already read the parallel vignette but I did not found what I wanted.
  May be you can be more specific on a part of the document that can
 provide me hints !
 
  Arnaud
 
 
  2013/10/28 Simon Zehnder szehn...@uni-bonn.de
  See library(help = "parallel")
 
 
  On 28 Oct 2013, at 17:19, Arnaud Mosnier a.mosn...@gmail.com wrote:
 
   Hi all,
  
   I am quite new in the world of parallelization and I wonder if there
 is a
   way to increase the speed of creation of a parallel socket cluster. The
   time spend to include threads increase exponentially with the number of
   thread considered and I use of computer with two 8 cores CPU and thus
   showing a total of 32 threads in windows 7.
   Currently, I use the default parameters (type = PSOCK), but is there
 any
   fine tuning parameters that I can use to take advantage of this system
 ?
  
   Thanks in advance for your help !
  
   Arnaud
  
   R version 3.0.1 (2013-05-16)
   Platform: x86_64-w64-mingw32/x64 (64-bit)
  
 [[alternative HTML version deleted]]
  
   __
   R-help@r-project.org mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
   and provide commented, minimal, self-contained, reproducible code.
 
 



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up a function

2013-07-15 Thread Santiago Guallar
Dear Petr,

Sorry for the delay. I've been out.
Unfortunately, your code doesn't work either even when using fromLast = T.
Thank you for your help and your time.

Santi




 From: PIKAL Petr petr.pi...@precheza.cz
To: Santiago Guallar sgual...@yahoo.com 
Cc: r-help r-help@r-project.org 
Sent: Wednesday, July 10, 2013 8:35 AM
Subject: RE: [R] spped up a function
 


 
Hi Santiago
 
Keep conversation in list. Others can have better ideas.
 
I am still missing the reasoning.
 
Merge seems to me to be the solution, but I am lost in your reasoning about what to keep and 
what to discard from the resulting object.
 
After merge I have this
 
result <- structure(list(Ring = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("6106933", "6134701", "6140497", "6140719", "6140756",
"6140855", "6143070", "6143090", "6143093", "6175711", "6175726",
"6175730", "6175769", "6175776", "6175784", "6188609", "6188705",
"6195159", "6195171", "6198153", "6198154", "6198156", "6198157",
"6198172"), class = "factor"), jul = c(15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135), timepos = structure(c(1307680575, 1307680740,
1307681040, 1307681340, 1307681640, 1307681940, 1307682240, 1307682540,
1307682780, 1307683080, 1307683380, 1307683680, 1307683980, 1307684280,
1307684397, 1307684424, 1307684484, 1307684490, 1307684580, 1307684880,
1307685180, 1307685243, 1307685321, 1307685336), class = c("POSIXct",
"POSIXt"), tzone = "GMT"), act = c(3822L, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 27L, 60L, 6L, 753L, NA, NA, NA,
78L, 15L, 18L), wd = c("dry", NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, "wet", "dry", "wet", "dry", NA, NA, NA, "wet",
"dry", "wet")), .Names = c("Ring", "jul", "timepos", "act", "wd"
), row.names = c(NA, -24L), class = "data.frame")
 
 result
      Ring   jul             timepos  act   wd
1  6106933 15135 2011-06-10 04:36:15 3822  dry
2  6106933 15135 2011-06-10 04:39:00   NA NA
3  6106933 15135 2011-06-10 04:44:00   NA NA
4  6106933 15135 2011-06-10 04:49:00   NA NA
5  6106933 15135 2011-06-10 04:54:00   NA NA
6  6106933 15135 2011-06-10 04:59:00   NA NA
7  6106933 15135 2011-06-10 05:04:00   NA NA
8  6106933 15135 2011-06-10 05:09:00   NA NA
9  6106933 15135 2011-06-10 05:13:00   NA NA
10 6106933 15135 2011-06-10 05:18:00   NA NA
11 6106933 15135 2011-06-10 05:23:00   NA NA
12 6106933 15135 2011-06-10 05:28:00   NA NA
13 6106933 15135 2011-06-10 05:33:00   NA NA
14 6106933 15135 2011-06-10 05:38:00   NA NA
15 6106933 15135 2011-06-10 05:39:57   27  wet
16 6106933 15135 2011-06-10 05:40:24   60  dry
17 6106933 15135 2011-06-10 05:41:24    6  wet
18 6106933 15135 2011-06-10 05:41:30  753  dry
19 6106933 15135 2011-06-10 05:43:00   NA NA
20 6106933 15135 2011-06-10 05:48:00   NA NA
21 6106933 15135 2011-06-10 05:53:00   NA NA
22 6106933 15135 2011-06-10 05:54:03   78  wet
23 6106933 15135 2011-06-10 05:55:21   15  dry
24 6106933 15135 2011-06-10 05:55:36   18  wet
 
I understand you want to keep only time values from the GPS data.frame. OK, this 
can be done in the last step. But I am a bit lost in the logic for discarding 
lines 15-18. Anyway, this may be what you want:
 
library(zoo)
result$wd <- na.locf(result$wd)
final <- result[is.na(result$act), ]
 final
      Ring   jul             timepos act  wd
2  6106933 15135 2011-06-10 04:39:00  NA dry
3  6106933 15135 2011-06-10 04:44:00  NA dry
4  6106933 15135 2011-06-10 04:49:00  NA dry
5  6106933 15135 2011-06-10 04:54:00  NA dry
6  6106933 15135 2011-06-10 04:59:00  NA dry
7  6106933 15135 2011-06-10 05:04:00  NA dry
8  6106933 15135 2011-06-10 05:09:00  NA dry
9  6106933 15135 2011-06-10 05:13:00  NA dry
10 6106933 15135 2011-06-10 05:18:00  NA dry
11 6106933 15135 2011-06-10 05:23:00  NA dry
12 6106933 15135 2011-06-10 05:28:00  NA dry
13 6106933 15135 2011-06-10 05:33:00  NA dry
14 6106933 15135 2011-06-10 05:38:00  NA dry
19 6106933 15135 2011-06-10 05:43:00  NA dry
20 6106933 15135 2011-06-10 05:48:00  NA dry
21 6106933 15135 2011-06-10 05:53:00  NA dry
  
 
Regards
Petr
 
From:Santiago Guallar [mailto:sgual...@yahoo.com] 
Sent: Tuesday, July 09, 2013 10:02 PM
To: PIKAL Petr
Subject: Re: [R] spped up a function
 
Dear Petr,
 
I wanted the two data sets merged in such a way that the values of the 'wd' 
vector (from the time intervals of 'xact') are assigned to the corresponding 
intervals of 'GPS'. If there is more than one value (i.e. if there is more than 
one interval of 'xact' for the corresponding interval of 'GPS'), then take the 
maximum (i.e. the value of the 'xact' interval closest to the corresponding 
interval of 'GPS'). This is why the output of the particular sequence of the 
result I copied in the previous message contains only 'dry'.
 
Santi
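 
For reference, a compact sketch of the same "carry the last wd forward" idea
using findInterval() instead of merge() + na.locf(), assuming both data frames
are sorted by timepos and that every GPS time is preceded by at least one xact
record (as in the toy data):

idx <- findInterval(as.numeric(GPS$timepos), as.numeric(xact$timepos))
GPS$wd <- xact$wd[idx]   # wd of the last xact record at or before each GPS time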
 
 
From:PIKAL Petr petr.pi...@precheza.cz

Re: [R] speed up a function

2013-07-15 Thread PIKAL Petr
Hm, so you are probably asking for something you actually do not want.

AFAIK what I called "final" is the same as what you asked for with your toy 
data, except for the column "act", which you can easily get rid of.

If what I suggested does not work with your real data, you should prepare a better 
example with which my suggestion does not give the desired results.

Regards
Petr


GPS  <- structure(list(Ring = c(6106933L, 6106933L, 6106933L, 6106933L,
6106933L, 6106933L, 6106933L, 6106933L, 6106933L, 6106933L, 6106933L,
6106933L, 6106933L, 6106933L, 6106933L, 6106933L), jul = c(15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135), timepos = structure(c(1307680740,
1307681040, 1307681340, 1307681640, 1307681940, 1307682240, 1307682540,
1307682780, 1307683080, 1307683380, 1307683680, 1307683980, 1307684280,
1307684580, 1307684880, 1307685180), class = c("POSIXct", "POSIXt"
), tzone = "GMT")), .Names = c("Ring", "jul", "timepos"), row.names = c("5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20"), class = "data.frame")

xact <- structure(list(Ring = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("6106933", "6134701", "6140497", "6140719", "6140756",
"6140855", "6143070", "6143090", "6143093", "6175711", "6175726",
"6175730", "6175769", "6175776", "6175784", "6188609", "6188705",
"6195159", "6195171", "6198153", "6198154", "6198156", "6198157",
"6198172"), class = "factor"), jul = c(15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135), timepos = structure(c(1307680575,
1307684397, 1307684424, 1307684484, 1307684490, 1307685243, 1307685321,
1307685336), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
act = c(3822L, 27L, 60L, 6L, 753L, 78L, 15L, 18L), wd = c("dry",
"wet", "dry", "wet", "dry", "wet", "dry", "wet")), .Names = c("Ring",
"jul", "timepos", "act", "wd"), row.names = 170:177, class = "data.frame")

GPS$Ring <- factor(GPS$Ring)
result <- merge(xact, GPS, all = T)
library(zoo)
result$wd <- na.locf(result$wd)
final <- result[is.na(result$act), ]
final

 final
  Ring   jul timepos act  wd
2  6106933 15135 2011-06-10 04:39:00  NA dry
3  6106933 15135 2011-06-10 04:44:00  NA dry
4  6106933 15135 2011-06-10 04:49:00  NA dry
5  6106933 15135 2011-06-10 04:54:00  NA dry
6  6106933 15135 2011-06-10 04:59:00  NA dry
7  6106933 15135 2011-06-10 05:04:00  NA dry
8  6106933 15135 2011-06-10 05:09:00  NA dry
9  6106933 15135 2011-06-10 05:13:00  NA dry
10 6106933 15135 2011-06-10 05:18:00  NA dry
11 6106933 15135 2011-06-10 05:23:00  NA dry
12 6106933 15135 2011-06-10 05:28:00  NA dry
13 6106933 15135 2011-06-10 05:33:00  NA dry
14 6106933 15135 2011-06-10 05:38:00  NA dry
19 6106933 15135 2011-06-10 05:43:00  NA dry
20 6106933 15135 2011-06-10 05:48:00  NA dry
21 6106933 15135 2011-06-10 05:53:00  NA dry

This is what you have asked for. Seems the same to me.

head(GPS1, 16) and desired result (added column wd)
  Ring   jul timepos wd
5  6106933 15135 2011-06-10 04:39:00 dry
6  6106933 15135 2011-06-10 04:44:00 dry
7  6106933 15135 2011-06-10 04:49:00 dry
8  6106933 15135 2011-06-10 04:54:00 dry
9  6106933 15135 2011-06-10 04:59:00 dry
10 6106933 15135 2011-06-10 05:04:00 dry
11 6106933 15135 2011-06-10 05:09:00 dry
12 6106933 15135 2011-06-10 05:13:00 dry
13 6106933 15135 2011-06-10 05:18:00 dry
14 6106933 15135 2011-06-10 05:23:00 dry
15 6106933 15135 2011-06-10 05:28:00 dry
16 6106933 15135 2011-06-10 05:33:00 dry
17 6106933 15135 2011-06-10 05:38:00 dry
18 6106933 15135 2011-06-10 05:43:00 dry
19 6106933 15135 2011-06-10 05:48:00 dry
20 6106933 15135 2011-06-10 05:53:00 dry


Petr Pikal

From: Santiago Guallar [mailto:sgual...@yahoo.com]
Sent: Monday, July 15, 2013 4:29 PM
To: PIKAL Petr
Cc: r-help
Subject: Re: [R] speed up a function

Dear Petr,


Sorry for the delay. I've been out.
Unfortunately, your code doesn't work either even when using fromLast = T.
Thank you for your help and your time.

Santi



From: PIKAL Petr petr.pi...@precheza.czmailto:petr.pi...@precheza.cz
To: Santiago Guallar sgual...@yahoo.commailto:sgual...@yahoo.com
Cc: r-help r-help@r-project.orgmailto:r-help@r-project.org
Sent: Wednesday, July 10, 2013 8:35 AM
Subject: RE: [R] spped up a function

Hi Santiago

Keep conversation in list. Others can have better ideas.

I am still missing the reasoning.

Merge seems to me to be the solution, but I am lost in your reasoning about what to keep and 
what to discard from the resulting object.

After merge I have this

result - structure(list(Ring = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c(6106933, 6134701, 6140497, 6140719, 6140756,
6140855, 6143070, 6143090, 6143093, 6175711, 6175726,
6175730, 6175769, 6175776, 6175784, 6188609, 6188705,
6195159, 6195171, 6198153, 6198154, 6198156, 6198157,
6198172), class = factor), jul = c(15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135, 15135,
15135, 15135), timepos

[R] Speed up or alternative to 'For' loop

2013-06-10 Thread Trevor Walker
I have a For loop that is quite slow and am wondering if there is a faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }

Trevor Walker
Email: trevordaviswal...@gmail.com

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread Rui Barradas

Hello,

One way to speed it up is to use a matrix instead of a data.frame. Since 
data.frames can hold data of all classes, access to their elements 
is slow. Your data is all numeric, so it can be held in a matrix. The 
second way below gave me a speed-up by a factor of 50.



system.time({
for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }
})

system.time({
df2 <- data.matrix(df)
for(i in seq_len(nrow(df2))[-1]){
    if(df2[i, "TreeID"] == df2[i - 1, "TreeID"])
        df2[i, "HeightGrowth"] <- df2[i, "Height"] - df2[i - 1, "Height"]
}
})

all.equal(df, as.data.frame(df2))  # TRUE


Hope this helps,

Rui Barradas

Em 10-06-2013 18:28, Trevor Walker escreveu:

I have a For loop that is quite slow and am wondering if there is a faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
  {if(df$TreeID[i]==df$TreeID[i-1])
   {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
   }
  }

Trevor Walker
Email: trevordaviswal...@gmail.com

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread MacQueen, Don
How about

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- cumsum(df$Height[in.ir])
}

Seemed fast enough to me.

In R, it is generally good to look for ways to operate on entire vectors
or arrays, rather than element by element within them. The cumsum()
function does that in this example.

-Don


-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 6/10/13 10:28 AM, Trevor Walker trevordaviswal...@gmail.com wrote:

I have a For loop that is quite slow and am wondering if there is a faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }

Trevor Walker
Email: trevordaviswal...@gmail.com

   [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread David Winsemius

On Jun 10, 2013, at 10:28 AM, Trevor Walker wrote:

 I have a For loop that is quite slow and am wondering if there is a faster
 option:
 
 df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
 df$Height <- exp(-0.1 + 0.2*df$Age)
 df$HeightGrowth <- NA   # initialize with NA
 for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }
 
I avoid tests with if(){} else {}. Use vectorized code, possibly with 'ifelse', but 
in this case you need a function that does calculations within groups.

The ave() function with diff() will do it compactly and efficiently:

 df <- data.frame(TreeID=rep(1:5,each=4), Age=rep(seq(1,4,1),5))
 df$Height <- exp(-0.1 + 0.2*df$Age)
 df$HeightGrowth <- NA   # initialize with NA

 df$HeightGrowth <- ave(df$Height, df$TreeID, FUN= function(vec) c(NA, 
 diff(vec)))
 df
   TreeID Age   Height HeightGrowth
1       1   1 1.105171           NA
2       1   2 1.349859    0.2446879
3       1   3 1.648721    0.2988625
4       1   4 2.013753    0.3650314
5       2   1 1.105171           NA
6       2   2 1.349859    0.2446879
7       2   3 1.648721    0.2988625
8       2   4 2.013753    0.3650314
9       3   1 1.105171           NA
10      3   2 1.349859    0.2446879
11      3   3 1.648721    0.2988625
12      3   4 2.013753    0.3650314
13      4   1 1.105171           NA
14      4   2 1.349859    0.2446879
15      4   3 1.648721    0.2988625
16      4   4 2.013753    0.3650314
17      5   1 1.105171           NA
18      5   2 1.349859    0.2446879
19      5   3 1.648721    0.2988625
20      5   4 2.013753    0.3650314

(On my machine it was over six times as fast as the if-based code from Arun. )

-- 

David Winsemius
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread MacQueen, Don
Sorry, it looks like I was hasty.
Absent another dumb mistake, the following should do it.

The request was for differences, i.e., the amount of growth from one
period to the next, separately for each tree.

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- c(NA, diff(df$Height[in.ir]))
}



And this gives the same result as Rui Barradas' previous response.

-Don

-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 6/10/13 2:51 PM, MacQueen, Don macque...@llnl.gov wrote:

How about

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- cumsum(df$Height[in.ir])
}

Seemed fast enough to me.

In R, it is generally good to look for ways to operate on entire vectors
or arrays, rather than element by element within them. The cumsum()
function does that in this example.

-Don


-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 6/10/13 10:28 AM, Trevor Walker trevordaviswal...@gmail.com wrote:

I have a For loop that is quite slow and am wondering if there is a
faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }

Trevor Walker
Email: trevordaviswal...@gmail.com

  [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread MacQueen, Don
Well, speaking of hasty...

This will also do it, provided that each tree's initial height is less
than the previous tree's final height. In principle, not a safe
assumption, but might be ok depending on where the data came from.

df$delta <- c(NA, diff(df$Height))
df$delta[df$delta < 0] <- NA
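
For illustration, a sketch of a quick check of that assumption before relying on
the shortcut (using the df from the example above):

first <- which(diff(df$TreeID) != 0) + 1        # first row of each new tree
all(df$Height[first] < df$Height[first - 1])    # TRUE if every tree starts below
                                                # the previous tree's final height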

-Don



-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 6/10/13 2:51 PM, MacQueen, Don macque...@llnl.gov wrote:

How about

for (ir in unique(df$TreeID)) {
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- cumsum(df$Height[in.ir])
}

Seemed fast enough to me.

In R, it is generally good to look for ways to operate on entire vectors
or arrays, rather than element by element within them. The cumsum()
function does that in this example.

-Don


-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 6/10/13 10:28 AM, Trevor Walker trevordaviswal...@gmail.com wrote:

I have a For loop that is quite slow and am wondering if there is a
faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
 {if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
 }

Trevor Walker
Email: trevordaviswal...@gmail.com

  [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up or alternative to 'For' loop

2013-06-10 Thread arun
Hi,
Some speed comparisons:


df <- data.frame(TreeID=rep(1:6000,each=20), Age=rep(seq(1,20,1),6000))
df$Height <- exp(-0.1 + 0.2*df$Age)
df1 <- df
df3 <- df
library(data.table)
dt1 <- data.table(df)
df$HeightGrowth <- NA 


system.time({  # Rui's 2nd function
df2 <- data.matrix(df)
for(i in seq_len(nrow(df2))[-1]){
    if(df2[i, "TreeID"] == df2[i - 1, "TreeID"])
        df2[i, "HeightGrowth"] <- df2[i, "Height"] - df2[i - 1, "Height"]
}
})
#  user  system elapsed 
# 1.108   0.000   1.109 


system.time({for (ir in unique(df$TreeID)) {   # Don's first function
  in.ir <- df$TreeID == ir
  df$HeightGrowth[in.ir] <- c(NA, diff(df$Height[in.ir]))
}})
#   user  system elapsed 
#100.004   0.704 100.903 

system.time({df3$delta <- c(NA, diff(df3$Height))   # Don's 2nd function
df3$delta[df3$delta < 0] <- NA})   # winner 
#  user  system elapsed 
# 0.016   0.000   0.014 

system.time(df1$HeightGrowth <- ave(df1$Height, df1$TreeID, FUN= function(vec) 
c(NA, diff(vec))))   # David's
#  user  system elapsed 
# 0.136   0.000   0.137 
system.time(dt1[, HeightGrowth := c(NA, diff(Height)), by=TreeID])
#  user  system elapsed 
# 0.076   0.000   0.079 


 identical(df1,as.data.frame(dt1))
#[1] TRUE
 identical(df1,df)
#[1] TRUE


head(df1,2)
#  TreeID Age   Height HeightGrowth
#1  1   1 1.105171   NA
#2  1   2 1.349859    0.2446879
head(df2,2)
# TreeID Age   Height HeightGrowth
#[1,]  1   1 1.105171   NA
#[2,]  1   2 1.349859    0.2446879

A.K.



- Original Message -
From: Trevor Walker trevordaviswal...@gmail.com
To: r-help@r-project.org
Cc: 
Sent: Monday, June 10, 2013 1:28 PM
Subject: [R] Speed up or alternative to 'For' loop

I have a For loop that is quite slow and am wondering if there is a faster
option:

df <- data.frame(TreeID=rep(1:500,each=20), Age=rep(seq(1,20,1),500))
df$Height <- exp(-0.1 + 0.2*df$Age)
df$HeightGrowth <- NA   # initialize with NA
for (i in 2:nrow(df))
{if(df$TreeID[i]==df$TreeID[i-1])
  {df$HeightGrowth[i] <- df$Height[i]-df$Height[i-1]
  }
}

Trevor Walker
Email: trevordaviswal...@gmail.com

    [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-29 Thread Mikhail Umorin

Thank you all very much for your time and suggestions. The link to 
stackoverflow was very helpful. Here are some timings in case someone wants to 
know. (I noticed that microbenchmark results vary, depending on how many 
functions one tries to benchmark at a time. However, the min stays about the 
same)

# just to refresh, most of the code is from the stackoverflow link provided by 
# Martin Morgan:
# http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match

library(compiler)         # for cmpfun()
library(inline)           # for cfunction()
library(microbenchmark)

f0 <- function(v) length(which(v < 0))

f1 <- function(v) sum(v < 0)

f2 <- function(v) which.min(v < 0) - 1L

f3 <- function(x) { # binary search implemented in R
    imin <- 1L
    imax <- length(x)
    while (imax >= imin) {
        imid <- as.integer(imin + (imax - imin) / 2)
        if (x[imid] >= 0)
            imax <- imid - 1L
        else
            imin <- imid + 1L
    }
    imax
}

f3.c <- cmpfun(f3) # pre-compiled

# binary search in C
f4 <- cfunction(c(x = "numeric"), "
    int imin = 0, imax = Rf_length(x) - 1, imid;
    while (imax >= imin) {
        imid = imin + (imax - imin) / 2;
        if (REAL(x)[imid] >= 0)
            imax = imid - 1;
        else
            imin = imid + 1;
    }
    return ScalarInteger(imax + 1);
")

# this one is a separate suggestion by William Dunlap:
f5 <- function(v) {
  tabulate(findInterval(v, c(-Inf, 0, 1, Inf)))[1]
}

vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
# the identity of the results was verified

microbenchmark(f1(vec), f2(vec), f3(vec), f3.c(vec), f4(vec), f5(vec))
Unit: microseconds
      expr       min         lq     median         uq       max neval
   f1(vec) 17054.233 17831.1385  18514.305 19512.4705 54603.435   100
   f2(vec) 23624.353 25026.4265  26034.785 29322.1150 60014.458   100
   f3(vec)    76.902    93.2340    111.834   116.8370   129.888   100
 f3.c(vec)    21.883    30.7530     37.757    54.1250    62.939   100
   f4(vec)     6.575    10.5885     30.389    31.9385    37.610   100
   f5(vec) 35365.088 36767.6175  38317.103 40671.2000 69209.425   100


So, I'll try to go with the inline binary search and see if I can precompile 
complex conditions.

Thank you, again, for your help!

Mikhail.




On Friday, April 26, 2013 20:52:27 Suzen, Mehmet wrote:
 Hello Mikhail,
 
 I could suggest you to use ff package for fast access to large data
 structures:
 
 http://cran.r-project.org/web/packages/ff/index.html
 http://wsopuppenkiste.wiso.uni-goettingen.de/ff/ff_1.0/inst/doc/ff.pdf
 
 Best
 
 Mehmet
 
 On 26 April 2013 18:12, Mikhail Umorin mike...@gmail.com wrote:
  Hello,
  
  I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
  are sorted (with duplicates) in the vector (v). I am obtaining the length
  of vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some
  scalar variables. What is the most efficient way to do this?
  
  I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems
  to me more efficient than length(which(v < c)), but, please, correct me
  if I'm wrong. So, is there anything faster than what I already use?
  
  I'm running R 2.14.2 on Linux kernel 3.4.34.
  
  I appreciate your time,
  
  Mikhail
  
  [[alternative HTML version deleted]]
  
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html and provide commented,
  minimal, self-contained, reproducible code.
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed of a vector operation question

2013-04-26 Thread Mikhail Umorin
Hello, 

I am dealing with numeric vectors 10^5 to 10^6 elements long. The values are 
sorted (with duplicates) in the vector (v). I am obtaining the length of 
vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some scalar 
variables. What is the most efficient way to do this?

I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems to 
me more efficient than length(which(v < c)), but, please, correct me if I'm 
wrong. So, is there anything faster than what I already use?

I'm running R 2.14.2 on Linux kernel 3.4.34.

I appreciate your time, 

Mikhail
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread lcn
I think the sum way is the best.


On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:

 Hello,

 I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
 are sorted (with duplicates) in the vector (v). I am obtaining the length of
 vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some
 scalar variables. What is the most efficient way to do this?

 I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems
 to me more efficient than length(which(v < c)), but, please, correct me if I'm
 wrong. So, is there anything faster than what I already use?

 I'm running R 2.14.2 on Linux kernel 3.4.34.

 I appreciate your time,

 Mikhail
 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread William Dunlap

 I think the sum way is the best.

On my Linux machine running R-3.0.0 the sum way is slightly faster:
   x <- rexp(1e6, 2)
   system.time(for(i in 1:100) sum(x > .3 & x < .5))
     user  system elapsed
    4.664   0.340   5.018
   system.time(for(i in 1:100) length(which(x > .3 & x < .5)))
     user  system elapsed
    5.017   0.160   5.186

If you are doing many of these counts on the same dataset you
can save time by using functions like cut(), table(), ecdf(), and
findInterval().  E.g.,
 system.time(r1 <- vapply(seq(0,1,by=1/128)[-1], function(i) sum(x > (i-1/128) & 
 x <= i), FUN.VALUE=0L))
   user  system elapsed
  5.332   0.568   5.909
 system.time(r2 <- table(cut(x, seq(0,1,by=1/128))))
   user  system elapsed
  0.500   0.008   0.511
 all.equal(as.vector(r1), as.vector(r2))
[1] TRUE

You should do the timings yourself, as the relative speeds will depend
on the version or dialect of  the R interpreter and how it was compiled.
E.g., with the current development version of 'TIBCO Enterprise Runtime for R' 
(aka 'TERR')
on this same 8-core Linux box the sum way is considerably faster than
the length(which) way:
   x <- rexp(1e6, 2)
   system.time(for(i in 1:100) sum(x > .3 & x < .5))
     user  system elapsed
     1.87    0.03    0.48
   system.time(for(i in 1:100) length(which(x > .3 & x < .5)))
     user  system elapsed
     3.21    0.04    0.83
   system.time(r1 <- vapply(seq(0,1,by=1/128)[-1], function(i) sum(x > (i-1/128) & 
   x <= i), FUN.VALUE=0L))
     user  system elapsed
     2.19    0.04    0.56
   system.time(r2 <- table(cut(x, seq(0,1,by=1/128))))
     user  system elapsed
     0.27    0.01    0.13
   all.equal(as.vector(r1), as.vector(r2))
  [1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
 Behalf
 Of lcn
 Sent: Friday, April 26, 2013 12:09 PM
 To: Mikhail Umorin
 Cc: r-help@r-project.org
 Subject: Re: [R] speed of a vector operation question
 
 I think the sum way is the best.
 
 
 On Fri, Apr 26, 2013 at 9:12 AM, Mikhail Umorin mike...@gmail.com wrote:
 
  Hello,
 
  I am dealing with numeric vectors 10^5 to 10^6 elements long. The values
  are sorted (with duplicates) in the vector (v). I am obtaining the length of
  vectors such as (v < c) or (v > c1 & v < c2), where c, c1, c2 are some
  scalar variables. What is the most efficient way to do this?
 
  I am using sum(v < c) since TRUE's are 1's and FALSE's are 0's. This seems
  to me more efficient than length(which(v < c)), but, please, correct me if I'm
  wrong. So, is there anything faster than what I already use?
 
  I'm running R 2.14.2 on Linux kernel 3.4.34.
 
  I appreciate your time,
 
  Mikhail
  [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread Martin Morgan
A very similar question was asked on StackOverflow (by Mikhail? and then I guess 
the answers there were somehow not satisfactory...)



http://stackoverflow.com/questions/16213029/more-efficient-strategy-for-which-or-match

where it turns out that a binary search (implemented in R) on the sorted vector
is much faster than sum, etc. I guess because it's log N without copying. The
more complicated condition x > .3 & x < .5 could be satisfied with multiple
calls to the search.
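
For readers who do not want to chase the link, here is a minimal sketch of the
binary-search idea in plain R (my own illustration, not the code from the
StackOverflow answer); it assumes the vector xs is already sorted and counts
the elements below a cutoff:

count_below <- function(xs, cutoff) {
  lo <- 0L; hi <- length(xs) + 1L
  while (hi - lo > 1L) {              # invariant: xs[lo] < cutoff <= xs[hi]
    mid <- (lo + hi) %/% 2L
    if (xs[mid] < cutoff) lo <- mid else hi <- mid
  }
  lo                                  # number of elements strictly below cutoff
}
xs <- sort(rexp(1e6, 2))
count_below(xs, .5) - count_below(xs, .3)   # count of values in [.3, .5)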


Martin





--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed of a vector operation question

2013-04-26 Thread William Dunlap
R's findInterval can also take advantage of a sorted x vector.  E.g.,
in R-3.0.0 on the same 8-core Linux box:

 x <- rexp(1e6, 2)
 system.time(for(i in 1:100) tabulate(findInterval(x, c(-Inf, .3, .5, Inf)))[2])
   user  system elapsed
  2.444   0.000   2.446
 xs <- sort(x)
 system.time(for(i in 1:100) tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2])
   user  system elapsed
  1.472   0.000   1.475

 tabulate(findInterval(xs, c(-Inf, .3, .5, Inf)))[2]
[1] 180636
 sum( xs > .3 & xs <= .5 )
[1] 180636
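
Since the sorted copy is in hand anyway, findInterval() can also be used as a
cumulative counter, which avoids the tabulate() step; a small sketch with the
same xs and the same cutpoints:

 findInterval(.5, xs) - findInterval(.3, xs)   # number of xs with .3 < x <= .5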


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com



[R] speed up merge

2012-03-02 Thread Ben quant
Hello,

I have a nasty loop that I have to do 11877 times. The only thing that
slows it down really is this merge:

xx1 = merge(dt,ua_rd,by.x=1,by.y= 'rt_date',all.x=T)

Any ideas on how to speed it up? The output can't change materially (it
works), but I'd like it to go faster. I'm looking at getting around the
loop (not shown), but I'm trying to speed up the merge first. I'll post
regarding the loop if nothing comes of this post.

Here is some information on what type of stuff is going into the merge:

 class(ua_rd)
[1] matrix
 dim(ua_rd)
[1] 20  2
 head(ua_rd)
   AName  rt_date
2007-03-31 14066.580078125 2007-04-26
2007-06-30 14717   2007-07-19
2007-09-30 15528   2007-10-25
2007-12-31 17609   2008-01-24
2008-03-31 17168   2008-04-24
2008-06-30 17681   2008-07-17
 class(dt)
[1] character
 length(dt)
[1] 1799
 dt[1:10]
 [1] 2007-03-31 2007-04-01 2007-04-02 2007-04-03 2007-04-04
2007-04-05 2007-04-06 2007-04-07
 [9] 2007-04-08 2007-04-09

thanks,

Ben

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up merge

2012-03-02 Thread kees duineveld
Hi Ben,

It seems you merge a matrix and a vector. As far as I understand the
first thing merge does is convert these to data.frame. Is it possible
to make the preceding steps give data frames?

Regards,
Kees


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up merge

2012-03-02 Thread Hans Ekbrand
On Fri, Mar 02, 2012 at 03:24:20AM -0700, Ben quant wrote:
 Hello,
 
 I have a nasty loop that I have to do 11877 times. 

Are you completely sure about that? I often find myself avoiding
row-by-row loops by constructing a vector of which rows fulfil a
condition, and then creating new vectors out of that vector. If you
elaborate on the problem, perhaps we could find a way to avoid the
loops altogether?

Mostly as a note to self, I wrote
http://code.cjb.net/vectors-instead-of-loop.html, it might be
understood by others too, but I'm not sure.
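
As a toy illustration of that note (the column names are made up), a condition
can be evaluated once for all rows and the resulting logical vector reused,
instead of being tested row by row inside a loop:

df   <- data.frame(id = 1:1e5, value = rnorm(1e5))
keep <- df$value > 0 & df$id %% 2 == 0   # which rows fulfil the condition
df_sub <- df[keep, ]                     # new data frame built from that vector
df$flag <- ifelse(keep, "yes", "no")     # or a new column derived from it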

-- 
Hans Ekbrand (http://sociologi.cjb.net) h...@sociologi.cjb.net

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up merge

2012-03-02 Thread Ben quant
I'm not sure. I'm still looking into it. It's pretty involved, so I asked
the simplest question first (the merge question).

I'll reply back with a mock-up/sample that is testable under a more
appropriate subject line. Probably this weekend.

Regards,

Ben


On Fri, Mar 2, 2012 at 4:37 AM, Hans Ekbrand h...@sociologi.cjb.net wrote:

 On Fri, Mar 02, 2012 at 03:24:20AM -0700, Ben quant wrote:
  Hello,
 
  I have a nasty loop that I have to do 11877 times.

 Are you completely sure about that? I often find my self avoiding
 loops-by-row by constructing vectors of which rows that fullfil a
 condition, and then creating new vectors out of that vector. If you
 elaborate on the problem, perhaps we could find a way to avoid the
 loops altogether?

 Mostly as a note to self, I wrote
 http://code.cjb.net/vectors-instead-of-loop.html, it might be
 understood by others too, but I'm not sure.

 --
 Hans Ekbrand (http://sociologi.cjb.net) h...@sociologi.cjb.net

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up merge

2012-03-02 Thread jim holtman
One way to speed up the merge is not to use merge.  You can use 'match' to
find the matching indices and then assemble the result manually.

Does this do what you want:

> ua <- read.table(text = '  AName  rt_date
+ 2007-03-31 14066.580078125 2007-04-01
+ 2007-06-30 14717   2007-04-03
+ 2007-09-30 15528   2007-10-25
+ 2007-12-31 17609   2008-04-06
+ 2008-03-31 17168   2008-04-24
+ 2008-06-30 17681   2008-04-09', header = TRUE, as.is = TRUE)

> dt <- c("2007-03-31", "2007-04-01", "2007-04-02", "2007-04-03", "2007-04-04",
+ "2007-04-05", "2007-04-06", "2007-04-07",
+ "2007-04-08", "2007-04-09")

> # find matching values in ua
> indx <- match(dt, ua$rt_date)

> # create new result matrix
> xx1 <- cbind(dt, ua[indx, ])
> rownames(xx1) <- NULL  # delete funny names
> xx1
           dt    AName    rt_date
1  2007-03-31       NA       <NA>
2  2007-04-01 14066.58 2007-04-01
3  2007-04-02       NA       <NA>
4  2007-04-03 14717.00 2007-04-03
5  2007-04-04       NA       <NA>
6  2007-04-05       NA       <NA>
7  2007-04-06       NA       <NA>
8  2007-04-07       NA       <NA>
9  2007-04-08       NA       <NA>
10 2007-04-09       NA       <NA>
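
A quick sanity check, using the toy ua and dt above, that the match() lookup
reproduces the looked-up column of the original merge() call; only the row
order and the data.frame conversion overhead differ:

via_match <- cbind(dt, ua[match(dt, ua$rt_date), ])
via_merge <- merge(data.frame(dt), ua, by.x = 1, by.y = "rt_date", all.x = TRUE)
all.equal(via_match$AName[order(via_match$dt)],
          via_merge$AName[order(via_merge$dt)])   # should be TRUE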







-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up merge

2012-03-02 Thread Ben quant
I'll have to give this a try this weekend. Thank you!

ben


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up this algorithm (apply-fuction / 4D array)

2011-10-06 Thread Claudia Beleites

here's another one - which is easier to generalize:

x <- array(rnorm(50 * 50 * 50 * 91, 0, 2), dim = c(50, 50, 50, 91))
y <- x[ , , , 1:90]  # decide yourself what to do with slice 91, but
                     # 91 is not divisible by 3
system.time({
  dim(y) <- c(50, 50, 50, 3, 90 %/% 3)
  y <- aperm(y, c(4, 1:3, 5))
  v2 <- colMeans(y)
})
   user  system elapsed
   0.32    0.08    0.40

(my computer is a bit slower than Bill's:)
 system.time(v1 <- f1(x))
   user  system elapsed
  0.360   0.030   0.396

Claudia
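
A possible generalization of the same reshape trick to any block length k (a
sketch of mine, assuming time is the 4th dimension; trailing time points that
do not fill a complete block are dropped):

reduce_time <- function(x, k = 3) {
  nt <- (dim(x)[4] %/% k) * k                 # usable number of time points
  y  <- x[ , , , seq_len(nt), drop = FALSE]
  dim(y) <- c(dim(x)[1:3], k, nt %/% k)       # split time into (k, nt/k)
  colMeans(aperm(y, c(4, 1:3, 5)))            # average over each block of k
}
v3 <- reduce_time(array(rnorm(20^3 * 10), dim = c(20, 20, 20, 10)), k = 3)
dim(v3)   # 20 20 20 3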





--
Claudia Beleites
Spectroscopy/Imaging
Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany

email: claudia.belei...@ipht-jena.de
phone: +49 3641 206-133
fax:   +49 2641 206-399

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed up this algorithm (apply-fuction / 4D array)

2011-10-05 Thread Martin Batholdy
Hi,


I have this sample-code (see below) and I was wondering whether it is possible 
to speed things up.



What this code does is the following:

x is 4D array (you can imagine it as x, y, z-coordinates and a time-coordinate).

So x contains 50x50x50 data-arrays for 91 time-points.

Now I want to reduce the 91 time points.
I want to merge three consecutive time points into one time point by calculating
the mean of these three time points for every x,y,z coordinate.

The reduce-sequence defines which time-points should get merged.
And the apply-function in the for-loop calculates the mean of the three 
3D-Arrays and puts them into a new 4D array (data_reduced).



The problem is that even in this example it takes really long.
I thought apply would already vectorize, rather than loop over every coordinate.

But for my actual data-set it takes a really long time … So I would be really 
grateful for any suggestions how to speed this up.




x <- array(rnorm(50 * 50 * 50 * 90, 0, 2), dim=c(50, 50, 50, 91))



data_reduced <- array(0, dim=c(50, 50, 50, 90/3))

reduce <- seq(1, 90, 3)



for( i in 1:length(reduce) ) {

    data_reduced[ , , , i] <- apply(x[ , , , reduce[i] : (reduce[i]+3) ], 1:3, mean)
}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up this algorithm (apply-fuction / 4D array)

2011-10-05 Thread William Dunlap
I corrected your code a bit and put it into a function, f0, to
make testing easier.  I also made a small dataset to make
testing easier.  Then I made a new function f1 which does
what f0 does in a vectorized manner:

  x <- array(rnorm(50 * 50 * 50 * 91, 0, 2), dim=c(50, 50, 50, 91))
  xsmall <- array(log(seq_len(2 * 2 * 2 * 91)), dim=c(2, 2, 2, 91))

  f0 <- function(x) {
      data_reduced <- array(0, dim=c(dim(x)[1:3], trunc(dim(x)[4]/3)))
      reduce <- seq(1, dim(x)[4]-1, by=3)
      for( i in 1:length(reduce) ) {
          data_reduced[ , , , i] <- apply(x[ , , , reduce[i] : (reduce[i]+2) ], 1:3, mean)
      }
      data_reduced
  }

  f1 <- function(x) {
      reduce <- seq(1, dim(x)[4]-1, by=3)
      data_reduced <- (x[, , , reduce] + x[, , , reduce+1] + x[, , , reduce+2]) / 3
      data_reduced
  }

The results were:

   system.time(v1 <- f1(x))
   user  system elapsed
  0.280   0.040   0.323
   system.time(v0 <- f0(x))
   user  system elapsed
 73.760   0.060  73.867
   all.equal(v0, v1)
  [1] TRUE

 I thought apply would already vectorize, rather than loop over every 
 coordinate.
No, you have that backwards.  Use *apply functions when you cannot figure
out how to vectorize.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed Advice for R --- avoid data frames

2011-07-06 Thread Frank Harrell
On occasion, as pointed out in an earlier posting, it is efficient to convert
to a matrix and when finished convert back to a data frame.  The Hmisc
package's asNumericMatrix and matrix2dataFrame functions assist by
converting character variables to factors if needed, and by holding on to
original attributes of variables in the data frame such as levels, then
restoring the attributes.
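
A short sketch of that round trip (this assumes the Hmisc interface in which
asNumericMatrix() records the columns' original attributes on the matrix and
matrix2dataFrame() restores them by default; see ?asNumericMatrix to confirm):

library(Hmisc)
d  <- data.frame(age = c(21, 35, 50), sex = factor(c("m", "f", "m")))
m  <- asNumericMatrix(d)         # factors are carried along as integer codes
m[, "age"] <- m[, "age"] / 10    # fast matrix-style work happens here
d2 <- matrix2dataFrame(m)        # levels and other attributes are put back
str(d2)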

Frank



Re: [R] Speed Advice for R --- avoid data frames

2011-07-03 Thread Uwe Ligges



On 02.07.2011 21:35, ivo welch wrote:

hi uwe---thanks for the clarification.  of course, my example should always
be done in vectorized form.  I only used it to show how iterative access
compares in the simplest possible fashion.  100 accesses per second is
REALLY slow, though.

I don't know R internals and the learning curve would be steep.  moreover,
there is no guarantee that changes I would make would be accepted.  so, I
cannot do this.

however, for an R expert, this should not be too difficult.  conceptually,
if data frame element access primitives are create/write/read/destroy in the
code, then it's truly trivial.  just add a matrix (dim the same as the data
frame) of byte pointers to point at the storage upon creation/change time.
  this would be quick-and-dirty.  for curiosity, do you know which source
file has the data frame internals?  maybe I will get tempted anyway if it is
simple enough.



I think you should start to look at the mechanisms to construct
data.frames (such as data.frame) and learn that data.frames are special
lists. Then you may want to look at the differences between the
.Primitive("[") and .Primitive("[<-") used for vectors (including
vectors with dim attributes such as matrices) and the corresponding
methods for data.frames: "[<-.data.frame" and "[.data.frame".


After that, I doubt you want to improve further on. Note also that
data.frames can be pretty large, and you really do not want to store a
matrix of pointers as large as the data.frame. People working with
large data.frames won't be happy with such a suggestion.


If you want to follow up, I'd suggest to move the thread to R-devel 
where it seems to be more appropriate.


Best,
Uwe









[R] Speed Advice for R --- avoid data frames

2011-07-02 Thread ivo welch
This email is intended for R users that are not that familiar with R
internals and are searching google about how to speed up R.

Despite common misperception, R is not slow when it comes to iterative
access.  R is fast when it comes to matrices.  R is very slow when it
comes to iterative access into data frames.  Such access occurs when a
user uses data$varname[index], which is a very common operation.  To
illustrate, run the following program:

R <- 1000; C <- 1000

example <- function(m) {
  cat("rows: ");    cat(system.time( for (r in 1:R) m[r,20] <- sqrt(abs(m[r,20])) + rnorm(1) ), "\n")
  cat("columns: "); cat(system.time( for (c in 1:C) m[20,c] <- sqrt(abs(m[20,c])) + rnorm(1) ), "\n")
  if (is.data.frame(m)) { cat("df: columns as names: ");
    cat(system.time( for (c in 1:C) m[[c]][20] <- sqrt(abs(m[[c]][20])) + rnorm(1) ), "\n") }
}

cat("\n Now as matrix\n")
example( matrix( rnorm(C*R), nrow=R ) )

cat("\n Now as data frame\n")
example( as.data.frame( matrix( rnorm(C*R), nrow=R ) ) )


The following are the reported timing under R 2.12.0 on a Mac Pro 3,1
with ample RAM:

matrix, columns: 0.01s
matrix, rows: 0.175s
data frame, columns: 53s
data frame, rows: 56s
data frame, names: 58s

Data frame access is about 5,000 times slower than matrix column
access, and 300 times slower than matrix row access.  R's data frame
operational speed is an amazing 40 data accesses per second.  I have
not seen access numbers this low for decades.


How to avoid it?  Not easy.  One way is to create multiple matrices,
and group them as an object.  of course, this loses a lot of features
of R.  Another way is to copy all data used in calculations out of the
data frame into a matrix, do the operations, and then copy them back.
not ideal, either.
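
A rough sketch of that copy-out/copy-back workaround (the column names are made
up for illustration; only the numeric columns go into the matrix):

d <- data.frame(price = rnorm(1e5), volume = rexp(1e5),
                id = sample(letters, 1e5, replace = TRUE))
m <- as.matrix(d[, c("price", "volume")])   # one copy out of the data frame
for (r in seq_len(nrow(m))) {               # iterative access is cheap on a matrix
  m[r, "price"] <- sqrt(abs(m[r, "price"])) + m[r, "volume"]
}
d[, c("price", "volume")] <- m              # one copy back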

In my opinion, this is an R design flaw.  Data frames are the
fundamental unit of much statistical analysis, and should be fast.  I
think R lacks any indexing into data frames.  Turning on indexing of
data frames should at least be an optional feature.


I hope this message post helps others.

/iaw


Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed Advice for R --- avoid data frames

2011-07-02 Thread Uwe Ligges

Some comments:

the comparison matrix rows vs. matrix columns is incorrect: Note that R 
has lazy evaluation, hence you construct your matrix in the timing for 
the rows and it is already constructed in the timing for the columns, 
hence you want to use:


 M <- matrix( rnorm(C*R), nrow=R )
 D <- as.data.frame(matrix( rnorm(C*R), nrow=R ) )
 example(M)
 example(D)

Further on, you are correct with your statement that data.frame indexing 
is much slower, but if you can store your data in matrix form, just go 
on as it is.


I doubt anybody is really going to make the index operation you cited 
within a loop. Then, with a data.frame, I can live with many vectorized 
replacements again:


 system.time(D[,20] <- sqrt(abs(D[,20])) + rnorm(1000))
   user  system elapsed
   0.01    0.00    0.01

 system.time(D[20,] <- sqrt(abs(D[20,])) + rnorm(1000))
   user  system elapsed
   0.51    0.00    0.52

OK, it would be nice to do that faster, but this is not easy. I think R 
Core is happy to see contributions to make it faster without breaking 
existing features.




Best wishes,
Uwe





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed Advice for R --- avoid data frames

2011-07-02 Thread ivo welch
hi uwe---thanks for the clarification.  of course, my example should always
be done in vectorized form.  I only used it to show how iterative access
compares in the simplest possible fashion.  100 accesses per second is
REALLY slow, though.

I don't know R internals and the learning curve would be steep.  moreover,
there is no guarantee that changes I would make would be accepted.  so, I
cannot do this.

however, for an R expert, this should not be too difficult.  conceptually,
if data frame element access primitives are create/write/read/destroy in the
code, then it's truly trivial.  just add a matrix (dim the same as the data
frame) of byte pointers to point at the storage upon creation/change time.
 this would be quick-and-dirty.  for curiosity, do you know which source
file has the data frame internals?  maybe I will get tempted anyway if it is
simple enough.

(a more efficient but more involved way to do this would be to store a data
frame internally always as a matrix of data pointers, but this would
probably require more surgery.)

It is also not as important for me, as it is for others...to give a good
impression to those that are not aware of the tradeoffs---which is most
people considering to adopt R.

/iaw



Ivo Welch (ivo.we...@gmail.com)





[R] Speed up an R code

2011-05-27 Thread Debs Majumdar
Hello,

  Are there some basic things one can do to speed up R code? I am new to R
and am currently in the following situation.

  I have run an R program on two different machines. I have R 2.12 installed on
both.

  Desktop 1 is slightly older and has a dual core processor with 4gigs of RAM. 
Desktop 2 is newer one and has a xeon processor W3505 with 12gigs of RAM. Both 
run on Windows 7.

  I don't really see any significant speed up in the newer computer (Desktop 
2). In the older one the program took around 5hrs 15 mins and in the newer one 
it took almost 4hrs 30mins.

 In the newer desktop, R gives me the following:

 memory.limit()
[1] 1024
 memory.size()
[1] 20.03

 Is something hampering me here? Do I need to increase the limit and size? Can
this change be made permanent? Or am I looking in the wrong place?

 I have never seen my R programs using much CPU or RAM when they run. If this is
not something inherent to R, then I guess I need to write more efficient code.

 Suggestions/solutions are welcome.

  Thanks,

  -Debs


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up an R code

2011-05-27 Thread Jonathan Daily
This is a very open ended question that depends very heavily on what
you are trying to do and how you are doing it. Often times, the
bottleneck operations that limit speed the most are not necessarily
sped up by adding RAM. They also often require special setup to run
multiple operations/iterations in parallel. Try some of the options at
the High Performance Computing task view for specifics.

http://cran.cnr.berkeley.edu/web/views/
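
As one concrete example of the parallel route (a sketch only: the 'parallel'
package ships with R 2.14 and later, so it postdates the R 2.12 mentioned
above, and makeCluster() also works on Windows):

library(parallel)
cl  <- makeCluster(detectCores() - 1)   # leave one core for the OS
res <- parLapply(cl, 1:100, function(i) {
  mean(rnorm(1e5))                      # placeholder for one independent iteration
})
stopCluster(cl)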

HTH,
Jon





-- 
===
Jon Daily
Technician
===
#!/usr/bin/env outside
# It's great, trust me.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up an R code

2011-05-27 Thread jim holtman
Take a small subset of your program that would run through the
critical sections and use ?Rprof to see where some of the hot spots
are.  How do you know it is not using the CPU?  Are you using perfmon
to look what is being used?  Are you paging?  If you are not paging,
and not doing a lot of I/O, then you should tie up one CPU 100% if you
are CPU bound.

You probably need to put some output in your program to mark its
progress.  At a minimum, do the following:

cat('I am here', proc.time(), '\n')

By changing the initial string, you can see where you are, and this
also reports the user CPU, system CPU and elapsed time.  This should
be a good indication of where time is being spent.  So there are a
number of things you can do to instrument your code.  If I had a
program that was running for hours, I would definitely have something
that tells me where I am and how much time is being taken.  If you
have some large loop, then you could put out this information every
'n'th time through.  The tag on the message would indicate this.
There is also the progress bar that I use a lot to see if I am
making progress.

After you have instrumented your code and have use Rprof, you might
have some data that people would help you with.

If you are using data frames a lot, remember that indexing into them can
be costly.  Converting them to matrices, if appropriate, can give a
big speedup.  Rprof will show you this.
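
A minimal profiling sketch with base R (the file name and the toy workload are
arbitrary stand-ins for a slice of the real job):

Rprof("slice.prof")                        # start collecting samples
res <- replicate(50, {
  d <- data.frame(x = rnorm(1e4), y = rnorm(1e4))
  coef(lm(y ~ x, data = d))
})
Rprof(NULL)                                # stop profiling
head(summaryRprof("slice.prof")$by.self)   # functions ranked by own time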





-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up code with for() loop

2011-05-01 Thread Uwe Ligges



On 29.04.2011 22:20, hck wrote:

Barth sent me a very good code and I modified it a bit. Have a look:

Error <- rnorm(1000, mean=0, sd=0.05)
estimate <- (log(1+0.10)+Error)

DCF_korrigiert <- (1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2))*sqrt(2*pi/(0.05^2
))*(1-pnorm(0,((-estimate)/(0.05^2)),sqrt(1/(0.05^2))-1))
DCF_verzerrt <- (1/(exp(estimate)-1))

S <- 1000   # total sample size
D <- 1  # number of subsamples
Subset <- 1  # number in each subsample
Select <- matrix(sample(S,D*Subset,replace=TRUE),nrow=Subset,ncol=D)

DCF_korrigiert_select <- matrix(DCF_korrigiert[Select],nrow=Subset,ncol=D)
Delta_ln <- (log(colMeans(DCF_korrigiert_select, na.rm=T)/(1/0.10)))



The only problem I discovered is that R cannot handle more than
2,147,483,647 integers, so the number of cells in the matrix is bounded by
this condition. (R shows the maximum by typing .Machine$integer.max.) And if
you want to save the workspace, the file for 10,000 times 10,000 becomes
around 2 GB, compared to the original of just 300 MB.

So I cannot perform my previous bootstrap with 1,000,000 times 100,000. But
nevertheless 10,000 times 10,000 seems to be sufficient; I have to say it's
amazing how fast the idea works.

Does anybody have a suggestion how to make it work for the 1,000,000 times
100,000 bootstrap???


Run it in several blocks of matrices with appropriate dimensions? This 
allows easy parallelization as well.
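
A rough sketch of that blockwise idea, reusing the objects from the code
above (the block sizes below are only placeholders):

S       <- length(DCF_korrigiert)   # total sample size
Subset  <- 1e5                      # draws per bootstrap replicate
D.total <- 1e6                      # replicates wanted in total
D.block <- 100                      # replicates per block, keeps each matrix small

Delta_ln <- numeric(D.total)
for (b in seq_len(D.total / D.block)) {
  Select <- matrix(sample(S, D.block * Subset, replace = TRUE),
                   nrow = Subset, ncol = D.block)
  M   <- matrix(DCF_korrigiert[Select], nrow = Subset, ncol = D.block)
  idx <- ((b - 1) * D.block + 1):(b * D.block)
  Delta_ln[idx] <- log(colMeans(M, na.rm = TRUE) / (1 / 0.10))
}

No single matrix then comes near the .Machine$integer.max cell limit, and
each block could be handed to a separate worker.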


Uwe Ligges








--
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-code-with-for-loop-tp3481680p3484548.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-29 Thread jim holtman
If you are plotting that many data points, you  might want to look at
'hexbin' as a way of aggregating the values to a different
presentation.  It is especially nice if you are doing a scatter plot
with a lot of data points and trying to make sense out of it.
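
A hypothetical illustration of the hexbin idea (package 'hexbin' on CRAN;
x and y are simulated stand-ins for two of the logged channels):

library(hexbin)
x <- rnorm(1e6)
y <- x + rnorm(1e6)
plot(hexbin(x, y, xbins = 80))   # one shaded hexagon per cell instead of 10^6 points

Because the binning is done once, redrawing the binned object is far cheaper
than replotting every raw point.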

On Wed, Apr 27, 2011 at 5:16 AM, Jonathan Gabris jonat...@k-m-p.nl wrote:
 Hello,

 I am working on a project analysing the performance of motor-vehicles
 through messages logged over a CAN bus.

 I am using R 2.12 on Windows XP and 7

 I am currently plotting the data in R, overlaying 5 or more plots of data,
 logged at 1kHz, (using plot.ts() and par(new = TRUE)).
 The aim is to be able to pan, zoom in and out and get values from the
 plotted graph using a custom Qt interface that is used as a front end to
 R.exe (all this works).
 The plot is drawn by R directly to the windows graphic device.

 The data is imported from a .csv file (typically around 100MB) to a matrix.
 (timestamp, message ID, byte0, byte1, ..., byte7)
 I then separate this matrix into several by message ID (dimensions are in
 the order of 8cols, 10^6 rows)

 The panning is done by redrawing the plots, shifted by a small amount. So as
 to view a window of data from a second to a minute long that can travel the
 length of the logged data.

 My problem is that, the redrawing of the plots whilst panning is too slow
 when dealing with this much data.
 i.e.: I can see the last graphs being drawn to the screen in the half-second
 following the view change.
 I need a fluid change from one view to the next.

 My question is this:
 Are there ways to speed up the plotting on the MSWindows display?
 By reducing plotted point densities to *sensible* values?
 Using something other than plot.ts() - is the lattice package faster?
 I don't need publication quality plots, they can be rougher...

 I have tried:
 -Using matrices instead of dataframes - (works for calculations but not
 enough for plots)
 -increasing the max usable memory (max-mem-size) - (no change)
 -increasing the size of the pointer protection stack (max-ppsize) - (no
 change)
 -deleting the unnecessary leftover matrices - (no change)
 -I can't use lines() instead of plot() because of the very  different scales
 (rpm-1, flags -1to3)

 I am going to do some resampling of the logged data to reduce the vector
 sizes.
 (removal of *less* important data and use of window.ts())

 But I am currently running out of ideas...
 So if somebody could point out something, I would be grateful.

 Thanks,

 Jonathan Gabris

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up code with for() loop

2011-04-29 Thread hck
Barth sent me a very good code and I modified it a bit. Have a look:

Error <- rnorm(1000, mean=0, sd=0.05)
estimate <- (log(1+0.10)+Error)

DCF_korrigiert <- (1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2))*sqrt(2*pi/(0.05^2
))*(1-pnorm(0,((-estimate)/(0.05^2)),sqrt(1/(0.05^2))-1))
DCF_verzerrt <- (1/(exp(estimate)-1))

S <- 1000   # total sample size
D <- 1  # number of subsamples
Subset <- 1  # number in each subsample
Select <- matrix(sample(S,D*Subset,replace=TRUE),nrow=Subset,ncol=D)

DCF_korrigiert_select <- matrix(DCF_korrigiert[Select],nrow=Subset,ncol=D)
Delta_ln <- (log(colMeans(DCF_korrigiert_select, na.rm=T)/(1/0.10)))



The only problem I discovered is that R cannot handle more than
2,147,483,647 integers, so the number of cells in the matrix is bounded by
this condition. (R shows the maximum by typing .Machine$integer.max.) And if
you want to save the workspace, the file for 10,000 times 10,000 becomes
around 2 GB, compared to the original of just 300 MB.

So I cannot perform my previous bootstrap with 1,000,000 times 100,000. But
nevertheless 10,000 times 10,000 seems to be sufficient; I have to say it's
amazing how fast the idea works.

Does anybody have a suggestion how to make it work for the 1,000,000 times
100,000 bootstrap???


--
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-code-with-for-loop-tp3481680p3484548.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speed up code with for() loop

2011-04-28 Thread hck
Hallo everybody,

I'm wondering whether it might be possible to speed up the following code:

Error <- rnorm(1000, mean=0, sd=0.05)

estimate <- (log(1.1)-Error)

DCF_korrigiert <- (1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2))*sqrt(2*pi/(0.05^2))*(1-pnorm(0,((-estimate)/(0.05^2)),sqrt(1/(0.05^2))-1))
D <- 10
Delta_ln <- rep(0,D)
for(i in 1:D)
Delta_ln[i] <- (log(mean(sample(DCF_korrigiert,100,replace=TRUE))/(1/0.10)))

The calculation of the for-loop takes several hours even on a very quick
machine (4 GHz, 8 GB RAM, Windows 2008 Server 64-bit). Does anybody have an idea how
to improve the for loop?

Thanks for helping me.
Hans

--
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-code-with-for-loop-tp3481680p3481680.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up code with for() loop

2011-04-28 Thread Jeremy Hetzel
Hans,

You could parallelize it with the multicore package.  The only other thing I 
can think of is to use calls to .Internal().  But be vigilant, as this might 
not be good advice.  ?.Internal warns that only true R wizards should even 
consider using the function.  First, an example with .Internal() calls, 
later multicore.  For me, the following reduces elapsed time by about 9% on 
Windows 7 and by about 20% on today's new Ubuntu Natty.

## Set number of replicates
n <- 1

## Your example
set.seed(1)
time.one <- Sys.time()
Error <- rnorm(n, mean=0, sd=0.05)
estimate <- (log(1.1)-Error)
DCF_korrigiert <- (1/(exp(1/(exp(0.5*(-estimate)^2/(0.05^2))*sqrt(2*pi/(0.05^2))*(1-pnorm(0,((-estimate)/(0.05^2)),sqrt(1/(0.05^2))-1))
D <- n
Delta_ln <- rep(0,D)
for(i in 1:D)
Delta_ln[i] <- (log(mean(sample(DCF_korrigiert,D,replace=TRUE))/(1/0.10)))
time.one <- Sys.time() - time.one

## A few modifications with .Internal()
set.seed(1)
time.two <- Sys.time()
Error <- rnorm(n, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- (1 / (exp(1 / (exp(0.5 * (-estimate)^2 / (0.05^2)) * sqrt(
2* pi / (0.05^2)) * (1 - pnorm(0,((-estimate) / (0.05^2)), sqrt(1 / 
(0.05^2))-1))
D <- n
Delta_ln2 <- numeric(length = D)
Delta_ln2 <- vapply(Delta_ln2, function(x)
{
log(.Internal(mean(DCF_korrigiert[.Internal(
sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1)
time.two <- Sys.time() - time.two


## Compare
all.equal(Delta_ln, Delta_ln2)
time.one
time.two
as.numeric(time.two) / as.numeric(time.one)






Then you could parallelize it with multicore's parallel() function:

## Try multicore
require(multicore)
set.seed(1)
time.three <- Sys.time()
Error <- rnorm(n, mean = 0, sd = 0.05)
estimate <- (log(1.1) - Error)
DCF_korrigiert <- (1 / (exp(1 / (exp(0.5 * (-estimate)^2 / (0.05^2)) * sqrt(
2* pi / (0.05^2)) * (1 - pnorm(0,((-estimate) / (0.05^2)), sqrt(1 / 
(0.05^2))-1))
D <- n/2
Delta_ln3 <- numeric(length = D)
Delta_ln3.1 <- parallel(vapply(Delta_ln3, function(x)
{
log(.Internal(mean(DCF_korrigiert[.Internal(
sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1), mc.set.seed = T)
Delta_ln3.2 <- parallel(vapply(Delta_ln3, function(x)
{
log(.Internal(mean(DCF_korrigiert[.Internal(
sample(D, D, replace = T, prob = NULL))])) / (1 / 0.10))
}, FUN.VALUE = 1), mc.set.seed = T)
results <- collect(list(Delta_ln3.1, Delta_ln3.2))
names(results) <- NULL
Delta_ln3 <- do.call(append, results)
time.three <- Sys.time() - time.three


## Compare
# Results won't be equal due to the different way 
# parallel() handles set.seed() randomization
all.equal(Delta_ln, Delta_ln3)
time.one
time.two
time.three
as.numeric(time.three) / as.numeric(time.one)



Combining parallel() with the .Internal calls reduces the elapsed time by 
about 70% on Ubuntu Natty.  Multicore is not available for Windows, or at 
least not easily available for Windows.  

But maybe the true R wizards have better ideas.


Jeremy


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Jonathan Gabris

Hello,

I am working on a project analysing the performance of motor-vehicles 
through messages logged over a CAN bus.


I am using R 2.12 on Windows XP and 7

I am currently plotting the data in R, overlaying 5 or more plots of 
data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
The aim is to be able to pan, zoom in and out and get values from the 
plotted graph using a custom Qt interface that is used as a front end to 
R.exe (all this works).

The plot is drawn by R directly to the windows graphic device.

The data is imported from a .csv file (typically around 100MB) to a matrix.
(timestamp, message ID, byte0, byte1, ..., byte7)
I then separate this matrix into several by message ID (dimensions are 
in the order of 8cols, 10^6 rows)


The panning is done by redrawing the plots, shifted by a small amount. 
So as to view a window of data from a second to a minute long that can 
travel the length of the logged data.


My problem is that, the redrawing of the plots whilst panning is too 
slow when dealing with this much data.
i.e.: I can see the last graphs being drawn to the screen in the 
half-second following the view change.

I need a fluid change from one view to the next.

My question is this:
Are there ways to speed up the plotting on the MSWindows display?
By reducing plotted point densities to *sensible* values?
Using something other than plot.ts() - is the lattice package faster?
I don't need publication quality plots, they can be rougher...

I have tried:
-Using matrices instead of dataframes - (works for calculations but not 
enough for plots)

-increasing the max usable memory (max-mem-size) - (no change)
-increasing the size of the pointer protection stack (max-ppsize) - (no 
change)

-deleting the unnecessary leftover matrices - (no change)
-I can't use lines() instead of plot() because of the very  different 
scales (rpm-1, flags -1to3)


I am going to do some resampling of the logged data to reduce the vector 
sizes.

(removal of *less* important data and use of window.ts())

But I am currently running out of ideas...
So if somebody could point out something, I would be grateful.

Thanks,

Jonathan Gabris

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Duncan Murdoch

Jonathan Gabris wrote:

Hello,

I am working on a project analysing the performance of motor-vehicles 
through messages logged over a CAN bus.


I am using R 2.12 on Windows XP and 7

I am currently plotting the data in R, overlaying 5 or more plots of 
data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
The aim is to be able to pan, zoom in and out and get values from the 
plotted graph using a custom Qt interface that is used as a front end to 
R.exe (all this works).

The plot is drawn by R directly to the windows graphic device.

The data is imported from a .csv file (typically around 100MB) to a matrix.
(timestamp, message ID, byte0, byte1, ..., byte7)
I then separate this matrix into several by message ID (dimensions are 
in the order of 8cols, 10^6 rows)


The panning is done by redrawing the plots, shifted by a small amount. 
So as to view a window of data from a second to a minute long that can 
travel the length of the logged data.


My problem is that, the redrawing of the plots whilst panning is too 
slow when dealing with this much data.
i.e.: I can see the last graphs being drawn to the screen in the 
half-second following the view change.

I need a fluid change from one view to the next.

My question is this:
Are there ways to speed up the plotting on the MSWindows display?
By reducing plotted point densities to *sensible* values?
Using something other than plot.ts() - is the lattice package faster?
I don't need publication quality plots, they can be rougher...


I don't think there are any ways to plot in the standard device that are 
significantly faster than what you are doing if you want to see the 
updates.  (I think it would be substantially faster if you hid the 
graphics window during the updates, but that won't suit you.)


I'd suggest plotting a subset of the data during the updates, then plot 
the full dataset when it stops moving.  For example, only plot a few 
hundred points, evenly spaced through the time series.
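
A rough sketch of that idea (the names are invented; ts.mat stands for one
of the per-message-ID matrices, with the timestamp in column 1):

plot_window <- function(ts.mat, a, b, value.col = 3, max.pts = 500) {
  in.view <- ts.mat[ts.mat[, 1] >= a & ts.mat[, 1] <= b, , drop = FALSE]
  if (nrow(in.view) > max.pts) {     # thin to roughly max.pts points
    keep <- unique(round(seq(1, nrow(in.view), length.out = max.pts)))
    in.view <- in.view[keep, , drop = FALSE]
  }
  plot(in.view[, 1], in.view[, value.col], type = "l",
       xlab = "time", ylab = colnames(ts.mat)[value.col])
}

Call it with a small max.pts while the user is panning, then once more with
max.pts = Inf (no thinning) when the view comes to rest.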


Duncan Murdoch



I have tried:
-Using matrices instead of dataframes - (works for calculations but not 
enough for plots)

-increasing the max usable memory (max-mem-size) - (no change)
-increasing the size of the pointer protection stack (max-ppsize) - (no 
change)

-deleting the unnecessary leftover matrices - (no change)
-I can't use lines() instead of plot() because of the very  different 
scales (rpm-1, flags -1to3)


I am going to do some resampling of the logged data to reduce the vector 
sizes.

(removal of *less* important data and use of window.ts())

But I am currently running out of ideas...
So if somebody could point out something, I would be grateful.

Thanks,

Jonathan Gabris

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Mike Marchywka



 Date: Wed, 27 Apr 2011 11:16:26 +0200
 From: jonat...@k-m-p.nl
 To: r-help@r-project.org
 Subject: [R] Speed up plotting to MSWindows graphics window

 Hello,

 I am working on a project analysing the performance of motor-vehicles
 through messages logged over a CAN bus.

 I am using R 2.12 on Windows XP and 7

 I am currently plotting the data in R, overlaying 5 or more plots of
 data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
 The aim is to be able to pan, zoom in and out and get values from the
 plotted graph using a custom Qt interface that is used as a front end to
 R.exe (all this works).
 The plot is drawn by R directly to the windows graphic device.

 The data is imported from a .csv file (typically around 100MB) to a matrix.
 (timestamp, message ID, byte0, byte1, ..., byte7)
 I then separate this matrix into several by message ID (dimensions are
 in the order of 8cols, 10^6 rows)

 The panning is done by redrawing the plots, shifted by a small amount.
 So as to view a window of data from a second to a minute long that can
 travel the length of the logged data.

 My problem is that, the redrawing of the plots whilst panning is too
 slow when dealing with this much data.
 i.e.: I can see the last graphs being drawn to the screen in the
 half-second following the view change.
 I need a fluid change from one view to the next.

 My question is this:
 Are there ways to speed up the plotting on the MSWindows display?
 By reducing plotted point densities to *sensible* values?

Well, hard to know but it would help to know where all the time is going.
Usually people start complaining when VM thrashing is common but if you are
CPU limited you could try restricting the range of data you want to plot
rather than relying on the plot to just clip the largely irrelevant points
when you are zoomed in. It should not be too expensive to find the
limits either incrementally or with binary search on ordered time series. 
Presumably subsetting is fast using  foo[a:b,] 

One thing you may want to try for change of scale is wavelet  or
multi-resolution analysis. You can make a tree ( increasing memory usage
but even VM here may not be a big penalty if coherence is high ) and
display the resolution appropriate for the current scale. 




 Using something other than plot.ts() - is the lattice package faster?
 I don't need publication quality plots, they can be rougher...

 I have tried:
 -Using matrices instead of dataframes - (works for calculations but not
 enough for plots)
 -increasing the max usable memory (max-mem-size) - (no change)
 -increasing the size of the pointer protection stack (max-ppsize) - (no
 change)
 -deleting the unnecessary leftover matrices - (no change)
 -I can't use lines() instead of plot() because of the very different
 scales (rpm-1, flags -1to3)

 I am going to do some resampling of the logged data to reduce the vector
 sizes.
 (removal of *less* important data and use of window.ts())

 But I am currently running out of ideas...
 So if somebody could point out something, I would be grateful.

 Thanks,

 Jonathan Gabris

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Uwe Ligges



On 27.04.2011 12:56, Duncan Murdoch wrote:

Jonathan Gabris wrote:

Hello,

I am working on a project analysing the performance of motor-vehicles
through messages logged over a CAN bus.

I am using R 2.12 on Windows XP and 7

I am currently plotting the data in R, overlaying 5 or more plots of
data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
The aim is to be able to pan, zoom in and out and get values from the
plotted graph using a custom Qt interface that is used as a front end
to R.exe (all this works).
The plot is drawn by R directly to the windows graphic device.

The data is imported from a .csv file (typically around 100MB) to a
matrix.
(timestamp, message ID, byte0, byte1, ..., byte7)
I then separate this matrix into several by message ID (dimensions are
in the order of 8cols, 10^6 rows)

The panning is done by redrawing the plots, shifted by a small amount.
So as to view a window of data from a second to a minute long that can
travel the length of the logged data.

My problem is that, the redrawing of the plots whilst panning is too
slow when dealing with this much data.
i.e.: I can see the last graphs being drawn to the screen in the
half-second following the view change.
I need a fluid change from one view to the next.

My question is this:
Are there ways to speed up the plotting on the MSWindows display?
By reducing plotted point densities to *sensible* values?
Using something other than plot.ts() - is the lattice package faster?
I don't need publication quality plots, they can be rougher...


I don't think there are any ways to plot in the standard device that are
significantly faster than what you are doing if you want to see the
updates. (I think it would be substantially faster if you hid the
graphics window during the updates, but that won't suit you.)

I'd suggest plotting a subset of the data during the updates, then plot
the full dataset when it stops moving. For example, only plot a few
hundred points, evenly spaced through the time series.



... and it highly depends on the data what can be improved. Example: For 
signals essential consisting of sine functions (i.e. harmonic signals), 
I am using a little dirty trick in the tuneR package, but that makes the 
assumption of having a high frequency sample of a harmonic signal 
without too much noise.


Uwe Ligges


Duncan Murdoch



I have tried:
-Using matrices instead of dataframes - (works for calculations but
not enough for plots)
-increasing the max usable memory (max-mem-size) - (no change)
-increasing the size of the pointer protection stack (max-ppsize) -
(no change)
-deleting the unnecessary leftover matrices - (no change)
-I can't use lines() instead of plot() because of the very different
scales (rpm-1, flags -1to3)

I am going to do some resampling of the logged data to reduce the
vector sizes.
(removal of *less* important data and use of window.ts())

But I am currently running out of ideas...
So if somebody could point out something, I would be grateful.

Thanks,

Jonathan Gabris

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Jonathan Gabris

On 27/04/2011 13:18, Mike Marchywka wrote:

   Date: Wed, 27 Apr 2011 11:16:26 +0200
   From:jonat...@k-m-p.nl
   To:r-help@r-project.org
   Subject: [R] Speed up plotting to MSWindows graphics window
 
   Hello,
 
   I am working on a project analysing the performance of motor-vehicles
   through messages logged over a CAN bus.
 
   I am using R 2.12 on Windows XP and 7
 
   I am currently plotting the data in R, overlaying 5 or more plots of
   data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
   The aim is to be able to pan, zoom in and out and get values from the
   plotted graph using a custom Qt interface that is used as a front end to
   R.exe (all this works).
   The plot is drawn by R directly to the windows graphic device.
 
   The data is imported from a .csv file (typically around 100MB) to a 
  matrix.
   (timestamp, message ID, byte0, byte1, ..., byte7)
   I then separate this matrix into several by message ID (dimensions are
   in the order of 8cols, 10^6 rows)
 
   The panning is done by redrawing the plots, shifted by a small amount.
   So as to view a window of data from a second to a minute long that can
   travel the length of the logged data.
 
   My problem is that, the redrawing of the plots whilst panning is too
   slow when dealing with this much data.
   i.e.: I can see the last graphs being drawn to the screen in the
   half-second following the view change.
   I need a fluid change from one view to the next.
 
   My question is this:
   Are there ways to speed up the plotting on the MSWindows display?
   By reducing plotted point densities to*sensible*  values?
 Well, hard to know but it would help to know where all the time is going.
 Usually people start complaining when VM thrashing is common but if you are
 CPU limited you could try restricting the range of data you want to plot
 rather than relying on the plot to just clip the largely irrelevant points
 when you are zoomed in. It should not be too expensive to find the
 limits either incrementally or with binary search on ordered time series.
 Presumably subsetting is fast using  foo[a:b,]

 One thing you may want to try for change of scale is wavelet  or
 multi-resolution analysis. You can make a tree ( increasing memory usage
 but even VM here may not be a big penalty if coherence is high ) and
 display the resolution appropriate for the current scale.


I forgot to add, for plotting I use a command similar to:

 plot.ts(timestampVector, dataVector, xlim=c(a,b))

a and b are timestamps from timestampVector

Is the xlim parameter sufficient for limiting the scope of the plots?
Or should I subset the timeseries each time I do a plot?


The multi-resolution analysis looks interesting.
I shall spend some time finding out how to use the wavelets package.

Cheers!


   Using something other than plot.ts() - is the lattice package faster?
   I don't need publication quality plots, they can be rougher...
 
   I have tried:
   -Using matrices instead of dataframes - (works for calculations but not
   enough for plots)
   -increasing the max usable memory (max-mem-size) - (no change)
   -increasing the size of the pointer protection stack (max-ppsize) - (no
   change)
   -deleting the unnecessary leftover matrices - (no change)
   -I can't use lines() instead of plot() because of the very different
   scales (rpm-1, flags -1to3)
 
   I am going to do some resampling of the logged data to reduce the vector
   sizes.
   (removal of*less*  important data and use of window.ts())
 
   But I am currently running out of ideas...
   So if somebody could point out something, I would be grateful.
 
   Thanks,
 
   Jonathan Gabris
 
   __
   R-help@r-project.org  mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting 
  guidehttp://www.R-project.org/posting-guide.html
   and provide commented, minimal, self-contained, reproducible code.
   

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up plotting to MSWindows graphics window

2011-04-27 Thread Mike Marchywka














 Date: Wed, 27 Apr 2011 14:40:23 +0200
 From: jonat...@k-m-p.nl
 To: r-help@r-project.org
 Subject: Re: [R] Speed up plotting to MSWindows graphics window


 On 27/04/2011 13:18, Mike Marchywka wrote:
 
   Date: Wed, 27 Apr 2011 11:16:26 +0200
   From:jonat...@k-m-p.nl
   To:r-help@r-project.org
   Subject: [R] Speed up plotting to MSWindows graphics window
  
   Hello,
  
   I am working on a project analysing the performance of motor-vehicles
   through messages logged over a CAN bus.
  

  
   I am currently plotting the data in R, overlaying 5 or more plots of
   data, logged at 1kHz, (using plot.ts() and par(new = TRUE)).
   The aim is to be able to pan, zoom in and out and get values from the
   plotted graph using a custom Qt interface that is used as a front end to
   R.exe (all this works).
   The plot is drawn by R directly to the windows graphic device.
  
   The data is imported from a .csv file (typically around 100MB) to a 
   matrix.
   (timestamp, message ID, byte0, byte1, ..., byte7)
   I then separate this matrix into several by message ID (dimensions are
   in the order of 8cols, 10^6 rows)
  
   The panning is done by redrawing the plots, shifted by a small amount.
   So as to view a window of data from a second to a minute long that can
   travel the length of the logged data.
  
   My problem is that, the redrawing of the plots whilst panning is too
   slow when dealing with this much data.
   i.e.: I can see the last graphs being drawn to the screen in the
   half-second following the view change.
   I need a fluid change from one view to the next.
  
   My question is this:
   Are there ways to speed up the plotting on the MSWindows display?
   By reducing plotted point densities to*sensible* values?
  Well, hard to know but it would help to know where all the time is going.
  Usually people start complaining when VM thrashing is common but if you are
  CPU limited you could try restricting the range of data you want to plot
  rather than relying on the plot to just clip the largely irrelevant points
  when you are zoomed in. It should not be too expensive to find the
  limits either incrementally or with binary search on ordered time series.
  Presumably subsetting is fast using foo[a:b,]
 
  One thing you may want to try for change of scale is wavelet or
  multi-resolution analysis. You can make a tree ( increasing memory usage
  but even VM here may not be a big penalty if coherence is high ) and
  display the resolution appropriate for the current scale.
 
 
 I forgot to add, for plotting I use a command similar to:

 plot.ts(timestampVector, dataVector, xlim=c(a,b))

 a and b are timestamps from timestampVector

 Is the xlim parameter sufficient for limiting the scope of the plots?
 Or should I subset the timeseries each time I do a plot?

Well, maybe the time-series class knows the data to be ordered -- I never use
it -- but in general plot() has to check each point and clip the out-of-range
ones. It could, I suppose, binary-search for the start/end points,
but I don't know. Based on what you said it sounds like it does not.
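
A small sketch of the subset-first approach (assuming timestampVector is
sorted, as a CAN log normally is; a and b are the window limits as before):

idx  <- findInterval(c(a, b), timestampVector)  # binary search for both ends
keep <- seq(max(idx[1], 1L), max(idx[2], 1L))   # rows inside [a, b], plus at most one point before a
plot.ts(timestampVector[keep], dataVector[keep], xlim = c(a, b))

plot.ts() then only ever sees the points it will actually draw.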




 The multi-resolution analysis looks interesting.
 I shall spend some time finding out how to use the wavelets package.

 Cheers!

 
   Using something other than plot.ts() - is the lattice package faster?
   I don't need publication quality plots, they can be rougher...
  
   I have tried:
   -Using matrices instead of dataframes - (works for calculations but not
   enough for plots)
   -increasing the max usable memory (max-mem-size) - (no change)
   -increasing the size of the pointer protection stack (max-ppsize) - (no
   change)
   -deleting the unnecessary leftover matrices - (no change)
   -I can't use lines() instead of plot() because of the very different
   scales (rpm-1, flags -1to3)
  
   I am going to do some resampling of the logged data to reduce the vector
   sizes.
   (removal of*less* important data and use of window.ts())
  
   But I am currently running out of ideas...
   So if somebody could point out something, I would be grateful.
  
   Thanks,
  
   Jonathan Gabris
  
   __
   R-help@r-project.org mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting 
   guidehttp://www.R-project.org/posting-guide.html
   and provide commented, minimal, self-contained, reproducible code.
 

 [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
  
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting

Re: [R] Speed up sum of outer products?

2011-03-16 Thread AjayT
Hi Stefan,

that's really interesting - I never thought of trying to benchmark Linux-64
against OSX (a friend who works on large databases says OSX performs better
than Linux in his work!). Thanks for posting your comparison, and your hints
:)

i) I guess you have a very fast CPU (Core i7 or so, I guess?),  - only quad
core i5 but I'm trying to get access to a quad core i7, might make a
difference for openCL code?

ii) a very poor BLAS implementation - I installed the latest ATLAS package
for Ubuntu 10.04 LTS, which gives a 6x speed-up?? I'm tempted and interested
in recompiling R-2.12.2 linked to the MKL (which I guess the vecLib BLAS
library uses?), but it seems a tricky thing to do?? To be honest I'm not
sure how this new ATLAS library works, i.e. is it sequential or
multithreaded?

iii) and a desktop graphics card - installed a GTX570 today which has 480
cuda cores, my previous card had 16 cores and half the bandwidth

The results of a setup with the new ATLAS library and GTX570 are a pleasant
improvement :).

  user  system elapsed--  for loop, single thread
 29.790   7.400  37.243
   user  system elapsed   -- new ATLAS, t(X)%*%X
  1.480   0.000   1.479
   user  system elapsed   -- new ATLAS, crossprod(X)
  0.740   0.000   0.739
   user  system elapsed   -- new GPU, gputools::crossprod(X)
*  0.190   0.040   0.228*

I would be really interested to find out what the results would be on a OSX
machine with a fancy GPU. I read that a 2x512 core card is going to be
released by Nvidia in the next couple of weeks, and CUDA 4.0 is due for
public release in a few months. So may be you want to keep CUDA on your
radar?

I managed to write my first R function/package using CUDA code at the
weekend. It's a fairly simple but tedious process once you have some CUDA
code which compiles and all you want to do is port it to R (in the Unix
case at least). For example you can write a simple C wrapper along the lines
of the rinterface.c code in gputools. Then modify the Makefile.in and
configure.ac files in this package as required, and you should be set to
configure, make and install into R.

I'm working on non-parametric regression, and optimization at the moment and
the speed up using CUDA has been worth the effort :)

All the best,

Ajay






On 15 March 2011 11:22, Stefan Evert-3 [via R] 
ml-node+3356302-1299160144-215...@n4.nabble.com wrote:

 Hi Ajay,

 thanks for this comparison, which prodded me to give CUDA another try on my
 now somewhat aging MacBook Pro.

  Hi Dennis, sorry for the delayed reply and thanks for the article. I
 dug
  into it and found that if you have a GPU, the CUBLAS library beats the
  BLAS/ATLAS implementation in the Matrix package for 'large' problems.

 I guess you have a very fast CPU (Core i7 or so, I guess?), a very poor
 BLAS implementation and a desktop graphics card?

user  system elapsed-- for loop, single thread
  27.210   6.680  33.342
user  system elapsed-- BLAS mat mult
   6.260   0.000   5.982
user  system elapsed-- BLAS crossprod
   4.340   0.000   4.284
user  system elapsed-- CUDA gpuCrossprod
1.490.001.48

 Just to put these numbers in perspective, here are my results for a MacBook
 Pro running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia
 GeForce 8600M GT with 512 MB RAM -- I suppose it's the M that breaks my
 performance here).

 user  system elapsed-- for loop, single thread
  141.034  35.299 153.783
 user  system elapsed-- BLAS mat mult
2.791   0.025   1.805
 user  system elapsed-- BLAS crossprod
1.419   0.039   0.863
 user  system elapsed-- CUDA gpuCrossprod
1.431   0.119   1.718


 As you can see, my CPU/RAM is about 5x slower than your machine, CUDA is
 slightly slower (my card has 32 cores, but may have lower memory bandwidth
 and/or clock rate if yours is a desktop card), but vecLib BLAS beats CUDA by
 a factor of 2.


 Kudos to the gputools developers: despite what the README says, the package
 compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release
 3.2.  Thanks for this convenient package!


 Best regards,
 Stefan Evert

 [ [hidden email] | http://purl.org/stefan.evert ]

 __
 [hidden email] mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 --
  If you reply to this email, your message will be added to the discussion
 below:

 http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3356302.html
  To unsubscribe from Speed up sum of outer products?, click 
 

Re: [R] Speed up sum of outer products?

2011-03-15 Thread AjayT
Hi Dennis, sorry for the delayed reply and thanks for the article. I dug
into it and found that if you have a GPU, the CUBLAS library beats the
BLAS/ATLAS implementation in the Matrix package for 'large' problems. Here's
what I mean,

its = 2500
dim = 1750

X = matrix(rnorm(its*dim),its, dim) 

system.time({C=matrix(0, dim, dim);for(i in 1:its)C = C + (X[i,] %o%
X[i,])}) # single thread breakup calculation 
system.time({C1 = t(X) %*% X})   
# single thread - BLAS matrix mult
system.time({C2 = crossprod(X)})  
# single thread - BLAS matrix mult
library(gputools)
system.time({C3 = gpuCrossprod(X, X)}) 
# multithread - CUBLAS cublasSgemm function
print(all.equal(C,C1,C2,C3))
   user  system elapsed 
 27.210   6.680  33.342 
   user  system elapsed 
  6.260   0.000   5.982 
   user  system elapsed 
  4.340   0.000   4.284 
   user  system elapsed 
   1.490.001.48 
[1] TRUE

The last line shows a x3 speed up, using my dated graphics card which has 16
cores, compared to my cpu which is a quad core. I should be able to try this
out on a 512 core card in the next few days, and will post the result.

All the best,

Aj 

--
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3355139.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up sum of outer products?

2011-03-15 Thread Stefan Evert
Hi Ajay,

thanks for this comparison, which prodded me to give CUDA another try on my now 
somewhat aging MacBook Pro.

 Hi Dennis, sorry for the delayed reply and thanks for the article. I dug
 into it and found that if you have a GPU, the CUBLAS library beats the
 BLAS/ATLAS implementation in the Matrix package for 'large' problems.

I guess you have a very fast CPU (Core i7 or so, I guess?), a very poor BLAS 
implementation and a desktop graphics card?

   user  system elapsed-- for loop, single thread
 27.210   6.680  33.342 
   user  system elapsed-- BLAS mat mult
  6.260   0.000   5.982 
   user  system elapsed-- BLAS crossprod
  4.340   0.000   4.284 
   user  system elapsed-- CUDA gpuCrossprod
   1.490.001.48 

Just to put these numbers in perspective, here are my results for a MacBook Pro 
running Mac OS X 10.6.6 (Core 2 Duo, 2.5 GHz, 6 GB DDR2 RAM, Nvidia GeForce 
8600M GT with 512 MB RAM -- I suppose it's the M that breaks my performance 
here).

user  system elapsed-- for loop, single thread 
 141.034  35.299 153.783 
user  system elapsed-- BLAS mat mult
   2.791   0.025   1.805 
user  system elapsed-- BLAS crossprod
   1.419   0.039   0.863 
user  system elapsed-- CUDA gpuCrossprod
   1.431   0.119   1.718 


As you can see, my CPU/RAM is about 5x slower than your machine, CUDA is 
slightly slower (my card has 32 cores, but may have lower memory bandwidth 
and/or clock rate if yours is a desktop card), but vecLib BLAS beats CUDA by a 
factor of 2.


Kudos to the gputools developers: despite what the README says, the package 
compiles out of the box on Mac OS X 10.6, 64-bit R 2.12.1, with CUDA release 
3.2.  Thanks for this convenient package!


Best regards,
Stefan Evert

[ stefan.ev...@uos.de | http://purl.org/stefan.evert ]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Speed up sum of outer products?

2011-03-01 Thread AjayT
Hi, I'm new to R and stats, and I'm trying to speed up the following sum,

for (i in 1:n){
C = C + (X[i,] %o% X[i,])   # the sum of outer products - this is very 
slow
according to Rprof()
}

where X is a data matrix (nrows=1000 X ncols=50), and n=1000. The sum has to
be calculated over 10,000 times for different X. 

I think it is similar to estimating a covariance matrix for demeaned data
X. I tried using cov, but got different answers, and it wasn't much quicker?

Any help gratefully appreciated, 

-- 
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3330160.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up sum of outer products?

2011-03-01 Thread Phil Spector
What you're doing is breaking up the calculation of X'X 
into n steps.   I'm not sure what you mean by very slow:



X = matrix(rnorm(1000*50),1000,50)
n = 1000
system.time({C=matrix(0,50,50);for(i in 1:n)C = C + (X[i,] %o% X[i,])})

   user  system elapsed
  0.096   0.008   0.104

Of course, you could just do the calculation directly:


system.time({C1 = t(X) %*% X})

   user  system elapsed
  0.008   0.000   0.007 

all.equal(C,C1)

[1] TRUE


- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu



On Tue, 1 Mar 2011, AjayT wrote:


Hi, I'm new to R and stats, and I'm trying to speed up the following sum,

for (i in 1:n){
C = C + (X[i,] %o% X[i,])   # the sum of outer products - this is very 
slow
according to Rprof()
}

where X is a data matrix (nrows=1000 X ncols=50), and n=1000. The sum has to
be calculated over 10,000 times for different X.

I think it is similar to estimating a covariance matrix for demeaned data
X. I tried using cov, but got different answers, and it wasn't much quicker?

Any help gratefully appreciated,

--
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3330160.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up sum of outer products?

2011-03-01 Thread Doran, Harold
Isn't the following the canonical (R-ish) way of doing this:

X = matrix(rnorm(1000*50),1000,50)
system.time({C1 = t(X) %*% X}) # Phil's example

C2 <- crossprod(X) # use crossprod instead

 all.equal(C1,C2)
[1] TRUE

 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
 Behalf Of Phil Spector
 Sent: Tuesday, March 01, 2011 12:31 PM
 To: AjayT
 Cc: r-help@r-project.org
 Subject: Re: [R] Speed up sum of outer products?
 
 What you're doing is breaking up the calculation of X'X
 into n steps.   I'm not sure what you mean by very slow:
 
  X = matrix(rnorm(1000*50),1000,50)
  n = 1000
  system.time({C=matrix(0,50,50);for(i in 1:n)C = C + (X[i,] %o% X[i,])})
 user  system elapsed
0.096   0.008   0.104
 
 Of course, you could just do the calculation directly:
 
  system.time({C1 = t(X) %*% X})
 user  system elapsed
0.008   0.000   0.007
  all.equal(C,C1)
 [1] TRUE
 
 
   - Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spec...@stat.berkeley.edu
 
 
 
 On Tue, 1 Mar 2011, AjayT wrote:
 
  Hi, I'm new to R and stats, and I'm trying to speed up the following sum,
 
  for (i in 1:n){
  C = C + (X[i,] %o% X[i,])   # the sum of outer products - this is very
 slow
  according to Rprof()
  }
 
  where X is a data matrix (nrows=1000 X ncols=50), and n=1000. The sum has to
  be calculated over 10,000 times for different X.
 
  I think it is similar to estimating a covariance matrix for demeaned data
  X. I tried using cov, but got different answers, and it wasn't much quicker?
 
  Any help gratefully appreciated,
 
  --
  View this message in context: http://r.789695.n4.nabble.com/Speed-up-sum-of-
 outer-products-tp3330160p3330160.html
  Sent from the R help mailing list archive at Nabble.com.
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up sum of outer products?

2011-03-01 Thread AjayT
Hey, thanks a lot guys!!! That really speeds things up!!! I didn't know %*%
and crossprod could operate on matrices. I think you've saved me hours of
calculation time. Thanks again.

 system.time({C=matrix(0,50,50);for(i in 1:n)C = C + (X[i,] %o% X[i,])}) 
   user  system elapsed 
   0.450.000.90 
 system.time({C1 = t(X) %*% X}) 
   user  system elapsed 
   0.020.000.05 
 system.time({C2 = crossprod(X)})
   user  system elapsed 
   0.020.000.02 

-- 
View this message in context: 
http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3330378.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Speed up sum of outer products?

2011-03-01 Thread Dennis Murphy
...and this is where we cue the informative article on least squares
calculations in R by Doug Bates:

http://cran.r-project.org/doc/Rnews/Rnews_2004-1.pdf

HTH,
Dennis

On Tue, Mar 1, 2011 at 10:52 AM, AjayT ajaytal...@googlemail.com wrote:

 Hey, thanks a lot guys!!! That really speeds things up!!! I didn't know %*%
 and crossprod could operate on matrices. I think you've saved me hours of
 calculation time. Thanks again.

  system.time({C=matrix(0,50,50);for(i in 1:n)C = C + (X[i,] %o% X[i,])})
   user  system elapsed
0.450.000.90
  system.time({C1 = t(X) %*% X})
   user  system elapsed
0.020.000.05
  system.time({C2 = crossprod(X)})
   user  system elapsed
   0.020.000.02

 --
 View this message in context:
 http://r.789695.n4.nabble.com/Speed-up-sum-of-outer-products-tp3330160p3330378.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed up process

2011-02-25 Thread Ivan Calandra

Dear users,

I have a double for loop that does exactly what I want, but is quite 
slow. It is not so much with this simplified example, but IRL it is slow.

Can anyone help me improve it?

The data and code for foo_reg() are available at the end of the email; I 
preferred going directly into the problematic part.
Here is the code (I tried to simplify it but I cannot do it too much or 
else it wouldn't represent my problem). It might also look too complex 
for what it is intended to do, but my colleagues who are also supposed 
to use it don't know much about R. So I wrote it so that they don't have 
to modify the critical parts to run the script for their needs.


#column indexes for function
ind.xvar <- 2
seq.yvar <- 3:4
#position vector for legend(), stupid positioning but it doesn't matter here
mypos <- c("topleft", "topright", "bottomleft")

#run the function for columns 3 and 4 as y (seq.yvar) with column 2 as x
#(ind.xvar) for all 3 datasets (mydata_list)

par(mfrow=c(2,1))
for (i in seq_along(seq.yvar)){
  k <- seq.yvar[i]
  plot(mydata1[[k]]~mydata1[[ind.xvar]], type="p",
       xlab=names(mydata1)[ind.xvar], ylab=names(mydata1)[k])

  for (j in seq_along(mydata_list)){
    foo_reg(dat=mydata_list[[j]], xvar=ind.xvar, yvar=k, mycol=j,
            pos=mypos[j], name.dat=names(mydata_list)[j])
  }
}

I tried with lapply() or mapply() but couldn't manage to pass the 
arguments for names() and col= correctly, e.g. for the 2nd loop:
lapply(mydata_list, FUN=function(x){foo_reg(dat=x, xvar=ind.xvar, 
yvar=k, col1=1:3, pos=mypos[1:3], name.dat=names(x)[1:3])})
mapply(FUN=function(x) {foo_reg(dat=x, name.dat=names(x)[1:3])}, 
mydata_list, col1=1:3, pos=mypos, MoreArgs=list(xvar=ind.xvar, yvar=k))
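
For reference, a sketch of how the arguments could be matched up in
mapply(), using the foo_reg() arguments defined further below (k comes from
the outer loop; this only tidies the call, it will not by itself be faster):

mapply(foo_reg,
       dat      = mydata_list,
       mycol    = seq_along(mydata_list),
       pos      = mypos[seq_along(mydata_list)],
       name.dat = names(mydata_list),
       MoreArgs = list(xvar = ind.xvar, yvar = k))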


Thanks in advance for any hints.
Ivan




#create data (it looks horrible with these datasets but it doesn't
#matter here)
mydata1 <- structure(list(species = structure(1:8, .Label = c("alsen",
"gogor", "loalb", "mafas", "pacyn", "patro", "poabe", "thgel"), class =
"factor"), fruit = c(0.52, 0.45, 0.43, 0.82, 0.35, 0.9, 0.68, 0), Asfc =
c(207.463765, 138.5533755, 70.4391735, 160.9742745, 41.455809,
119.155109, 26.241441, 148.337377), Tfv = c(47068.1437773483,
43743.8087431582, 40323.5209129239, 23420.9455581495, 29382.6947428651,
50460.2202192311, 21810.1456510625, 41747.6053810881)), .Names =
c("species", "fruit", "Asfc", "Tfv"), row.names = c(NA, 8L), class =
"data.frame")


mydata2 <- mydata1[!(mydata1$species %in% c("thgel", "alsen")),]
mydata3 <- mydata1[!(mydata1$species %in% c("thgel", "alsen", "poabe")),]
mydata_list <- list(mydata1=mydata1, mydata2=mydata2, mydata3=mydata3)

#function for regression
library(WRS)
foo_reg <- function(dat, xvar, yvar, mycol, pos, name.dat){
 tsts <- tstsreg(dat[[xvar]], dat[[yvar]])
 tsts_inter <- signif(tsts$coef[1], digits=3)
 tsts_slope <- signif(tsts$coef[2], digits=3)
 abline(tsts$coef, lty=1, col=mycol)
 legend(x=pos, legend=c(paste("TSTS ", name.dat, ": Y=", tsts_inter, "+",
        tsts_slope, "X", sep="")), lty=1, col=mycol)

}

--
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de

**
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up process

2011-02-25 Thread Nick Sabbe
Simply avoiding the for loops by using lapply (I may have missed a bracket
here or there cause I did this without opening R)...
Haven't checked the speed up, though.

lapply(seq.yvar, function(k){
   plot(mydata1[[k]]~mydata1[[ind.xvar]], type="p",
        xlab=names(mydata1)[ind.xvar], ylab=names(mydata1)[k])
   lapply(seq_along(mydata_list), function(j){
     foo_reg(dat=mydata_list[[j]], xvar=ind.xvar, yvar=k, mycol=j,
             pos=mypos[j], name.dat=names(mydata_list)[j])
     return(NULL)
   })
   invisible(NULL)
})

HTH,

Nick Sabbe
--
ping: nick.sa...@ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36

-- Do Not Disapprove




-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
Behalf Of Ivan Calandra
Sent: vrijdag 25 februari 2011 11:20
To: r-help
Subject: [R] speed up process

Dear users,

I have a double for loop that does exactly what I want, but is quite 
slow. It is not so much with this simplified example, but IRL it is slow.
Can anyone help me improve it?

The data and code for foo_reg() are available at the end of the email; I 
preferred going directly into the problematic part.
Here is the code (I tried to simplify it but I cannot do it too much or 
else it wouldn't represent my problem). It might also look too complex 
for what it is intended to do, but my colleagues who are also supposed 
to use it don't know much about R. So I wrote it so that they don't have 
to modify the critical parts to run the script for their needs.

#column indexes for function
ind.xvar - 2
seq.yvar - 3:4
#position vector for legend(), stupid positioning but it doesn't matter here
mypos - c(topleft, topright,bottomleft)

#run the function for columns 34 as y (seq.yvar) with column 2 as x 
(ind.xvar) for all 3 datasets (mydata_list)
par(mfrow=c(2,1))
for (i in seq_along(seq.yvar)){
   k - seq.yvar[i]
   plot(mydata1[[k]]~mydata1[[ind.xvar]], type=p, 
xlab=names(mydata1)[ind.xvar], ylab=names(mydata1)[k])
   for (j in seq_along(mydata_list)){
 foo_reg(dat=mydata_list[[j]], xvar=ind.xvar, yvar=k, mycol=j, 
pos=mypos[j], name.dat=names(mydata_list)[j])
   }
}

I tried with lapply() or mapply() but couldn't manage to pass the 
arguments for names() and col= correctly, e.g. for the 2nd loop:
lapply(mydata_list, FUN=function(x){foo_reg(dat=x, xvar=ind.xvar, 
yvar=k, col1=1:3, pos=mypos[1:3], name.dat=names(x)[1:3])})
mapply(FUN=function(x) {foo_reg(dat=x, name.dat=names(x)[1:3])}, 
mydata_list, col1=1:3, pos=mypos, MoreArgs=list(xvar=ind.xvar, yvar=k))

Thanks in advance for any hints.
Ivan




#create data (it looks horrible with these datasets but it doesn't 
#matter here)
mydata1 <- structure(list(species = structure(1:8, .Label = c("alsen", 
"gogor", "loalb", "mafas", "pacyn", "patro", "poabe", "thgel"), class = 
"factor"), fruit = c(0.52, 0.45, 0.43, 0.82, 0.35, 0.9, 0.68, 0), Asfc = 
c(207.463765, 138.5533755, 70.4391735, 160.9742745, 41.455809, 
119.155109, 26.241441, 148.337377), Tfv = c(47068.1437773483, 
43743.8087431582, 40323.5209129239, 23420.9455581495, 29382.6947428651, 
50460.2202192311, 21810.1456510625, 41747.6053810881)), .Names = 
c("species", "fruit", "Asfc", "Tfv"), row.names = c(NA, 8L), class = 
"data.frame")

mydata2 <- mydata1[!(mydata1$species %in% c("thgel", "alsen")), ]
mydata3 <- mydata1[!(mydata1$species %in% c("thgel", "alsen", "poabe")), ]
mydata_list <- list(mydata1=mydata1, mydata2=mydata2, mydata3=mydata3)

#function for regression
library(WRS)
foo_reg <- function(dat, xvar, yvar, mycol, pos, name.dat){
  tsts <- tstsreg(dat[[xvar]], dat[[yvar]])
  tsts_inter <- signif(tsts$coef[1], digits=3)
  tsts_slope <- signif(tsts$coef[2], digits=3)
  abline(tsts$coef, lty=1, col=mycol)
  legend(x=pos, legend=c(paste("TSTS ", name.dat, ": Y=", tsts_inter, "+",
    tsts_slope, "X", sep="")), lty=1, col=mycol)
}

-- 
Ivan CALANDRA
PhD Student
University of Hamburg
Biozentrum Grindel und Zoologisches Museum
Abt. Säugetiere
Martin-Luther-King-Platz 3
D-20146 Hamburg, GERMANY
+49(0)40 42838 6231
ivan.calan...@uni-hamburg.de

**
http://www.for771.uni-bonn.de
http://webapp5.rrz.uni-hamburg.de/mammals/eng/1525_8_1.php

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] speed up process

2011-02-25 Thread Ivan Calandra

Thanks Nick for your quick answer.
It does work (no missing bracket!) but unfortunately doesn't really speed 
up anything: with my real data, it takes 82.78 seconds with the double 
lapply() instead of 83.59 s with the double loop (a gain of about 0.8 s).


It looks like my double loop was not that bad. Does anyone know another 
faster way to do this?


Thanks again in advance,
Ivan


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up process

2011-02-25 Thread Jim Holtman
Use Rprof to find where time is being spent. It is probably in 'plot', which 
might imply it is not the 'for' loop and therefore beyond your control.

Sent from my iPad


Re: [R] speed up process

2011-02-25 Thread Ivan Calandra

Dear Jim,

I've tried to use Rprof() as you advised me, but I don't understand how 
it works.

I've done this:
Rprof(for (i in seq_along(seq.yvar)){
  all_my_commands
})
summaryRprof()

But I got this error:
Error in summaryRprof() : no lines found in ‘Rprof.out’

I couldn't really understand from the help page what I should do.

In any case, it's clear that the function tstsreg() is what takes the 
most computing time. But I wanted to optimize the rest of the code to 
gain as much speed as possible.


Ivan


Re: [R] speed up process

2011-02-25 Thread jim holtman
You invoke Rprof, run your code and then terminate it:


Rprof()
... code you want to profile
Rprof(NULL)  # generate output
summaryRprof()

example:


> Rprof()
> for (i in 1:1e6) sin(i) + cos(i) + sqrt(i)
> Rprof(NULL)
> summaryRprof()
$by.self
     self.time self.pct total.time total.pct
sin       0.24    30.77       0.24     30.77
sqrt      0.22    28.21       0.22     28.21
cos       0.16    20.51       0.16     20.51
+         0.14    17.95       0.14     17.95
:         0.02     2.56       0.02      2.56

$by.total
     total.time total.pct self.time self.pct
sin        0.24     30.77      0.24    30.77
sqrt       0.22     28.21      0.22    28.21
cos        0.16     20.51      0.16    20.51
+          0.14     17.95      0.14    17.95
:          0.02      2.56      0.02     2.56

$sample.interval
[1] 0.02

$sampling.time
[1] 0.78



Re: [R] speed up process

2011-02-25 Thread Ivan Calandra

Ha... it was way too simple!
I thought it would be like system.time()... my bad. Thanks for the tip!

As we thought, foo_reg() takes most of the computing time, and I cannot 
improve that.

Any ideas of how to improve the rest?

Thanks again for your help
Ivan



Re: [R] speed up the code

2011-02-18 Thread rex.dwyer
Yes, remove the call to intersect, and rely on the results of match to tell you 
whether there is an overlap.  If there are any matches,  all(is.na(index)) will 
be false.  Read help for match.

?match
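
A minimal sketch of that idea (not from the original post; it assumes the same 
fun_activation() arguments as above and that s_k and s_hat contain no 
duplicated elements):

# drop intersect(); match() alone tells us which s_hat elements occur in s_k
fun_activation2 <- function(s_k, s_hat, x_k, s_hat_len)
{
    index <- match(s_hat, s_k)      # NA where an s_hat element is absent from s_k
    hit   <- !is.na(index)
    if (!any(hit)) return(0)
    round(sum(x_k[index[hit]]) * sum(hit) / (s_hat_len * length(s_k)), 3)
}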


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed up the code

2011-02-16 Thread Hui Du

Hi All,

The following is a snippet of my code. It works fine but it is very slow. Is it 
possible to speed it up by using a different data structure or a better solution? 
For 4 runs, it takes 8 minutes now. Thanks a lot.



fun_activation = function(s_k, s_hat, x_k, s_hat_len)
{
    # score the overlap between the symbol set s_k and the query set s_hat
    common = intersect(s_k, s_hat);
    if(length(common) != 0)
    {
        index  = match(common, s_k);
        round(sum(x_k[index]) * length(common) / (s_hat_len * length(s_k)), 3);
    }
    else
    {
        0;
    }
}

fun_x = function(a)
{
    round(runif(length(a), 0, 1), 2);
}

symbol_len = 50;
PHI_set = 1:symbol_len;

# M x M matrix of random, sorted symbol subsets of random size
S = matrix(replicate(M * M, sort(sample(PHI_set, sample(symbol_len, 1)))), M, M);
X = matrix(mapply(fun_x, S), M, M);

S_hat = c(28, 34, 35)
S_hat_len = length(S_hat);

S_hat_matrix = matrix(list(S_hat), M, M);

system.time(
for(I in 1:4)
{
    A = matrix(mapply(fun_activation, S, S_hat_matrix, X, S_hat_len), M, M);
}
)



HXD



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up subsetting with certain conditions

2011-01-13 Thread Duke

On 1/12/11 6:44 PM, Duke wrote:


Thanks so much for your suggestion Martin. I had Bioconductor 
installed but I honestly do not know all its applications. Anyway, I 
am testing GenomicRanges with my data now. I will report back when I 
get the result.




I got the results. My code took ~580 min (~10 hrs) to finish, whereas 
using GenomicRanges as Martin suggested, it took only 22 min (about 26 
times faster!). Thanks so much for this improvement, Martin.


D.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] speed up subsetting with certain conditions

2011-01-12 Thread Duke

Hi folks,

I am working on a project that requires subsetting of a found file based 
on some known file. The known file contains several lines like below:


chr1    3237546    3237547    rs52310428    0    +
chr1    3237549    3237550    rs52097582    0    +
chr2    4513326    4513327    rs29769280    0    +
chr2    4513337    4513338    rs33286009    0    +

where the first column can be chr2, chr1, chr12 etc... The second and 
third are numbers (coordinates). The found file contains lines like:


chr1    3213435    G    C
chr1    3237547    T    C
chr1    3237549    G    T
chr2    4513326    A    G
chr2    4513337    C    G

where the first column, again, can be chr1, chr2, chr12 etc... and the 
second is a number. What I have to do is to separate the found file into 
two files: one (foundY) contains lines whose first column matches and 
whose second column falls within columns 2 and 3 of some line of the 
known file, and one (foundN) contains lines that do not meet that 
condition. For the two examples above, foundN will be the first 
line, and foundY will be the next 4 lines.


What I came up with is this algorithm:

* get the unique items in the first column of the found file (chr1, chr2, 
chr12, chr13 etc...)
* for each of those items, take the subsets of the known file and the found 
file that have the same first column, then scan each row of the found 
subset against the known subset to see whether the condition is met


The code is like below:

## CODE START###
# import known and found files to data frames
known <- read.table( "known.txt", sep="\t", header=FALSE )
found <- read.table( "found.txt", sep="\t", header=FALSE, fill=TRUE )

# get the uniq item in first column of found file
found.Chr <- as.character(found[!duplicated(found[[1]]),1])

# create two empty result data frames
foundN <- found[0,]
foundY <- found[0,]

# scan for each of the uniq items
for ( iChr in found.Chr ) {
  # subset of known and found with specific item
  found.iChr <- found[found[[1]]==iChr,]
  known.iChr <- known[known[[1]]==iChr,]

  # scan through all found subset items
  if ( nrow(known.iChr)>0 ) {
    for ( i in 1:nrow(found.iChr) ) {
      if ( nrow(known.iChr[known.iChr[[3]]>=found.iChr[i,2] & 
known.iChr[[2]]<=found.iChr[i,2],])==0 ) {

  foundN <- rbind( foundN, found.iChr[i,] )
  } else {
  foundY <- rbind( foundY, found.iChr[i,] )
  }
}
  }
}

## CODE END###

The code works well, but I tested it only with small known and found 
files. When trying with larger files (the known file can contain ~15 
million lines, the found file ~15k lines), it takes hours to run.


I want to speed up the process, and I believe there must be a better 
algorithm to do this with R. My questions are:


* does anybody have a better algorithm, or any comments or suggestions?
* I read (on Google) that matrices work faster than data frames. Can I use 
matrices for this case? (are matrices for numbers only?)
* I read (on Google) that I should avoid rbind and preallocate the data frame 
for faster speed. How would I do that in this case?
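
One way the rbind()-free pattern could look here (an illustrative sketch, not 
part of the original post: it preallocates one logical index over all rows of 
'found' and splits once at the end, so nothing is grown row by row):

hit <- logical(nrow(found))                      # preallocated once
for (iChr in found.Chr) {
    fi <- which(found[[1]] == iChr)
    ki <- known[known[[1]] == iChr, ]
    if (nrow(ki) > 0) {
        # TRUE if the found position falls inside any known interval on this chromosome
        hit[fi] <- vapply(found[fi, 2], function(p)
            any(ki[[2]] <= p & ki[[3]] >= p), logical(1))
    }
}
foundY <- found[hit, ]
foundN <- found[!hit, ]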


Thank you very much in advance,

Bests,

D.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up subsetting with certain conditions

2011-01-12 Thread Martin Morgan



The Bioconductor project has many tools for dealing with 
sequence-related data. With the data


k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))

f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))

One might use the GenomicRanges package as

library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.

One could write foundY with as.data.frame(fgr[idx]) (maybe a little 
editing) but likely one would want to stay in R / Bioc and do something 
more interesting...
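
For instance, the two output files could be produced along these lines (a 
sketch using the same object names as above; the exact columns to keep and 
the output file names are up to the reader):

foundY <- as.data.frame(fgr[idx])     # found lines covered by a known range
foundN <- as.data.frame(fgr[!idx])    # found lines with no overlap
write.table(foundY, "foundY.txt", sep="\t", quote=FALSE, row.names=FALSE)
write.table(foundN, "foundN.txt", sep="\t", quote=FALSE, row.names=FALSE)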


See

http://bioconductor.org/install/index.html

Martin





--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] speed up subsetting with certain conditions

2011-01-12 Thread Duke

On 1/12/11 6:12 PM, Martin Morgan wrote:
The Bioconductor project has many tools for dealing with 
sequence-related data. With the data


k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))

f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))

One might use the GenomicRanges package as

library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0

resulting in

> idx
[1] FALSE  TRUE  TRUE  TRUE  TRUE

This will be fast.


Thanks so much for your suggestion Martin. I had Bioconductor installed 
but I honestly do not know all its applications. Anyway, I am testing 
GenomicRanges with my data now. I will report back when I get the result.




One could write foundY with as.data.frame(fgr[idx]) (maybe a little 
editing) but likely one would want to stay in R / Bioc and do 
something more interesting...




I suppose foundN <- as.data.frame(fgr[!idx]) and foundY <- 
as.data.frame(fgr[idx]) as you suggested, but I don't really understand 
your last comment :).


Thanks,

D.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

