Re: [R] Performance of 'by' and 'ddply' on a large data frame

2009-11-20 Thread Tahir Butt
A faster solution using tapply was sent to me via email:

testtapply = function(p){
    df = randomdf(p)
    system.time({
        # tapply() simplifies its result to a named numeric vector, so the
        # Date class has to be restored before joining back on x1
        res = tapply(df$x2, df$x1, min)
        res = as.Date(res, origin = as.Date('1970-01-01'))
        df$mindate = res[as.character(df$x1)]
    })
}
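For anyone following along, here is a small self-contained check of that approach (the toy data frame and the dates in it are mine, chosen only for illustration):

```r
# Tiny data frame with two groups of dates
df <- data.frame(x1 = c(1, 1, 2),
                 x2 = as.Date(c("2009-11-05", "2009-11-01", "2009-11-19")))

# tapply() simplifies to a named numeric vector (days since 1970-01-01),
# so the Date class has to be restored afterwards
res <- tapply(df$x2, df$x1, min)
res <- as.Date(res, origin = "1970-01-01")

# Look up each row's group minimum by group name
df$mindate <- res[as.character(df$x1)]
```

The names on the tapply() result are the x1 values, which is what makes the final lookup by `as.character(df$x1)` line up row by row.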

Thanks Phil!

Tahir

On Thu, Nov 19, 2009 at 5:19 PM, Tahir Butt tahir.b...@gmail.com wrote:
 I've only recently started using R. One of the problems I run into is
 that, after extracting a large dataset (5M rows) out of a database, I
 realize I need another variable. In this case I have a data frame with
 dates. I want to find the minimum date for each value of x1 and add
 that minimum date to my data.frame.

 randomdf <- function(p) {
     # random keys, random dates from the last ~3 years, random values
     data.frame(x1 = sample(1:10^4, 10^p, replace = TRUE),
                x2 = sample(seq.Date(Sys.Date() - 356*3, Sys.Date(), by = "day"),
                            10^p, replace = TRUE),
                y1 = sample(1:100, 10^p, replace = TRUE))
 }
 testby <- function(p) {
     df <- randomdf(p)
     system.time(by(df, df$x1, function(dfi) min(dfi$x2)))
 }
 lapply(c(1,2,3,4,5), testby)
 [[1]]
   user  system elapsed
  0.006   0.000   0.006

 [[2]]
   user  system elapsed
  0.024   0.000   0.025

 [[3]]
   user  system elapsed
  0.233   0.000   0.234

 [[4]]
   user  system elapsed
  1.996   0.026   2.022

 [[5]]
   user  system elapsed
 11.030   0.000  11.032

 Strangely enough (I'm not sure why), the result of by with the min
 function is not Date objects but integers representing days since an
 origin. Is there a min function that would return a Date instead of an
 integer? Or is this a result of using by?
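Whatever step does the simplification, the underlying values are day counts since 1970-01-01, so the class can always be restored explicitly (a minimal sketch; the date used is arbitrary):

```r
# A Date is stored internally as days since 1970-01-01, so a result that
# has been simplified down to an integer can be converted back
d <- as.integer(as.Date("2009-11-19"))   # class stripped: plain integer
restored <- as.Date(d, origin = "1970-01-01")
```

This is the same origin trick used in the tapply() solution above.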

 I also wanted to see how ddply compares.

 library(plyr)  # ddply lives in plyr
 testddply <- function(p) {
     pdf <- randomdf(p)
     system.time(ddply(pdf, .(x1), function(df) data.frame(min(df$x2))))
 }
 lapply(c(1,2,3,4,5), testddply)
 [[1]]
   user  system elapsed
  0.020   0.000   0.021

 [[2]]
   user  system elapsed
  0.119   0.000   0.119

 [[3]]
   user  system elapsed
  1.008   0.000   1.008

 [[4]]
   user  system elapsed
  8.425   0.001   8.428

 [[5]]
   user  system elapsed
  23.070   0.000  23.075

 Once the data frame gets above 1M rows, the timings become too long
 (on a previous run the user time went up to 8000s). This seems quite a
 bit slower than I expected. Maybe there's a better and faster way to
 add such aggregation-derived variables to a data frame.

 Also, ddply seems to take twice as long as by. Are these two
 operations not equivalent?
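One vectorised base-R alternative for this particular task, not taken from the thread but from base R: ave() computes a per-group statistic already aligned with the original rows, which avoids the explicit join (toy data is mine):

```r
df <- data.frame(x1 = c(1, 2, 1, 2),
                 x2 = as.Date(c("2009-11-05", "2009-11-10",
                                "2009-11-01", "2009-11-19")))

# ave() returns a vector the same length as its input, so the group
# minimum lands directly in the right rows; the Date class is lost in
# the numeric round-trip, so restore it afterwards
df$mindate <- as.Date(ave(as.numeric(df$x2), df$x1, FUN = min),
                      origin = "1970-01-01")
```

Because no intermediate per-group table is built and then merged back, this tends to scale better than a split-apply-combine over many small groups.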

 Thanks,
 Tahir


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Performance of 'by' and 'ddply' on a large data frame

2009-11-19 Thread Tahir Butt
