I've also made some comparisons, and in terms of execution time sqldf wins. summaryBy is better than aggregate in some specific situations I have met in practice; I present one such situation below. It assumes that there are at least two grouping variables, each with a high number of levels.

n <- 100000
grp1 <- sample(1:750, n, replace=TRUE)
grp2 <- sample(1:750, n, replace=TRUE)
d <- data.frame(x=rnorm(n), y=rnorm(n), grp1=grp1, grp2=grp2)

# sqldf
library(sqldf)
Rprof('prof')
sqldf("select grp1, grp2, avg(x), avg(y) from d group by grp1, grp2")
Rprof(NULL)
summaryRprof('prof')

# by
# do.call(rbind, by(d, list(d$grp1, d$grp2), function(x) transform(x, x = mean(x), y = mean(y))[1,,drop = FALSE ]))

# doBy
library(doBy)
Rprof('prof')
summaryBy(x+y~grp1+grp2, data=d, FUN=c(mean))
Rprof(NULL)
summaryRprof('prof')

# aggregate
Rprof('prof')
aggregate(d, list(d$grp1, d$grp2), function(x) mean(x))
Rprof(NULL)
summaryRprof('prof')

---------- Forwarded message ----------
From: Nikhil Kaza <nikhil.l...@gmail.com>
Date: 2009/12/9
Subject: Re: [R] conditionally merging adjacent rows in a data frame
To: Titus von der Malsburg <malsb...@gmail.com>
Cc: r-help@r-project.org

This is great!! Sqldf is exactly the kind of thing I was looking for, other stuff.

I suppose you can speed up both functions 1 and 5 using aggregate and tapply only once, as was suggested earlier. But it comes at the expense of readability.

Nikhil

On 9 Dec 2009, at 7:59AM, Titus von der Malsburg wrote:

> On Wed, Dec 9, 2009 at 12:11 AM, Gabor Grothendieck
> <ggrothendi...@gmail.com> wrote:
>>
>> Here are a couple of solutions. The first uses by and the second sqldf:
>
> Brilliant! Now I have a whole collection of solutions. I did a simple
> performance comparison with a data frame that has 7929 lines.
>
> The results were as follows (loading appropriate packages is not included in
> the measurements):
>
> times <- c(0.248, 0.551, 41.080, 0.16, 0.190)
> names(times) <- c("aggregate","summaryBy","by+transform","sqldf","tapply")
> barplot(times, log="y", ylab="log(s)")
>
> So sqldf clearly wins followed by tapply and aggregate. summaryBy is slower
> than necessary because it computes for x and dur both, mean /and/ sum.
> by+transform presumably suffers from the construction of many intermediate data
> frames.
>
> Are there any canonical places where R-recipes are collected? If yes I would
> write-up a summary.
>
> These were the competitors:
>
> # Gary's and Nikhil's aggregate solution:
>
> aggregate.fixations1 <- function(d) {
>
>   idx <- c(TRUE,diff(d$roi)!=0)
>   d2 <- d[idx,]
>
>   idx <- cumsum(idx)
>   d2$dur <- aggregate(d$dur, list(idx), sum)[2]
>   d2$x <- aggregate(d$x, list(idx), mean)[2]
>
>   d2
> }
>
> # Marek's summaryBy:
>
> library(doBy)
>
> aggregate.fixations2 <- function(d) {
>
>   idx <- c(TRUE,diff(d$roi)!=0)
>   d2 <- d[idx,]
>
>   d$idx <- cumsum(idx)
>   d2$r <- summaryBy(dur+x~idx, data=d, FUN=c(sum,
>     mean))[c("dur.sum", "x.mean")]
>   d2
> }
>
> # Gabor's by+transform solution:
>
> aggregate.fixations3 <- function(d) {
>
>   idx <- cumsum(c(TRUE,diff(d$roi)!=0))
>
>   d2 <- do.call(rbind, by(d, idx, function(x)
>     transform(x, dur = sum(dur), x = mean(x))[1,,drop = FALSE ]))
>
>   d2
> }
>
> # Gabor's sqldf solution:
>
> library(sqldf)
>
> aggregate.fixations4 <- function(d) {
>
>   idx <- c(TRUE,diff(d$roi)!=0)
>   d2 <- d[idx,]
>
>   d$idx <- cumsum(idx)
>   d2$r <- sqldf("select sum(dur), avg(x) x from d group by idx")
>
>   d2
> }
>
> # Titus' solution using plain old tapply:
>
> aggregate.fixations5 <- function(d) {
>
>   idx <- c(TRUE,diff(d$roi)!=0)
>   d2 <- d[idx,]
>
>   idx <- cumsum(idx)
>   d2$dur <- tapply(d$dur, idx, sum)
>   d2$x <- tapply(d$x, idx, mean)
>
>   d2
> }
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Marek