Hi: There's definitely something amiss with aggregate() here since similar functions from other packages can reproduce your 'control' sum. I expect ddply() will have some timing issues because of all the subgrouping in your data frame, but data.table did very well and the summaryBy() function in the doBy package did OK:

library(data.table)
library(doBy)

# Utility function to remove missing values when computing the sum
# for use in summaryBy()
f <- function(x) sum(x, na.rm = TRUE)

> system.time({
+     dt <- data.table(dat)
+     setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)
+     udt <- dt[, list(y = sum(y, na.rm = TRUE)),
+               by = 'x1, x2, x3, x4, x5, x6, x7, x8']
+ })
   user  system elapsed
   7.52    1.56    9.14

> system.time(
+     udb <- summaryBy(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
+                      data = dat, FUN = f)
+ )
   user  system elapsed
  83.82    0.83   85.21

# To verify against the control sum:
> sum(udt$y)
[1] -5611158
> sum(udb$y)
[1] -5611158

Notice the difference in the number of rows of uda in comparison to the
other two:

> uda <- aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
+                  data = dat, FUN = f)
> sum(uda$y)
[1] 6396661
> dim(uda)
[1] 77321     9
> dim(udt)
[1] 568353      9
> dim(udb)
[1] 568353      9

I used four different approaches to aggregate(), excluding the one above:

aggregate(y ~ ., data = dat, FUN = f)
aggregate(y ~ ., data = dat, FUN = sum, na.rm = TRUE)
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
          data = dat, FUN = sum, na.rm = TRUE)
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
          data = dat, FUN = sum, na.action = na.pass)

All yielded the same answer as in uda.

data.table has some distinct advantages in situations such as this - see
its vignette and FAQ for details.

HTH,
Dennis

On Sun, Feb 6, 2011 at 12:54 PM, Gene Leynes <gleyne...@gmail.com> wrote:

> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <iz...@psych.rochester.edu> wrote:
>
> > However, I don't think you've told us what you're actually trying to
> > accomplish...
>
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for many reasons. Partially, I'm using an
> abstracted example to comply with the posting guidelines of having a
> reproducible example.
> I'm really aggregating some incredibly boring and
> complex customer data for an undisclosed client.
>
> As it turns out, aggregate() will not work when some of the x's are NA,
> unless you convert them to factors with NA's included.
>
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
>
> My real data looks more like this (except with a few more categories and
> records):
>
> set.seed(100)
> library(plyr)
> dat=data.frame(
>     x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
>     x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
>     x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
>     x4=sample(c(NA,T,F), 2e6, replace=TRUE),
>     x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
>               replace=TRUE),
>     x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
>     x7=sample(c(NA,'married','divorced','separated','single','etc'),
>               2e6, replace=TRUE),
>     x8=sample(c(NA,T,F), 2e6, replace=TRUE),
>     y=trunc(rnorm(2e6)*10000), stringsAsFactors=F)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=T)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=T)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8), function(x)
>     {data.frame(y.sum=sum(x$y,na.rm=TRUE))})$y.sum)
>
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time learning
> how to use it. Just now it worked for 1m records, and it was just a bit
> slower than aggregate. But for the 2m example it hasn't finished
> calculating.
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
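
For anyone who wants to see the NA behaviour discussed in this thread without building the 2m-row data frame, here is a minimal sketch on a toy data set. The names toy, g, g_na and v are made up for illustration and are not from the original posts. It assumes only base R, and it uses addNA() as one way of "converting to factors with NA's included"; other approaches may suit large data better.

```r
# Toy data (hypothetical): one grouping column with a missing value
toy <- data.frame(g = c("a", "a", NA, "b"),
                  v = c(1, 2, 10, 4),
                  stringsAsFactors = FALSE)

# Control total over every row
control <- sum(toy$v, na.rm = TRUE)

# Formula interface: the row whose grouping value is NA is dropped,
# so the aggregated total falls short of the control total
agg_default <- aggregate(v ~ g, data = toy, FUN = sum)

# Turning NA into an explicit factor level keeps that row as its own group
toy$g_na <- addNA(factor(toy$g))
agg_keep <- aggregate(v ~ g_na, data = toy, FUN = sum)

sum(agg_default$v) == control  # FALSE: the NA group is missing
sum(agg_keep$v) == control     # TRUE: the NA group is retained
```

On the real data the same check (aggregated total against the control total) is a quick way to tell whether any groups were silently dropped.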