Try the 'data.table' package. It took about 3 seconds to aggregate the 2 million rows into the 500K+ levels (568,594 groups). Is this what you were after?
> # note the characters are converted to factors that 'data.table' likes
> dat=data.frame(
+     x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
+     x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
+     x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
+     x4=sample(c(NA,T,F), 2e6, replace=TRUE),
+     x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
+         replace=TRUE),
+     x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
+     x7=sample(c(NA,'married','divorced','separated','single','etc'),
+         2e6, replace=TRUE),
+     x8=sample(c(NA,T,F), 2e6, replace=TRUE),
+     y=trunc(rnorm(2e6)*10000))
> str(dat)
'data.frame':   2000000 obs. of  9 variables:
 $ x1: Factor w/ 2 levels "f","m": NA NA 2 NA NA NA NA 1 1 1 ...
 $ x2: int  4 5 3 10 10 7 1 1 3 5 ...
 $ x3: Factor w/ 5 levels "a","b","c","d",..: 3 2 1 2 1 5 1 1 2 1 ...
 $ x4: logi  NA TRUE TRUE NA FALSE NA ...
 $ x5: Factor w/ 4 levels "active","deleted",..: 4 3 3 2 2 1 1 NA 3 3 ...
 $ x6: int  NA 2 7 2 1 9 NA 1 1 9 ...
 $ x7: Factor w/ 5 levels "divorced","etc",..: 1 3 5 NA 2 3 1 2 2 2 ...
 $ x8: logi  NA NA NA FALSE FALSE FALSE ...
 $ y : num  3066 -13237 -7840 9728 1596 ...
> require(data.table)
> dat <- data.table(dat)
> system.time(result <- dat[, sum(y), by = list(x1,x2,x3,x4,x5,x6,x7,x8)])
   user  system elapsed
   3.11    0.16    3.26
> str(result)
Classes ‘data.table’ and 'data.frame':  568594 obs. of  9 variables:
 $ x1: Factor w/ 2 levels "f","m": NA NA NA NA NA NA NA NA NA NA ...
 $ x2: int  NA NA NA NA NA NA NA NA NA NA ...
 $ x3: Factor w/ 5 levels "a","b","c","d",..: NA NA NA NA NA NA NA NA NA NA ...
 $ x4: logi  NA NA NA NA NA NA ...
 $ x5: Factor w/ 4 levels "active","deleted",..: NA NA NA NA NA NA NA NA NA NA ...
 $ x6: int  NA NA NA NA NA NA NA NA NA NA ...
 $ x7: Factor w/ 5 levels "divorced","etc",..: NA NA NA 1 1 1 2 2 2 3 ...
 $ x8: logi  NA FALSE TRUE NA FALSE TRUE ...
 $ V1: num  6641 -18158 3 -11202 -14437 ...
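
If you want to check that the grouped result still adds up to your control total, the following is a minimal sketch on a made-up six-row data set. The names toy, y_sum and the *_total variables are invented for illustration and are not from your data; it assumes the data.table package is installed. It compares the sum of y before grouping with the sum after grouping, once with base aggregate() (which, per its documentation, omits rows with missing values in any of the by variables) and once with data.table (which keeps NA as a group of its own).

```r
library(data.table)

# Hypothetical toy data: two grouping columns that contain NA,
# and a numeric y with no missing values.
toy <- data.frame(
  x1 = c("m", "f", NA, "m", NA, "f"),
  x2 = c(1L, NA, 2L, 1L, 2L, NA),
  y  = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# The control total: sum of y before any grouping.
control_total <- sum(toy$y)

# Base aggregate() drops every row where a grouping variable is NA,
# so its total can fall short of the control total.
agg <- aggregate(toy$y, toy[, c("x1", "x2")], sum)
aggregate_total <- sum(agg$x)

# data.table treats NA as an ordinary group value, so no rows are lost.
dt <- as.data.table(toy)
res <- dt[, list(y_sum = sum(y)), by = list(x1, x2)]
datatable_total <- sum(res$y_sum)

print(c(control = control_total,
        aggregate = aggregate_total,
        data.table = datatable_total))
```

On this toy data only the ("m", 1) combination has no NA in its grouping columns, so the aggregate() total covers just those two rows, while the data.table total should equal the control total.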

On Sun, Feb 6, 2011 at 3:54 PM, Gene Leynes <gleyne...@gmail.com> wrote:
> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <iz...@psych.rochester.edu> wrote:
>
>> >
>> > However, I don't think you've told us what you're actually trying to
>> > accomplish...
>> >
>
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for many reasons. Partially, I'm using an
> abstracted example to comply with the posting guidelines of having a
> reproducible example. I'm really aggregating some incredibly boring and
> complex customer data for an undisclosed client.
>
> As it turns out,
> aggregate will not work when some of the x's are NA, unless you convert them
> to factors, with NA's included.
>
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
>
> My real data looks more like this (except with a few more categories and
> records):
>
> set.seed(100)
> library(plyr)
> dat=data.frame(
>     x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
>     x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
>     x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
>     x4=sample(c(NA,T,F), 2e6, replace=TRUE),
>     x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
>         replace=TRUE),
>     x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
>     x7=sample(c(NA,'married','divorced','separated','single','etc'),
>         2e6, replace=TRUE),
>     x8=sample(c(NA,T,F), 2e6, replace=TRUE),
>     y=trunc(rnorm(2e6)*10000), stringsAsFactors=F)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=T)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=T)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8), function(x)
>     {data.frame(y.sum=sum(x$y,na.rm=TRUE))})$y.sum)
>
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time learning
> how to use it.
> Just now it worked for 1m records, and it was just a bit
> slower than aggregate. But for the 2m example it hasn't finished
> calculating.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?