Hi:

There's definitely something amiss with aggregate() here since similar
functions from other packages can reproduce your 'control' sum. I expect
ddply() will have some timing issues because of all the subgrouping in your
data frame, but data.table did very well and the summaryBy() function in the
doBy package did OK:

library(data.table)
library(doBy)
# Utility function to remove missing values when computing the sum
# for use in summaryBy()
f <- function(x) sum(x, na.rm = TRUE)

> system.time({
+ dt <- data.table(dat)
+ setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)
+ udt <- dt[, list(y = sum(y, na.rm = TRUE)),
+             by = 'x1, x2, x3, x4, x5, x6, x7, x8']
+             })
   user  system elapsed
   7.52    1.56    9.14
> system.time(
+ udb <- summaryBy(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
+                    data = dat, FUN = f)
+            )
   user  system elapsed
  83.82    0.83   85.21

# To verify against the control sum:
> sum(udt$y)
[1] -5611158
> sum(udb$y)
[1] -5611158

Notice how many fewer rows uda (the aggregate() result computed below) has
than the other two:

> uda <- aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
+                  data = dat, FUN = f)
> sum(uda$y)
[1] 6396661
> dim(uda)
[1] 77321     9
> dim(udt)
[1] 568353      9
> dim(udb)
[1] 568353      9
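
The missing rows are consistent with the documented behaviour of
aggregate(): rows with a missing value in any grouping variable are omitted
from the result. A minimal sketch on made-up data (toy and its columns are
hypothetical names, not part of your example):

```r
# Hypothetical toy data: the third row has a missing grouping value
toy <- data.frame(g = c("a", "a", NA, "b"),
                  y = c(1, 2, 10, 4),
                  stringsAsFactors = FALSE)

# aggregate() drops the row whose grouping value is NA
agg <- aggregate(y ~ g, data = toy, FUN = sum)

control    <- sum(toy$y)  # total over all rows
aggregated <- sum(agg$y)  # total over the groups aggregate() kept

print(agg)
print(c(control = control, aggregated = aggregated))
```

The aggregated total falls short of the control total by the y value on the
NA row, which is the same kind of mismatch as between sum(uda$y) and your
control sum.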

Besides the one above, I tried four other calls to aggregate():

aggregate(y ~ ., data = dat, FUN = f)
aggregate(y ~ ., data = dat, FUN = sum, na.rm = TRUE)
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
          data = dat, FUN = sum, na.rm = TRUE)
aggregate(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8,
          data = dat, FUN = sum, na.action = na.pass)

All yielded the same answer as in uda.
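
As you noted, aggregate() does keep the NA groups once the grouping columns
are factors that include NA. A small sketch using addNA() from base R on
made-up data (toy, g and g_f are hypothetical names); I have not measured
what this conversion costs in memory on 2e6 rows:

```r
# Hypothetical toy data: the third row has a missing grouping value
toy <- data.frame(g = c("a", "a", NA, "b"),
                  y = c(1, 2, 10, 4),
                  stringsAsFactors = FALSE)

# Make NA an explicit factor level so it is no longer treated as missing
toy$g_f <- addNA(factor(toy$g))

agg_na <- aggregate(y ~ g_f, data = toy, FUN = sum)
print(agg_na)

# The grouped total now matches the control total
print(c(control = sum(toy$y), aggregated = sum(agg_na$y)))
```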

data.table has some distinct advantages in situations such as this - see its
vignette and FAQ for details.
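
If installing a package is not an option, one base R route that should keep
the NA groups is to paste the grouping columns into a single key and sum
with rowsum(). This is only a sketch on made-up data (toy, key and the "|"
separator are my assumptions, and I have not timed it on your data). It
would also confuse a literal "NA" string with a missing value:

```r
# Hypothetical toy data with missing values in both grouping columns and in y
toy <- data.frame(g1 = c("a", "a", NA, "b", NA),
                  g2 = c(1, 1, 2, NA, 2),
                  y  = c(1, 2, 10, 4, NA),
                  stringsAsFactors = FALSE)

# paste() turns NA into the string "NA", so no key is missing
key <- paste(toy$g1, toy$g2, sep = "|")

# rowsum() returns a one-column matrix with one row per distinct key
sums <- rowsum(toy$y, key, na.rm = TRUE)
print(sums)

# Compare against the control total
print(c(control = sum(toy$y, na.rm = TRUE), grouped = sum(sums)))
```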

HTH,
Dennis


On Sun, Feb 6, 2011 at 12:54 PM, Gene Leynes <gleyne...@gmail.com> wrote:

> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <iz...@psych.rochester.edu>
> wrote:
>
> > >
> > > However, I don't think you've told us what you're actually trying to
> > > accomplish...
> > >
> >
>
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for many reasons.  Partially, I'm using an
> abstracted example to comply with the posting guidelines of having a
> reproducible example.  I'm really aggregating some incredibly boring and
> complex customer data for an undisclosed client.
>
> As it turns out,
> aggregate() will not work when some of the x's are NA, unless you convert
> them to factors with the NAs included.
>
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
>
> My real data looks more like this (except with a few more categories and
> records):
>
> set.seed(100)
> library(plyr)
> dat=data.frame(
>        x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
>        x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
>        x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
>        x4=sample(c(NA,T,F), 2e6, replace=TRUE),
>        x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
> replace=TRUE),
>        x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
>        x7=sample(c(NA,'married','divorced','separated','single','etc'),
> 2e6, replace=TRUE),
>        x8=sample(c(NA,T,F), 2e6, replace=TRUE),
>        y=trunc(rnorm(2e6)*10000), stringsAsFactors=F)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=T)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=T)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8), function(x)
>        {data.frame(y.sum=sum(x$y,na.rm=TRUE))})$y.sum)
>
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time learning
> how to use it.  Just now it worked for 1m records, and it was just a bit
> slower than aggregate.  But for the 2m example it hasn't finished
> calculating.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
