> There's definitely something amiss with aggregate() here since similar
> functions from other packages can reproduce your 'control' sum. I expect
> ddply() will have some timing issues because of all the subgrouping in your
> data frame, but data.table did very well and the summaryBy() function in the
> doBy package did OK:

Well, if you use the right plyr function, it works just fine:

system.time(count(dat, c("x1", "x2", "x3", "x4", "x5", "x6",
"x7", "x8"), "y"))
#   user  system elapsed
#  9.754   1.314  11.073
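(The `dat` object comes from earlier in the thread; a minimal
stand-in, assuming eight categorical grouping columns and a numeric y
to sum within groups, might look like this:

# Hypothetical stand-in for the thread's `dat`: eight grouping
# columns x1..x8 and a numeric column y.
set.seed(1)
n <- 1e6
dat <- data.frame(replicate(8, sample(letters[1:4], n, replace = TRUE)),
                  stringsAsFactors = TRUE)
names(dat) <- paste0("x", 1:8)
dat$y <- rnorm(n)

Timings will of course depend on the number of rows and distinct
groups in the real data.)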

This illustrates something that I've believed for a while about
data.table - it's not the indexing that speeds things up, it's the
custom data structure.  If you use ddply() with data frames, it's slow
because data frames are slow.  I think the right way to resolve this
is to make data frames more efficient, perhaps using some kind of
mutable interface where necessary for high-performance operations.
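For comparison, here is a rough sketch of the equivalent data.table
aggregation (assuming a version of data.table that accepts a
character vector for `by`); note that no key is set, so any speedup
comes from the data structure rather than indexing:

library(data.table)
dt <- as.data.table(dat)                 # copy into data.table's structure
# Grouped sum of y over x1..x8, without keying the table first
system.time(dt[, sum(y), by = paste0("x", 1:8)])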

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
