On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote:
> Looking at the timings by each stage may help:
>
>>   system.time(dt <- data.table(dat))
>   user  system elapsed
>   1.20    0.28    1.48
>>   system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))  # sort by the 8 columns (one-off)
>   user  system elapsed
>   4.72    0.94    5.67
>>   system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
>   user  system elapsed
>   2.00    0.21    2.20     # compared to 11.07s
>>
>
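[A minimal, self-contained sketch of the workflow timed above. The
original `dat` is defined earlier in the thread, so the synthetic data
below (column types, sizes, seed) is an assumption for illustration
only; the character vector passed to `by` names the same grouping
columns as the comma-separated string used above.

    library(data.table)
    set.seed(1)
    n <- 1e6
    # eight low-cardinality grouping columns plus a numeric value (assumed shape)
    dat <- as.data.frame(replicate(8, sample(letters, n, replace = TRUE)),
                         stringsAsFactors = FALSE)
    names(dat) <- paste0("x", 1:8)
    dat$y <- rnorm(n)

    dt <- data.table(dat)                       # same structure as dat: a list of vectors
    setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8)  # one-off sort by the 8 columns
    udt <- dt[, list(y = sum(y, na.rm = TRUE)),
              by = c("x1","x2","x3","x4","x5","x6","x7","x8")]
]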
> data.table doesn't have a custom data structure, so it can't be that.
> data.table's structure is the same as data.frame's, i.e. a list of
> vectors.  data.table inherits from data.frame.  It *is* a data.frame, too.
>
> The reasons it is faster in this example include:
> 1. Memory is only allocated for the largest group.
> 2. That memory is re-used for each group.
> 3. Since the data is ordered contiguously in RAM, the memory is copied
> over in bulk for each group using memcpy in C, which is faster than a
> for loop in C. Page fetches are expensive; they are minimised.
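[A small sketch of point 3's contiguity claim (an illustration only,
not data.table's actual C internals): after setkey, each group's rows
form one unbroken block, so the grouping code can lift a group out
with a single bulk copy rather than gathering scattered rows.

    library(data.table)
    dt <- data.table(g = sample(c("a", "b", "c"), 10, replace = TRUE),
                     y = 1:10)
    setkey(dt, g)
    # .I holds each group's row numbers in the keyed table; for every
    # group they form a consecutive run first:last, i.e. one block
    dt[, list(first = .I[1], last = .I[.N],
              contiguous = .N == .I[.N] - .I[1] + 1L), by = g]
]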

But this is exactly what I mean by a custom data structure: you're
not using the usual data frame API.

Wouldn't it be better to implement these changes in data frame so that
everyone can benefit? Or is it just too specialised to this particular
case (where I guess you're relying on the return data structure of the
summary function being consistent)?

Hadley


-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

