I have a data frame with about 10^6 rows; I want to group the data
according to entries in one of the columns and do something with it.
For instance, suppose I want to count the number of elements in
each group. I tried something like
aggregate(my.df$my.field, list(my.df$my.field), length), but it was very slow.
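A minimal sketch of the comparison, using a made-up data frame (the names my.df and my.field are taken from the question; the data are synthetic). table() computes the same group sizes as the aggregate() call, typically much faster:

```r
set.seed(1)
# Hypothetical stand-in for the poster's data frame
my.df <- data.frame(my.field = sample(letters[1:5], 1e5, replace = TRUE),
                    stringsAsFactors = FALSE)

# The aggregate() approach from the question -- correct, but relatively slow
counts.agg <- aggregate(my.df$my.field, list(my.df$my.field), length)

# The same counts via table(), usually far quicker
counts.tab <- table(my.df$my.field)

# Both sort groups alphabetically, so the counts line up
all(counts.agg$x == as.vector(counts.tab))
```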
table is reasonably fast. I have more than 4 x 10^6 records and a 2D
table takes very little time:

nUA <- with(TRdta, table(URwbc, URrbc))  # both URwbc and URrbc are factors
nUA
This does the same thing and took about 5 seconds just now:
xtabs( ~ URwbc + URrbc, data=TRdta)
On Sep 2:
Takes 0.6 seconds on my slow laptop:

n <- 1e6
x <- data.frame(a = sample(LETTERS, n, TRUE))
system.time(print(tapply(x$a, x$a, length)))
    A     B     C     D     E     F     G     H     I     J
38555 38349 38647 38271 38456 38352 38644 38679 38575 38730
[counts for K through Z truncated]
You may want to try using isplit (from the iterators package). Combined with
foreach, it's an efficient way of iterating through a data frame by groups
of rows defined by common values of a column (which I think is what you're
after). You can speed things up further if you have a multiprocessor machine.
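A small sketch of the isplit + foreach idiom, assuming the iterators and foreach packages are installed (the data frame and column names here are invented for illustration):

```r
library(iterators)
library(foreach)

# Toy data: a grouping column g and a value column v
df <- data.frame(g = rep(c("a", "b", "c"), times = c(3, 2, 5)),
                 v = 1:10)

# isplit(df$v, df$g) yields one chunk per group; each chunk is a list
# with $value (the subset of v) and $key (the group label).
res <- foreach(chunk = isplit(df$v, df$g), .combine = c) %do% {
  sum(chunk$value)
}
res  # per-group sums, in factor-level order: 6, 9, 40
```

With a parallel backend registered (e.g. via doParallel), swapping %do% for %dopar% distributes the groups across cores.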
Thanks everyone for the useful suggestions. The bottleneck might be
memory limitations of my machine (3.2 GHz, 2 GB) and the fact that I am
aggregating on a field that is a string. Using the suggested
as.data.frame(table(my.df$my.field)) I do get a speedup, but the
computation still takes about 30 seconds.
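For reference, a hedged sketch of the as.data.frame(table(...)) idiom on toy data (my.df and my.field are the names from the thread; the values are invented). It converts the fast table() counts into a two-column data frame:

```r
# Toy stand-in for the poster's string-valued field
my.df <- data.frame(my.field = c("x", "y", "x", "z", "x"))

# table() counts the groups; as.data.frame() flattens the result
counts <- as.data.frame(table(my.df$my.field))
names(counts) <- c("my.field", "n")  # default names are Var1, Freq
counts
#   my.field n
# 1        x 3
# 2        y 1
# 3        z 1
```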
Hi there,
I think 30 seconds is OK, because it is less than the time each of us
spent reading these messages :-) Just kidding...
Best,
milton
On Wed, Sep 2, 2009 at 8:01 PM, Leo Alekseyev <dnqu...@gmail.com> wrote:
> Thanks everyone for the useful suggestions. The bottleneck might be memory...