[R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread Leo Alekseyev
I have a data frame with about 10^6 rows. I want to group the data according to the entries in one of the columns and do something with it; for instance, count the number of elements in each group. I tried something like

aggregate(my.df$my.field, list(my.df$my.field), length)

but it was very slow.
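For reference, a self-contained sketch of the approach from the question (the data are synthetic; my.df and my.field stand in for the poster's real names):

n <- 1e6
my.df <- data.frame(my.field = sample(letters, n, replace = TRUE))

# The call from the question: group on the column and count each group.
counts <- aggregate(my.df$my.field, list(my.df$my.field), length)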

Re: [R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread David Winsemius
table is reasonably fast. I have more than 4 x 10^6 records, and a 2-D table takes very little time:

nUA <- with(TRdta, table(URwbc, URrbc))  # both URwbc and URrbc are factors
nUA

This does the same thing and took about 5 seconds just now:

xtabs(~ URwbc + URrbc, data=TRdta)
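A self-contained version of the same idea, with made-up factors standing in for TRdta's URwbc and URrbc (a sketch, not David's actual data):

set.seed(1)
d <- data.frame(f1 = factor(sample(letters[1:5], 4e6, replace = TRUE)),
                f2 = factor(sample(LETTERS[1:5], 4e6, replace = TRUE)))
with(d, table(f1, f2))       # 2-D contingency table of the two factors
xtabs(~ f1 + f2, data = d)   # same table via the formula interface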

Re: [R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread jim holtman
Takes 0.6 seconds on my slow laptop:

n <- 1e6
x <- data.frame(a=sample(LETTERS, n, TRUE))
system.time(print(tapply(x$a, x$a, length)))

    A     B     C     D     E     F     G     H     I     J ...
38555 38349 38647 38271 38456 38352 38644 38679 38575 38730 ...

Re: [R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread David M Smith
You may want to try using isplit (from the iterators package). Combined with foreach, it's an efficient way of iterating through a data frame by groups of rows defined by common values of a column (which I think is what you're after). You can speed things up further if you have a multiprocessor machine.
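A minimal sketch of what that might look like (assumed usage of iterators::isplit with foreach; the data and names here are invented, not from the thread):

library(iterators)
library(foreach)

grp <- factor(sample(LETTERS, 1e6, replace = TRUE))

# isplit yields one chunk per group; each chunk is a list with $value
# (the elements of that group) and $key (the group label).
counts <- foreach(chunk = isplit(grp, grp), .combine = c) %do%
  setNames(length(chunk$value), chunk$key[[1]])

# With a parallel backend registered, %dopar% in place of %do% would
# spread the groups across processors.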

Re: [R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread Leo Alekseyev
Thanks everyone for the useful suggestions. The bottleneck might be the memory limitations of my machine (3.2 GHz, 2 GB) and the fact that I am aggregating on a field that is a string. Using the suggested as.data.frame(table(my.df$my.field)) I do get a speedup, but the computation still takes 30 seconds.
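One possible mitigation worth sketching here (an assumption, not something tested in the thread): convert the string field to a factor once, up front, so table() does not have to repeat the character-to-factor conversion each time the field is tabulated:

my.df$my.field <- as.factor(my.df$my.field)   # one-time conversion
counts <- as.data.frame(table(my.df$my.field))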

Re: [R] Grouping data in a data frame: is there an efficient way to do it?

2009-09-02 Thread milton ruser
Hi there, I think the option of 30 seconds is OK, because it is less than the time each of us spent reading the messages :-) Just kidding... best, milton