Hi:

The key phrase in your mail was 'data.table'. Given the size of the object,
it is very likely to be a data.table, which (oddly enough) comes from the
package data.table. It is designed to quickly process information in very
large datasets. 3M rows is an 'average'-sized data.table :)

What you're asking for isn't complicated - it appears that this function,
applied groupwise (group = ID), will do the job:

# ratio of consecutive values, scaled to 100; the first entry is fixed at 100
dret <- function(x) c(100.00, 100 * x[-1] / x[-length(x)])
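To make the behaviour concrete, here it is applied to the first four totret
values from your sample data (output rounded):

dret(c(100.00000, 99.72300, 99.95380, 99.67680))
# [1] 100.0000  99.7230 100.2314  99.7229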

This function can be applied groupwise in data.table or with ddply (package
plyr) without much difficulty. For simplicity, I'm going to assume that the
data are ordered in time within each ID. I'm also using 100 for the first
entry in each group - if you want, you can change the initial 100.00 to NA.
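For instance (dret_na is just an illustrative name for the variant):

# same calculation, but with NA marking the first entry of each group
dret_na <- function(x) c(NA_real_, 100 * x[-1] / x[-length(x)])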

Let's generate a little fake data:
id <- as.character(rep(c(427225, 290157, 394025, 382940), each = 1000))
times <- rep(seq(as.Date('2001-11-13'), by = 'days', length = 1000), 4)
totret <- c(rnorm(1000, 20, 0.1), rnorm(1000, 25, 0.1),
            rnorm(1000, 30, 0.1), rnorm(1000, 35, 0.1))

# data frame:
DF <- data.frame(id = id, times = times, totret = totret)
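A quick structural check on the fake data (the totret values themselves will
differ from run to run, since the rnorm() draws are random):

str(DF)   # four ids x 1000 days = 4000 rows: id, times (Date), totret (numeric)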

# data table:
library(data.table)
DR <- data.table(DF)

# data.table sets up id as the table's primary key - note that the storage
# mode of the key has to be integer.
tables()   # see what we've got

# set id as the table key, do the calculation by group, and tack the
# result onto DR
system.time({
  setkey(DR, id)
  DR2 <- DR[, dret(totret), by = id]
  DR$return <- DR2$V1
})
   user  system elapsed
      0       0       0
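A check worth doing (my addition, not part of the timing above): because
setkey() sorts DR by id and the grouped result DR2 comes back in that same
id order, the V1 column lines up row for row with DR:

# sanity-check the alignment of DR2 with the keyed table DR
nrow(DR2) == nrow(DR)      # TRUE - one output row per input row
identical(DR2$id, DR$id)   # TRUE - groups come back in keyed order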

library(plyr)
system.time(df2 <- ddply(DF, .(id), transform, return = dret(totret)))
   user  system elapsed
   0.03    0.00    0.05
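Both results end up sorted by id (setkey() orders DR, and ddply() sorts its
groups), so under that assumption the two return columns should match:

# cross-check the two approaches against each other
all.equal(DR$return, df2$return)   # TRUE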

The difference between the two is this: the data.table calculation returns a
data.table DR2 with the key and the returns, after which we add the column of
returns to the original data table DR. In contrast, the ddply calculation
tacks the column of returns onto the original data frame as a result of
transform. Notice that in the data.table code, we set the table key (often
the most time-consuming step, since it orders the data by the values in its
key), did the calculation and tacked the result onto the original table
almost instantaneously.
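As a third point of comparison (my addition - this one needs neither
package), base R's ave() can do the same groupwise transformation, though it
is usually slower on very large data:

# ave() splits totret by id, applies dret to each group, and
# returns the results in the original row order
DF$return <- ave(DF$totret, DF$id, FUN = dret)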

According to the data.table package author, the time savings from using
data.table scale upward as the size of the table increases - in other words,
the bigger the table, the faster data.table will be relative to the other
processing methods currently available in R. You can see that there is a
noticeable time difference even at n = 4000, so the difference at n = 3M
will be far more dramatic. Development work in plyr is narrowing the gap
between it and data.table, but both packages are in active development, so R
users can look forward to two very powerful packages for summarizing,
transforming and condensing data.

I would suggest that you read the vignette and FAQ from data.table (available
from the on-line data.table help page) and the documentation of plyr at its
author's web site:
http://had.co.nz/plyr/
There is a tutorial with slides and a full-scale document.

HTH,
Dennis


On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <jefferyd...@gmail.com> wrote:

>
> Hello Everyone,
>
> I just started a new job & it requires heavy use of R to analyze datasets.
>
> I have a data.table that looks like this. It is sorted by ID & Date; there
> are about 150 different IDs & the dataset spans 3 million rows. The main
> columns of concern are ID, date, and totret. What I need to do is to derive
> daily returns for each ID from totret, which is simply totret at time t+1
> divided by totret at time t.
>
>              X       id ticker      date_ adjClose    totret RankStk
> 427225   427225 00174410    AHS 2001-11-13    21.66 100.00000    1235
> 441910   441910 00174410    AHS 2001-11-14    21.60  99.72300    1235
> 458458   458458 00174410    AHS 2001-11-15    21.65  99.95380    1235
> 284003   284003 00174410    AHS 2001-11-16    21.59  99.67680    1235
>
> Two problems for me:
>
> 1) I can't just apply it to the entire column, since there will be problems
> at the boundary points where the ID changes from one to another. I need to
> find out how to specify a restriction on the name of the ID.
>
> 2) Coming from Java, I would instinctively use a loop to calculate daily
> returns, but I found out that R is very slow with loops, so I need to find
> an efficient way to calculate daily returns on such a huge dataset.
>
> Thanks a lot!
