Hi: The key phrase in your mail was 'data.table'. Given the size of the object, it is very likely to be a data.table, which (oddly enough) comes from package data.table. It is designed to quickly process information in very large datasets. 3M rows is an 'average' sized data.table :)
Your request isn't very sophisticated - it appears this function should work groupwise (group = ID):

    dret <- function(x) c(100.00, 100 * x[-1]/x[-length(x)])

This function can be processed in data.table or ddply (package plyr) groupwise without much difficulty. I'm going to assume that the data are ordered in time for simplicity. I'm also using 100 for the first entry in the function - if you want, you can change the initial 100.00 to NA.

Let's generate a little fake data:

    id <- as.character(rep(c(427225, 290157, 394025, 382940), each = 1000))
    times <- rep(seq(as.Date('2001-11-13'), by = 'days', length = 1000), 4)
    totret <- c(rnorm(1000, 20, 0.1), rnorm(1000, 25, 0.1),
                rnorm(1000, 30, 0.1), rnorm(1000, 35, 0.1))

    # data frame:
    DF <- data.frame(id = id, times = times, totret = totret)

    # data table:
    library(data.table)
    DR <- data.table(DF)
    # data.table sets up id as the table's primary key - note that the storage mode
    # of the key has to be integer.
    tables()   # see what we've got

    # set id as the table key, do the calculation by group and tack the result onto DR
    system.time({
        setkey(DR, id)
        DR2 <- DR[, dret(totret), by = id]
        DR$return <- DR2$V1
    })
       user  system elapsed
          0       0       0

    library(plyr)
    system.time(df2 <- ddply(DF, .(id), transform, return = dret(totret)))
       user  system elapsed
       0.03    0.00    0.05

The difference between the two is this. The data.table calculation returns a data.table DR2 with the key and the returns, after which we add the column of returns to the original data table DR. In contrast, the ddply calculation tacks on the column of returns to the original data frame as a result of transform. Notice that in the data.table code, we set the table key (which is often the most time consuming task, since it orders the data by the values in its key), did the calculation and tacked the result onto the original table almost instantaneously.
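As a quick sanity check on the boundary problem raised in the question, here is a minimal base R sketch. It uses ave() rather than data.table or plyr, and the toy data frame and its values are made up purely for illustration. It shows that applying dret to the whole column gives a spurious value where the id changes, while the groupwise version restarts at 100 for each id.

```r
# dret as defined above: 100 for the first entry, then 100 * x[t] / x[t-1]
dret <- function(x) c(100.00, 100 * x[-1] / x[-length(x)])

# Hypothetical toy data: two ids, three time-ordered observations each
toy <- data.frame(
  id     = rep(c("A", "B"), each = 3),
  totret = c(100, 110, 121, 50, 25, 50)
)

# Naive version: ignores id, so row 4 divides B's first value by A's last value
naive <- dret(toy$totret)

# Groupwise version: ave() applies dret within each id and returns the
# results in the original row order
toy$ret <- ave(toy$totret, toy$id, FUN = dret)

print(cbind(toy, naive = naive))
```

Row 4 is the first row of id B: the naive column holds 100 * 50 / 121 there, whereas the groupwise column holds 100. The same check can be run on the output of the data.table or ddply calls above.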
According to the data.table package author, the time savings from using data.table scale upward as the size of the table increases - in other words, the bigger the table, the faster data.table will be relative to other processing methods currently available in R. You can see that there is a noticeable time difference at n = 4000, so the difference at n = 3M will be more dramatic. Development work in plyr is showing that the gap between it and data.table is narrowing, but both packages are in active development, so R users can look forward to two very powerful packages for summarizing, transforming and condensing data.

I would suggest that you read the vignette and FAQ from data.table (available from the on-line data.table help page) and the documentation of plyr at its author's web site: http://had.co.nz/plyr/
There is a tutorial with slides and a full-scale document.

HTH,
Dennis

On Thu, Jun 3, 2010 at 8:04 PM, Jeff08 <jefferyd...@gmail.com> wrote:
>
> Hello Everyone,
>
> I just started a new job & it requires heavy use of R to analyze datasets.
>
> I have a data.table that looks like this. It is sorted by ID & Date, there
> are about 150 different IDs & the dataset spans 3 million rows. The main
> columns of concern are ID, date, and totret. What I need to do is to derive
> daily returns for each ID from totret, which is simply totret at time t+1
> divided by totret at time t.
>
>             X       id ticker      date_ adjClose    totret RankStk
> 427225 427225 00174410    AHS 2001-11-13    21.66 100.00000    1235
> 441910 441910 00174410    AHS 2001-11-14    21.60  99.72300    1235
> 458458 458458 00174410    AHS 2001-11-15    21.65  99.95380    1235
> 284003 284003 00174410    AHS 2001-11-16    21.59  99.67680    1235
>
> Two problems for me:
>
> 1)I can't just apply it to the entire column since there will be problems at
> the boundary points where the ID changes from 1 to another.
> I need to find out how to specify a restriction on the name of the ID
>
> 2) From Java, instinctively I would use a loop to calculate daily returns,
> but I found out that R is very slow with loops, so I need to find an
> efficient way to calculate daily returns with such a huge dataset.
>
> Thanks a lot!
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/R-Newbie-please-help-tp2242633p2242633.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.