Dear All, I need to sum a column from another dataframe based on the row values of one dataframe. I am stuck in a loop trying to accomplish it and at current speed it will take more than 80 hours to complete. Needless to say I am looking for a more elegant/quicker solution. Really need some help here. Here is the issue:
I have a dataframe CALL (the dput of head is given below) which has close to a million rows. There are 2 date columns which are of importance, DATE and EXPDATE. There is another dataframe, VOL (dput of head given), which has 2 columns, DATE and VOL. It has the volatility corresponding to each day and it has a total of 124 records (corresponding to 6 months). I want to add another column in the CALL dataframe which would contain the sum of all the volatilities from the VOL df for the period specified by the interval of DATE and EXPDATE in each row of CALL df. For ex: In the first row, DATE is '03-01-2011' and EXPDATE is '27-01-2011'. So I want the SUM column (A new column in CALL df) to contain the sum of volatilities of 03-01, 04-01, 05-01 .... till 27-01 from the VOL dataframe. I have to repeat this process for all the rows in the dataframe. Here is the for-loop version of the solution: for (k in 1:nrow(CALL)){ CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE >= CALL$DATE[k] & VOL$DATE <= CALL$EXPDATE[k])) } The loop will run for close to a million times, it has been running for more than 10 hours and its just 12% complete. It would take more than 80 hours to complete, not the mention the toll it would take on my laptop. So is there a better way that I can accomplish this task? Any input would be greatly appreciated. Below are the dput of the two dataframes. One point of note is that there are only 124 DISTINCT values of DATE and 6 DISTINCT values of EXPDATE, in case it can be used in some way. > dput(CALL) structure(list(NAME = c("STK", "STK", "STK", "STK", "STK", "STK"), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, 15029), class = "Date"), STRIKE = c(6300L, 6300L, 6300L, 6300L, 6300L, 6300L), TMSTMP = c("14:18:36", "15:23:42", "15:22:30", "15:24:13", "15:22:07", "15:22:27"), PRICE = c(107, 102.05, 101.3, 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, 14977), class = "Date"), DTTM = structure(c(1294044516, 1294048422, 1294048350, 1294048453, 1294048327, 1294048347), class = c("POSIXct", "POSIXt"), tzone = ""), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c("NAME", "EXPDATE", "STRIKE", "TMSTMP", "PRICE", "QUANT", "DATE", "DTTM", "TTE"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame") > dput(VOL) structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, 1294252200, 1294338600, 1294597800), class = c("POSIXct", "POSIXt" ), tzone = ""), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, 0.000216076713994619)), .Names = c("DATE", "VOL"), row.names = c(NA, 6L), class = "data.frame") Please do let me know if any more information from my side would help or if I need to explain the issue more clearly. Any minor improvement will be great help. Thanks in advance. -Shivam -- *Victoria Concordia Crescit* ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.