I have two large data frames with the following structure:

> df1
  id       date test1.result
1  a 2009-08-28            1
2  a 2009-09-16            1
3  b 2008-08-06            0
4  c 2012-02-02            1
5  c 2010-08-03            1
6  c 2012-08-02            0

> df2
  id       date test2.result
1  a 2011-02-03            1
2  b 2011-09-27            0
3  b 2011-09-01            1
4  c 2009-07-16            0
5  c 2009-04-15            0
6  c 2010-08-10            1

I need to match items in df2 to those in df1 using specific matching criteria. I have written a looped matching algorithm that works, but it is very slow on my large datasets. I am requesting help with writing a faster, "vectorized" version of this code.

My algorithm currently looks something like the code below. It works, but it is damn slow.

findTestPairs <- function(test1, id1, date1, test2, id2, date2, predays=-30, lagdays=30){
  # Function to find, within subjects, two tests that occur within a timeframe
  #
  # test1 = the reference test result for which matching second tests are sought
  # test2 = the second test result
  # date1 = the date of test1
  # date2 = the date of test2
  # id1 = unique identifier for subject undergoing test 1
  # id2 = unique identifier for subject undergoing test 2
  # predays = maximum number of days prior to test1 date that test2 date might occur
  #           (given as a negative number, e.g. -30)
  # lagdays = maximum number of days after test1 date that test2 date might occur

  result <- data.frame(matrix(ncol=5, nrow=length(test1)))
  colnames(result) <- c('id','test1','date','test2count','test2lag.result')
  result$id <- id1
  result$test1 <- test1
  result$date <- date1

  for(i in seq_along(test1)){
    l <- 0    # Counter of test2 results that match test1 within lag interval
    m <- NA   # Indicator of positive test2 within lag interval
    for(j in seq_along(test2)){
      if(id1[i] == id2[j]){                               # STEP1: Match IDs
        interval <- as.numeric(date2[j] - date1[i])       # Difference in days
        if(interval >= predays && interval <= lagdays){   # STEP2: Does test2 fall within lag interval?
          l <- l+1                                        # If test2 within lag interval, count it
          if(test2[j] == 1) {                             # STEP3: Is test2 positive?
            m <- 1                                        # If test2 is positive, set indicator to 1
          } else if(is.na(m)) {
            m <- 0                                        # Negative test2; do not overwrite an earlier positive
          }
        }
      }
    }
    result$test2count[i] <- l
    result$test2lag.result[i] <- m
  }
  return(result)
}

I would appreciate help on building a faster matching algorithm.
I am fairly certain that existing R functions can do this, but I do not have a good grasp of how to make it work.

Brant Inman

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
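
For reference, one possible way to replace the double loop is sketched below in base R. It is not a definitive answer. The idea is to merge the two data frames on id, so that every within-subject pair of tests becomes one row, then filter on the day difference and summarise per test1 row. The function name findTestPairsMerge and the helper columns row and date2 are made up for this illustration. The sketch assumes both date columns are of class Date, and it takes the indicator to mean "1 if any test2 in the window is positive, 0 if all are negative, NA if there is none", which is what the comment on m describes.

```r
# Example data copied from the post (dates converted to class Date)
df1 <- data.frame(
  id = c("a", "a", "b", "c", "c", "c"),
  date = as.Date(c("2009-08-28", "2009-09-16", "2008-08-06",
                   "2012-02-02", "2010-08-03", "2012-08-02")),
  test1.result = c(1, 1, 0, 1, 1, 0)
)
df2 <- data.frame(
  id = c("a", "b", "b", "c", "c", "c"),
  date = as.Date(c("2011-02-03", "2011-09-27", "2011-09-01",
                   "2009-07-16", "2009-04-15", "2010-08-10")),
  test2.result = c(1, 0, 1, 0, 0, 1)
)

# Hypothetical helper: merge-based version of the looped matching
findTestPairsMerge <- function(df1, df2, predays = -30, lagdays = 30) {
  df1$row <- seq_len(nrow(df1))                  # remember the original test1 row
  names(df2)[names(df2) == "date"] <- "date2"    # avoid a name clash in merge()

  # One row per (test1, test2) pair that shares an id
  pairs <- merge(df1, df2, by = "id")

  # Keep only the pairs whose test2 date falls inside the window
  gap <- as.numeric(pairs$date2 - pairs$date)    # difference in days
  pairs <- pairs[gap >= predays & gap <= lagdays, , drop = FALSE]

  # Start from "no match" for every test1 row
  result <- data.frame(
    id = df1$id,
    test1 = df1$test1.result,
    date = df1$date,
    test2count = rep(0L, nrow(df1)),
    test2lag.result = rep(NA_real_, nrow(df1))
  )

  if (nrow(pairs) > 0) {
    counts <- tapply(pairs$test2.result, pairs$row, length)  # matches per test1 row
    anypos <- tapply(pairs$test2.result, pairs$row, max)     # 1 if any match is positive
    idx <- as.integer(names(counts))
    result$test2count[idx] <- as.integer(counts)
    result$test2lag.result[idx] <- as.numeric(anypos)
  }
  result
}

res <- findTestPairsMerge(df1, df2)
print(res)
```

On the example data only the fifth row of df1 (id c, 2010-08-03) has a test2 inside the default window, so that row gets test2count 1 and a positive indicator, and every other row gets a count of 0 and NA. Note that merge() materialises every within-id pair before filtering, so memory use grows with the number of tests per subject; for very large data a join that filters on the date window while matching may be a better fit.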