The user wrote in their first post : > I have a lot of observations in my dataset
Heres one way to do it with a data.table : a=data.table(a) ans = a[ , list(dt=dt[dt-min(dt)<7]) , by="var1,var2,var3"] class(ans$dt) = "Date" Timings are below comparing the 3 methods. In this example, data.table appears to be 28 times faster than plyr, and 24 times faster than sqldf. I excluded the one off time to build the key, since thats realistic, but even including that time, data.table is still 16 times faster than plyr (134 / (1.03+2.16+4.71)). With even more rows, it should be even bigger speedups. > a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = > c("c", "n", "s"), class = "factor"), var2 = c(1L, 1L, 1L, 2L, 2L, 2L), var3 = c(2L, 2L, 2L, 1L, 1L, 1L), dt = structure(c(10592, 10997, 11000, 10998, 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3", "dt"), row.names = c(NA, -6L), class = "data.frame") > > a = data.frame(lapply(a,function(x)rep(x,each=1000000))) > dim(a) [1] 6000000 4 > library(plyr) > system.time({ans1 <<- ddply(a, c("var1", "var2", "var3"), subset, dt - > min(dt) < 7)}) user system elapsed 131.39 3.11 134.80 > library(sqldf) > system.time({ans2 <<- sqldf("select var1, var2, var3, dt from a, (select > var1, var2, var3, min(dt) mindt from a group by var1, var2, var3) > using(var1, var2, var3) where dt - mindt < 7")}) user system elapsed 110.26 2.24 113.32 > mapply(identical,ans1,ans2[order(ans2$var1),]) var1 var2 var3 dt TRUE TRUE TRUE TRUE > > library(data.table) > system.time({adt<<-data.table(a)}) user system elapsed 0.90 0.13 1.03 > system.time({setkey(adt,var1,var2,var3)}) user system elapsed 1.89 0.27 2.16 > system.time({ans3 <<- > adt[,list(dt=dt[dt-min(dt)<7]),by="var1,var2,var3"]}) user system elapsed 3.92 0.78 4.71 > class(ans3$dt) = "Date" > mapply(identical,ans1,ans3) var1 var2 var3 dt TRUE TRUE TRUE TRUE Note that in the documentaton ?"[.data.table" where I say that 'by' is slow, I mean relative to how fast it could be. Its seems, in this specific example anyway, and with the code posted so far, to be significantly faster than sqldf and plyr. "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote in message news:971536df1001191350x3bd5d982j9879e05453760...@mail.gmail.com... > Using data frame, a, from the post below this is how it would be done > in SQL using sqldf. We join together the original table, a, with a > table of minimums (computed by the nested select) and then choose only > the rows where dt - mindt < 7 (in the where clause). > >> library(sqldf) >> sqldf("select var1, var2, var3, dt from a, (select var1, var2, var3, >> min(dt) mindt from a group by var1, var2, var3) using(var1, var2, var3) >> where dt - mindt < 7") > var1 var2 var3 dt > 1 s 1 2 1999-01-01 > 2 c 1 2 2000-02-10 > 3 c 1 2 2000-02-13 > 4 n 2 1 2000-02-11 > 5 n 2 1 2000-02-15 > > > On Tue, Jan 19, 2010 at 4:22 PM, hadley wickham <h.wick...@gmail.com> > wrote: >> On Mon, Jan 18, 2010 at 1:54 PM, Bert Gunter <gunter.ber...@gene.com> >> wrote: >>> One way to do it: >>> >>> 1. Convert your date column to the Date class using the as.Date() >>> function. >>> This allows you to do the necessary arithmetic on the dates below. >>> dt <- as.Date(a[,4],"%d/%m/%Y") >>> >>> 2. Create a factor out of your first three columns whose levels are in >>> the >>> same order as the unique rows. Something likes the following should do >>> it: >>> fac <- do.call(paste,a[,-4]) >>> fac <- factor(fac, levels=unique(fac)) >>> >>> This allows you to choose the groups of rows whose dates you wish to >>> compare >>> and maintain their correct order in the data frame >>> >>> 3. Then use tapply: >>> a[unlist(tapply(dt,fac,function(x)x-min(x) < 7)),] >>> >>> (unlist is needed to remove the list structure and concatenate the >>> logical >>> indices to obtain the subscripting vector). >> >> Here's the same basic approach with the plyr package: >> >> a <- structure(list(var1 = structure(c(3L, 1L, 1L, 2L, 2L, 2L), .Label = >> c("c", >> "n", "s"), class = "factor"), var2 = c(1, 1, 1, 2, 2, 2), var3 = c(2, >> 2, 2, 1, 1, 1), dt = structure(c(10592, 10997, 11000, 10998, >> 11002, 11010), class = "Date")), .Names = c("var1", "var2", "var3", >> "dt"), row.names = c(NA, -6L), class = "data.frame") >> >> library(plyr) >> ddply(a, c("var1", "var2", "var3"), subset, dt - min(dt) < 7) >> >> Hadley > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.