Tim,

If you just want the event.seq, you should be able to get really fast
results with:

  unsplit(lapply(split(df$event.of.interest, df$subject),
                 FUN = function(x) { cumsum(cumsum(x) > 0) }),
          df$subject)

It doesn't produce a matrix, but it's fast, and you could cbind the
result onto df afterwards. That also turns the repetitive cbind calls
into a single one, which should greatly increase efficiency. You could
produce the subject.seq similarly by changing the function to
function(x) 1:length(x), but that would mean splitting the data up
twice (although it may still be faster than the previous options;
let us know).
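For concreteness, here's what that looks like with the toy data from
your first post (a sketch; I've only checked it on this small case):

  # the example data frame from Tim's original message
  year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
  event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
  subject <- c(1,1,1,2,2,3,3,3,3,4,4)
  df <- data.frame(subject, year, event.of.interest)

  # event.seq: zero until a subject's first event, incrementing after;
  # cumsum() is happy to run over the logical column directly
  df$event.seq <- unsplit(lapply(split(df$event.of.interest, df$subject),
                                 function(x) cumsum(cumsum(x) > 0)),
                          df$subject)

  # subject.seq: within-subject row counter, same split/unsplit trick
  df$subject.seq <- unsplit(lapply(split(df$event.of.interest, df$subject),
                                   function(x) 1:length(x)),
                            df$subject)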
-Tariq

On 2/15/07, jim holtman <[EMAIL PROTECTED]> wrote:
>
> One concern that I have with the 'tapply' approach is that it does not
> create the correct results if the data is not in sorted order. See the
> example below:
>
> > # generate an unsorted set of data
> > set.seed(123)
> > x <- data.frame(a=sample(1:3,12,TRUE), b=sample(0:1, 12, TRUE))
> > x
>    a b
> 1  1 1
> 2  3 1
> 3  2 0
> 4  3 1
> 5  3 0
> 6  1 0
> 7  2 0
> 8  3 1
> 9  2 1
> 10 2 1
> 11 3 1
> 12 2 1
> > # using tapply to sequence, grouping by 'a'; this doesn't create the
> > # right results, because it puts the values back in sorted group
> > # order while the data are unsorted
> > x$seq <- unlist(tapply(x$a, x$a, function(x) 1:length(x)))
> > x
>    a b seq
> 1  1 1   1
> 2  3 1   2
> 3  2 0   1
> 4  3 1   2
> 5  3 0   3
> 6  1 0   4
> 7  2 0   5
> 8  3 1   1
> 9  2 1   2
> 10 2 1   3
> 11 3 1   4
> 12 2 1   5
> > # using 'by' to group and then sequence
> > do.call(rbind, by(x, x$a, function(z) {z$new.seq <- 1:nrow(z); z}))
>      a b seq new.seq
> 1.1  1 1   1       1
> 1.6  1 0   4       2
> 2.3  2 0   1       1
> 2.7  2 0   5       2
> 2.9  2 1   2       3
> 2.10 2 1   3       4
> 2.12 2 1   5       5
> 3.2  3 1   2       1
> 3.4  3 1   2       2
> 3.5  3 0   3       3
> 3.8  3 1   1       4
> 3.11 3 1   4       5
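(Stepping in on Jim's example above: if you need to keep the original
row order, base R's ave() should sidestep this pitfall, since it
applies FUN within each group but returns the results aligned with the
rows they came from. A sketch using Jim's unsorted data; I haven't
timed it against the alternatives:

  set.seed(123)
  x <- data.frame(a = sample(1:3, 12, TRUE), b = sample(0:1, 12, TRUE))
  # ave() computes FUN within each level of 'a' and writes the values
  # back into the original row positions, so no implicit sorting occurs
  x$seq <- ave(x$a, x$a, FUN = function(z) 1:length(z))

That should reproduce the per-group sequence from Jim's by() version,
but without reordering the rows.)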
>
> On 2/14/07, Tim Churches <[EMAIL PROTECTED]> wrote:
> >
> > jim holtman wrote:
> > > On 2/14/07, Tim Churches <[EMAIL PROTECTED]> wrote:
> > >> Any advice, tips, clues or pointers to resources on how best to
> > >> speed up or, better still, avoid the loops in the following
> > >> example code would be much appreciated. My actual dataset has
> > >> several tens of thousands of rows and lots of columns, and these
> > >> loops take a rather long time to run. Everything else I need to
> > >> do is done using vectors, and those parts all run very quickly
> > >> indeed. I spent quite a while searching r-help and re-reading the
> > >> various manuals, but couldn't find any existing relevant advice.
> > >> I am sure the solution is obvious, but it escapes me.
> > >>
> > >> Tim C
> > >>
> > >> # create an example data frame, multiple events per subject
> > >>
> > >> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
> > >> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
> > >> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
> > >> df <- data.frame(cbind(subject,year,event.of.interest))
> > >>
> > >> # add a per-subject sequence number
> > >>
> > >> df$subject.seq <- 1
> > >> for (i in 2:nrow(df)) {
> > >>   if (df$subject[i-1] == df$subject[i])
> > >>     df$subject.seq[i] <- df$subject.seq[i-1] + 1
> > >> }
> > >> df
> > >>
> > >> # add an event sequence number which is zero until the first
> > >> # event of interest for that subject happens, and then increments
> > >> # thereafter
> > >>
> > >> df$event.seq <- 0
> > >> for (i in 1:nrow(df)) {
> > >>   if (df$subject.seq[i] == 1) {
> > >>     current.event.seq <- 0
> > >>   }
> > >>   if (event.of.interest[i] == 1 | current.event.seq > 0)
> > >>     current.event.seq <- current.event.seq + 1
> > >>   df$event.seq[i] <- current.event.seq
> > >> }
> > >> df
> > >
> > > try:
> > >
> > > > df <- data.frame(cbind(subject, year, event.of.interest))
> > > > df <- do.call(rbind, by(df, df$subject,
> > > +   function(z) {z$subject.seq <- seq(nrow(z)); z}))
> > > > df
> > >      subject year event.of.interest subject.seq
> > > 1.1        1 1980                 0           1
> > > 1.2        1 1982                 1           2
> > > 1.3        1 1996                 1           3
> > > 2.4        2 1985                 0           1
> > > 2.5        2 1987                 0           2
> > > 3.6        3 1990                 0           1
> > > 3.7        3 1991                 1           2
> > > 3.8        3 1992                 0           3
> > > 3.9        3 1999                 1           4
> > > 4.10       4 1972                 1           1
> > > 4.11       4 1983                 0           2
> > > > # determine first event
> > > > df <- do.call(rbind, by(df, df$subject, function(x) {
> > > +   # determine first event
> > > +   .first <- cumsum(x$event.of.interest)
> > > +   # create sequence after first non-zero
> > > +   .first <- cumsum(.first > 0)
> > > +   x$event.seq <- .first
> > > +   x
> > > + }))
> > > > df
> > >        subject year event.of.interest subject.seq event.seq
> > > 1.1.1        1 1980                 0           1         0
> > > 1.1.2        1 1982                 1           2         1
> > > 1.1.3        1 1996                 1           3         2
> > > 2.2.4        2 1985                 0           1         0
> > > 2.2.5        2 1987                 0           2         0
> > > 3.3.6        3 1990                 0           1         0
> > > 3.3.7        3 1991                 1           2         1
> > > 3.3.8        3 1992                 0           3         2
> > > 3.3.9        3 1999                 1           4         3
> > > 4.4.10       4 1972                 1           1         1
> > > 4.4.11       4 1983                 0           2         2
> >
> > Thanks Jim, that works a treat: over an order of magnitude faster
> > than the for-loops.
> >
> > Anders Nielsen also provided this solution:
> >
> > df$subject.seq <- unlist(tapply(df$subject,
> >                                 df$subject,
> >                                 function(x) 1:length(x)))
> >
> > Doing it that way is about 5 times faster than using rbind(). But
> > Jim's use of cumsum on the logical vector is very nifty.
> >
> > I have now combined Jim's function with Anders' column-oriented
> > approach, and the result is that my code now runs about two orders
> > of magnitude faster.
> >
> > Many thanks,
> >
> > Tim C
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
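P.S. If anyone wants to compare timings on something closer to Tim's
real data before committing to one approach, a rough sketch along these
lines should do it (the data and sizes below are made up):

  set.seed(1)
  n <- 50000
  big <- data.frame(subject = sort(sample(1:5000, n, replace = TRUE)),
                    event.of.interest = sample(c(TRUE, FALSE), n,
                                               replace = TRUE))

  # split/unsplit on a single column, as suggested above
  system.time(
    seq1 <- unsplit(lapply(split(big$event.of.interest, big$subject),
                           function(x) cumsum(cumsum(x) > 0)),
                    big$subject)
  )

  # by() over whole data-frame chunks, then a single rbind
  system.time(
    df2 <- do.call(rbind, by(big, big$subject, function(z) {
      z$event.seq <- cumsum(cumsum(z$event.of.interest) > 0)
      z
    }))
  )

Both should produce the same event sequence on sorted data; the by()
version also copies whole data-frame chunks around, which is where I'd
expect most of the extra time to go.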