Tim,

If you just want the event.seq, you should be able to get really fast
results with:

unsplit(
  lapply(split(df$event.of.interest, df$subject),
         FUN = function(x) {
           cumsum(cumsum(x) > 0)
         }),
  df$subject)

This doesn't produce a matrix, but it's fast, and you can cbind the result
onto the data frame afterwards. That also turns the repetitive cbind calls
into a single one, which should greatly improve efficiency. You could
produce subject.seq similarly by changing the function body to 1:length(x),
but that would mean splitting the data up twice (although it may still be
faster than the previous options... let us know).
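
For example, the subject.seq variant would be something like this (an
untested sketch, reusing the names from your code):

df$subject.seq <- unsplit(
  lapply(split(df$subject, df$subject),
         FUN = function(x) 1:length(x)),
  df$subject)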

-Tariq



On 2/15/07, jim holtman <[EMAIL PROTECTED]> wrote:
>
> One concern I have with the 'tapply' approach is that it does not
> create the correct results if the data is not in sorted order.  See the
> example below:
>
> > # generate an unsorted set of data
> > set.seed(123)
> > x <- data.frame(a=sample(1:3,12,TRUE), b=sample(0:1, 12, TRUE))
> > x
>   a b
> 1  1 1
> 2  3 1
> 3  2 0
> 4  3 1
> 5  3 0
> 6  1 0
> 7  2 0
> 8  3 1
> 9  2 1
> 10 2 1
> 11 3 1
> 12 2 1
> > # using tapply to sequence with grouping by 'a'; this doesn't create the
> > # right results because it puts the data back in 'sorted' group order,
> > # but the data are unsorted
> > x$seq<-unlist(tapply(x$a, x$a, function(x)1:length(x)))
> > x
>   a b seq
> 1  1 1   1
> 2  3 1   2
> 3  2 0   1
> 4  3 1   2
> 5  3 0   3
> 6  1 0   4
> 7  2 0   5
> 8  3 1   1
> 9  2 1   2
> 10 2 1   3
> 11 3 1   4
> 12 2 1   5
> > # using 'by' to group and then sequence
> > do.call(rbind, by(x, x$a, function(z) {z$new.seq <- 1:nrow(z); z}))
>     a b seq new.seq
> 1.1  1 1   1       1
> 1.6  1 0   4       2
> 2.3  2 0   1       1
> 2.7  2 0   5       2
> 2.9  2 1   2       3
> 2.10 2 1   3       4
> 2.12 2 1   5       5
> 3.2  3 1   2       1
> 3.4  3 1   2       2
> 3.5  3 0   3       3
> 3.8  3 1   1       4
> 3.11 3 1   4       5
> >
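>
> If you want to keep the grouped-apply style but preserve the original
> row order, base R's 'ave' puts each group's result back into its
> original positions (a quick sketch, not timed):
>
> x$seq.ave <- ave(x$a, x$a, FUN = function(z) 1:length(z))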
>
>
>
> On 2/14/07, Tim Churches <[EMAIL PROTECTED]> wrote:
> >
> > jim holtman wrote:
> > > On 2/14/07, Tim Churches <[EMAIL PROTECTED]> wrote:
> > >> Any advice, tips, clues or pointers to resources on how best to
> > >> speed up or, better still, avoid the loops in the following example
> > >> code would be much appreciated. My actual dataset has several tens
> > >> of thousands of rows and lots of columns, and these loops take a
> > >> rather long time to run. Everything else I need to do is done using
> > >> vectors, and those parts all run very quickly indeed. I spent quite
> > >> a while searching r-help and re-reading the various manuals, but
> > >> couldn't find any existing relevant advice. I am sure the solution
> > >> is obvious, but it escapes me.
> > >>
> > >> Tim C
> > >>
> > >> # create an example data frame, multiple events per subject
> > >>
> > >> year <- c(1980,1982,1996,1985,1987,1990,1991,1992,1999,1972,1983)
> > >> event.of.interest <- c(F,T,T,F,F,F,T,F,T,T,F)
> > >> subject <- c(1,1,1,2,2,3,3,3,3,4,4)
> > >> df <- data.frame(cbind(subject,year,event.of.interest))
> > >>
> > >> # add a per-subject sequence number
> > >>
> > >> df$subject.seq <- 1
> > >> for (i in 2:nrow(df)) {
> > >>   if (df$subject[i-1] == df$subject[i])
> > >>     df$subject.seq[i] <- df$subject.seq[i-1] + 1
> > >> }
> > >> df
> > >
> > >> # add an event sequence number which is zero until the first
> > >> # event of interest for that subject happens, and then increments
> > >> # thereafter
> > >>
> > >> df$event.seq <- 0
> > >> for (i in 1:nrow(df)) {
> > >>   if (df$subject.seq[i] == 1) {
> > >>     current.event.seq <- 0
> > >>   }
> > >>   if (df$event.of.interest[i] == 1 | current.event.seq > 0)
> > >>     current.event.seq <- current.event.seq + 1
> > >>   df$event.seq[i] <- current.event.seq
> > >> }
> > >> df
> > >
> > >
> > >
> > > try:
> > >
> > >> df <- data.frame(cbind(subject,year,event.of.interest))
> > >> df <- do.call(rbind, by(df, df$subject, function(z){
> > > +     z$subject.seq <- seq(nrow(z)); z}))
> > >> df
> > >      subject year event.of.interest subject.seq
> > > 1.1        1 1980                 0           1
> > > 1.2        1 1982                 1           2
> > > 1.3        1 1996                 1           3
> > > 2.4        2 1985                 0           1
> > > 2.5        2 1987                 0           2
> > > 3.6        3 1990                 0           1
> > > 3.7        3 1991                 1           2
> > > 3.8        3 1992                 0           3
> > > 3.9        3 1999                 1           4
> > > 4.10       4 1972                 1           1
> > > 4.11       4 1983                 0           2
> > >> # determine first event
> > >> df <- do.call(rbind, by(df, df$subject, function(x){
> > > +     # determine first event
> > > +     .first <- cumsum(x$event.of.interest)
> > > +     # create sequence after first non-zero
> > > +     .first <- cumsum(.first > 0)
> > > +     x$event.seq <- .first
> > > +     x
> > > + }))
> > >> df
> > >        subject year event.of.interest subject.seq event.seq
> > > 1.1.1        1 1980                 0           1         0
> > > 1.1.2        1 1982                 1           2         1
> > > 1.1.3        1 1996                 1           3         2
> > > 2.2.4        2 1985                 0           1         0
> > > 2.2.5        2 1987                 0           2         0
> > > 3.3.6        3 1990                 0           1         0
> > > 3.3.7        3 1991                 1           2         1
> > > 3.3.8        3 1992                 0           3         2
> > > 3.3.9        3 1999                 1           4         3
> > > 4.4.10       4 1972                 1           1         1
> > > 4.4.11       4 1983                 0           2         2
> >
> > Thanks Jim, that works a treat, over an order of magnitude faster than
> > the for-loops.
> >
> > Anders Nielsen also provided this solution:
> >
> > df$subject.seq <- unlist(tapply(df$subject,
> >                                 df$subject,
> >                                 function(x) 1:length(x)))
> >
> > Doing it that way is about 5 times faster than using rbind(). But Jim's
> > use of cumsum on the logical vector is very nifty.
> >
> > I have now combined Jim's function with Anders' column-oriented approach
> > and the result is that my code now runs about two orders of magnitude
> > faster.
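> >
> > In outline, the combination is along these lines (a sketch; it relies
> > on the data frame being sorted by subject):
> >
> > df$event.seq <- unlist(tapply(df$event.of.interest,
> >                               df$subject,
> >                               function(x) cumsum(cumsum(x) > 0)))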
> >
> > Many thanks,
> >
> > Tim C
> >
> >
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?
