> -----Original Message-----
> From: hadley wickham [mailto:h.wick...@gmail.com]
> Sent: Sunday, January 04, 2009 8:56 PM
> To: William Dunlap
> Cc: gallon...@gmail.com; R help
> Subject: Re: [R] the first and last observation for each subject
>
> >> library(plyr)
> >>
> >> # ddply is for splitting up data frames and combining the results
> >> # into a data frame.  .(ID) says to split up the data frame by the
> >> # subject variable
> >> ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
> >> ...
> >
> > The above is much quicker than the versions based on aggregate and
>
> plyr does make some optimisations to increase speed and decrease
> memory usage (mainly by passing around lists of indices, rather than
> lists of the original objects) but it's unlikely ever to approach the
> speed of a pure vector approach (although I hope to put some time into
> rewriting the slow parts in C to do better with performance).
>
> > easy to understand.  Another approach is more specialized but useful
> > when you have lots of ID's (e.g., millions) and speed is very important.
> > It computes where the first and last entries for each ID are in a
> > vectorized computation, akin to the computation that rle() uses:
>
> I particularly like this solution to the problem - it's a very handy
> technique, and while it takes a while to get your head around how it
> works, it's worthwhile spending the time to do so because it crops up
> as a useful solution to many similar types of problems.  (It can be
> particularly useful in excel too, as a quick way of locating
> boundaries between groups)
>
> Hadley
>
> --
> http://had.co.nz/
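
For reference, the vectorized first/last computation discussed above can be sketched roughly as follows. This is only a minimal sketch and not necessarily the code from the earlier message: it assumes a data frame DF with columns ID and y (the names used in the ddply call above) and at least one row. The helper name lastMinusFirst and the toy data are made up for illustration. It sorts by ID first so that each subject's rows are contiguous.

```r
# Hypothetical helper: last y minus first y for each ID, with no per-group loop.
# Assumes DF has columns ID and y and at least one row.
lastMinusFirst <- function(DF) {
   # order() is stable, so rows keep their original order within each ID
   DF <- DF[order(DF$ID), , drop = FALSE]
   n <- nrow(DF)
   # TRUE wherever the ID differs from the ID on the previous row
   changes <- DF$ID[-1] != DF$ID[-n]
   first <- which(c(TRUE, changes))   # row index of each subject's first row
   last  <- which(c(changes, TRUE))   # row index of each subject's last row
   data.frame(ID = DF$ID[first], diff = DF$y[last] - DF$y[first])
}

# toy data, made up for illustration
DF <- data.frame(ID = c(1, 1, 1, 2, 2, 3), y = c(10, 12, 15, 7, 4, 9))
res <- lastMinusFirst(DF)
res
```

The only comparison is between each ID and its neighbour, so the cost after sorting is a handful of whole-vector operations regardless of how many distinct IDs there are.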

Another application of that technique is computing medians by group quickly:

gm <- function(x, group){ # medians by group: sapply(split(x,group),median)
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
}

For a 10^5 long x and somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

> group<-sample(1:30000, size=100000, replace=TRUE)
> x<-rnorm(length(group))*10 + group
> unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
   2.72    0.00    3.20
> unix.time(z1<-gm(x,group))
   user  system elapsed
   0.12    0.00    0.16
> identical(z1,z0)
[1] TRUE

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
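
P.S. A note on why the floor/ceiling indexing in gm() gives the median: after sorting by (group, x), each group occupies positions first..last, and (first+last)/2 is its middle position. For an odd-sized group that is a whole number, so floor and ceiling pick the same element. For an even-sized group it ends in .5, so they pick the two middle elements, which are then averaged. Here is a small illustration with made-up data, writing out the same steps as gm() inline:

```r
# toy data, made up for illustration: group "a" has 3 values, group "b" has 4
x     <- c(5, 1, 3, 8, 2, 6, 4)
group <- c("a", "a", "a", "b", "b", "b", "b")

o  <- order(group, x)
xs <- x[o]                           # 1 3 5 2 4 6 8
gs <- group[o]
changes <- gs[-1] != gs[-length(gs)]
first <- which(c(TRUE, changes))     # 1 4
last  <- which(c(changes, TRUE))     # 3 7
mid   <- (first + last) / 2          # 2.0 5.5
med   <- (xs[floor(mid)] + xs[ceiling(mid)]) / 2
names(med) <- gs[first]
med                                  # a = 3, b = 5
```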