On Fri, Apr 22, 2011 at 11:27 AM, Gabor Grothendieck <ggrothendi...@gmail.com> wrote: > Since this involves time series within each id group I thought it > would be interesting to see if this could be formulated using > zoo series. The approach is to read it in, and convert it to a long > form by just stacking the start and stop times in a data frame > and converting that to zoo using the concatenation of date and id as > the time index. It makes use of the fact that > - zoo can use non-standard "times" (here the concatenation of date and id) > - read.zoo can split and aggregate and > - the resulting object can be directly cumsum'd using cumsum.zoo . > > Finally we convert it back to the required form. > > This was mostly done by looking at the output rather than trying to > follow the SAS code so some differences are possible. It does > seem to give the same output in this case. > > Yet another approach might be to use model.matrix to create the > NRTI/NNRTI/PI columns: > model.matrix(~ drug - 1, long2) * long2$change > but below we stick with zoo. > > Some of the code here could be transplanted into Ista's solution > (which uses the reshape and plyr packages) to get a combination > approach. > > Lines <- "1004 NRTI 07/24/95 01/05/99 > 1004 NRTI 11/20/95 12/10/95 > 1004 NRTI 01/10/96 01/05/99 > 1004 PI 05/09/96 11/16/97 > 1004 NRTI 06/01/96 02/01/97 > 1004 NRTI 07/01/96 03/01/97 > 9999 PI 01/02/03 . > 9999 NNRTI 04/05/06 07/08/09" > > library(zoo) > > # need na.fill from development version of zoo > source("http://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/zoo/R/na.fill.R?revision=831&root=zoo") > > # read in data and reshape to long form > > DF <- read.table(textConnection(Lines), as.is = TRUE, na.strings = ".", > col.names = c("id", "drug", "date", "date"), check.names = FALSE) > long <- rbind(cbind(DF[-4], change = 1), cbind(DF[-3], change = -1)) > > # convert to zoo. date.id, the concatenation of date & id, becomes index. > # Then take cumulative sums of columns and append regimen. > > long2 <- with(long, > data.frame(date.id = paste(as.Date(date, "%m/%d/%y"), id), drug, > change)) > z <- read.zoo(long2, index = 1, split = "drug", FUN = identity, aggregate = > sum) > > z <- cumsum(na.fill(z, fill = 0)) > z$regimen <- ave(1:nrow(z), sub(".* ", "", time(z)), FUN = seq_along) > > # convert z to data frame for final processing > > id <- sub(".* ", "", time(z)) > start_date = as.Date(sub(" .*", "", time(z))) > d2 <- data.frame(id, start_date, coredata(z)) > > # from each group of rows in same id form the desired columns. > # f processes one such group. >
There was a bug in f. The last part above should have been: f <- function(x) with(x, data.frame( id = head(id, -1), start_date = head(start_date, -1), stop_date = start_date[-1]-1, head(cbind(regimen, NRTI, NNRTI, PI, all = NRTI + NNRTI + PI, HAART = as.numeric((NRTI >= 3 & NNRTI==0 & PI==0) | (NRTI >= 2 & (NNRTI >= 1 | PI >= 1)) | (NRTI == 1 & NNRTI >= 1 & PI >= 1))), -1))) result <- do.call(rbind, by(d2, id, f)) rownames(result) <- NULL result > result id start_date stop_date regimen NRTI NNRTI PI all HAART 1 1004 1995-07-24 1995-11-19 1 1 0 0 1 0 2 1004 1995-11-20 1995-12-09 2 2 0 0 2 0 3 1004 1995-12-10 1996-01-09 3 1 0 0 1 0 4 1004 1996-01-10 1996-05-08 4 2 0 0 2 0 5 1004 1996-05-09 1996-05-31 5 2 0 1 3 1 6 1004 1996-06-01 1996-06-30 6 3 0 1 4 1 7 1004 1996-07-01 1997-01-31 7 4 0 1 5 1 8 1004 1997-02-01 1997-02-28 8 3 0 1 4 1 9 1004 1997-03-01 1997-11-15 9 2 0 1 3 1 10 1004 1997-11-16 1999-01-04 10 2 0 0 2 0 11 9999 2003-01-02 2006-04-04 1 0 0 1 1 0 12 9999 2006-04-05 2009-07-07 2 0 1 1 2 0 13 9999 2009-07-08 <NA> 3 0 0 1 1 0 -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.