Thanks Dennis! I'll check this out. Just to clarify, I need the total number of switches/changes, regardless of whether that state has occurred before. So A-A-B-A would have 2 changes: A to B and B to A.
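
To make the A-A-B-A case concrete, here is a minimal sketch of the count I have in mind, built on rle(). The helper name count_switches is hypothetical, and the as.character() step is my own assumption in case mygroup is ever stored as a factor (rle() rejects factors):

```r
# Hypothetical helper: count transitions between consecutive values.
# A return to an earlier state still counts as a change.
count_switches <- function(x) {
  x <- as.character(x)                  # assumption: guard against factor input
  max(length(rle(x)$lengths) - 1L, 0L)  # number of runs minus one, never negative
}

count_switches(c("A", "A", "B", "A"))        # 2: A to B, then B to A
count_switches("A")                          # 0: a single observation
length(unique(c("A", "A", "B", "A"))) - 1    # 1: the unique()-based count misses the return to A
```

Something like tapply(myData$mygroup, myData$id, count_switches) should then give the per-id counts. It counts in row order, so this assumes the rows are already in the intended order within each id.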
Thanks again.

On Wed, Aug 24, 2011 at 1:28 PM, Dennis Murphy <djmu...@gmail.com> wrote:
> Hi Juliet:
>
> Here's a Q & D solution:
>
> # (1) plyr
>> f <- function(d) length(unique(d$mygroup)) - 1
>> ddply(myData, .(id), f)
>   id V1
> 1  1  0
> 2  2  2
> 3  3  1
> 4  4  0
>
> # (2) data.table
>
> myDT <- data.table(myData, key = 'id')
> myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id']
>
> If one can switch back and forth between levels more than once, then
> the above is clearly not appropriate. A more robust method would be to
> employ rle() [run length encoding]:
>
> g <- function(d) length(rle(d$mygroup)$lengths) - 1
> ddply(myData, .(id), g)    # gives the same answer as above
> myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id']   # ditto
>
>
> HTH,
> Dennis
>
> On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah <juliet.han...@gmail.com>
> wrote:
>> I have a data set with about 6 million rows and 50 columns. It is a
>> mixture of dates, factors, and numerics.
>>
>> What I am trying to accomplish can be seen with the following
>> simplified data, which is given as dput output below.
>>
>>> head(myData)
>>       mydate gender mygroup id
>> 1 2012-03-25      F       A  1
>> 2 2005-05-23      F       B  2
>> 3 2005-09-08      F       B  2
>> 4 2005-12-07      F       B  2
>> 5 2006-02-26      F       C  2
>> 6 2006-05-13      F       C  2
>>
>> For each id, I want to count the number of changes of the variable
>> 'mygroup' that occur. For example, id=1 has 0 changes because it is
>> observed only once. id=2 has 2 changes (B to C, and C to D). I also
>> need to calculate the total observation time for each id using the
>> variable mydate. In the end, I am trying to have a new data set in
>> which each row has an id, days observed, number of changes, and
>> gender.
>>
>> I made some simple summaries using data.table and plyr, but I'm stuck
>> on this reformatting.
>>
>> Thanks for your help.
>>
>> myData <- structure(list(mydate = c("2012-03-25", "2005-05-23", "2005-09-08",
>> "2005-12-07", "2006-02-26", "2006-05-13", "2006-09-01", "2006-12-12",
>> "2006-02-19", "2006-05-03", "2006-04-23", "2007-12-08", "2011-03-19",
>> "2007-12-20", "2008-06-15", "2008-12-16", "2009-06-07", "2009-10-09",
>> "2010-01-28", "2007-06-05"), gender = c("F", "F", "F", "F", "F",
>> "F", "F", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M",
>> "M", "M"), mygroup = c("A", "B", "B", "B", "C", "C", "C", "D",
>> "D", "D", "D", "D", "D", "A", "A", "A", "B", "B", "B", "A"),
>> id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>> 3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate", "gender",
>> "mygroup", "id"), class = "data.frame", row.names = c(NA, -20L
>> ))
>>
>>> sessionInfo()
>> R version 2.13.1 (2011-07-08)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>