> -----Original Message----- > From: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] On Behalf Of Ali Salekfard > Sent: Friday, December 24, 2010 5:46 AM > To: r-help@r-project.org > Subject: [R] Removing rows with earlier dates > > Hi all, > > I'm new to the list but have benfited from it quite > extensively. Straight to > my rather strange question: > > I have a data frame that contains mapping rules in this way: > > ACCOUNT, RULE COLUMNS, Effective Date > > > The dataframe comes from a database that stores all dates. > What I would like > to do is to create a data frame with only the most recent > rule for each > account. In traditional programming languages I would loop > through each > account find the most recent rule(s) and fill up my updated > data frame.
Do you mean you would loop through the accounts and for each account loop through all rules for that account looking for the rule with the latest date? > > Does anyone have any better idea to use R's magic (Its syntax is still > magical to me) for this problem? I like to think of this sort of a problem as a one involving "runs" (sequences of identical data points). The following function identifies which data points are the last in a run: isLastInRun <- function (x, ...) { retval <- c(x[-1] != x[-length(x)], TRUE) for (y in list(...)) { stopifnot(length(x) == length(y)) retval <- retval | c(x[-1] != x[-length(x)], TRUE) } retval } E.g., > isLastInRun(rep(1:2,each=5), rep(1:3,c(3,3,4))) [1] FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE If your data is sorted by ACCOUNT with ties broken by `Effective Date` then you can get what I think you want with f0 <- function(data) { data[isLastInRun(data[,"ACCOUNT"]), ] } If it is not sorted then sort it first and then do the above with f1 <- function(data) { data <- data[order(data[,"ACCOUNT"], data[,"Effective Date"]), ] f0(data) } If it is important that the output records be in the same order as the input records then you can use f2 <- function(data) { o <- order(data[, "ACCOUNT"], data[, "Effective Date"]) tmp <- logical(length(o)) tmp[o] <- isLastInRun(data[o, "ACCOUNT"]) data[tmp,] } (It may be faster to sort the output of f1 rather that sorting the input, as f2 does, but f2's method is a bit simpler to write.) This approach is typically faster than tapply when there are a lot of small groups and runs less risk of corrupting the data when using unusual classes in the columns of your data.frame. Typical usage is > d <- data.frame(check.names=FALSE, + "Effective Date"=c(1004,1008,1004,1007,1005), + ACCOUNT= c( 11, 12, 11, 12, 11), + Rule= c( "A", "B", "C", "D", "E")) > f1(d) Effective Date ACCOUNT Rule 5 1005 11 E 2 1008 12 B > f2(d) Effective Date ACCOUNT Rule 2 1008 12 B 5 1005 11 E Did I interpret your question correctly? Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > > By the way the list of rules is quite extensive (144643 lines to be > precise), and there are usually 1-3 most recent rules (rows) for each > account. > > Thanks. > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.