Thanks Bill! Works great! Thanks again guys!

On Fri, Aug 10, 2012 at 2:43 PM, William Dunlap <wdun...@tibco.com> wrote:
> If you think about this as a runs problem you can get a loopless solution
> that I think is easier to read (once the requisite functions are defined).
>
> First define the function to canonicalize the name
>   nickname <- function(x) sub(" .*", "", x)
> then define some handy runs functions
>   isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
>   isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical
> then use those functions on your dataset
>   > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$YEAR)
>   > d[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
> See how it works with triplicates as well
>   > dd <- rbind(d, data.frame(ID=6:8,
>        NAME=c("Chicago Blacksox", "Chicago Cubs", "Chicago Whitesox"),
>        YEAR=1701:1703, SOURCE=rep("made up", 3)))
>   > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$YEAR)
>   > dd[ nearDup | isJustBefore(nearDup), ]
>     ID             NAME YEAR      SOURCE
>   1  1    New York Mets 1900        ESPN
>   2  2 New York Yankees 1920 Cooperstown
>   6  6 Chicago Blacksox 1701     made up
>   7  7     Chicago Cubs 1702     made up
>   8  8 Chicago Whitesox 1703     made up
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Rui Barradas
> > Sent: Friday, August 10, 2012 11:18 AM
> > To: Fred G
> > Cc: r-help
> > Subject: Re: [R] Regular Expressions + Matrices
> >
> > Hello,
> >
> > Try the following.
> >
> >
> > d <- read.table(textConnection("
> > ID NAME YEAR SOURCE
> > 1 'New York Mets' 1900 ESPN
> > 2 'New York Yankees' 1920 Cooperstown
> > 3 'Boston Redsox' 1918 ESPN
> > 4 'Washington Nationals' 2010 ESPN
> > 5 'Detroit Tigers' 1990 ESPN
> > "), header=TRUE)
> >
> > d$NAME <- as.character(d$NAME)
> >
> > fun <- function(i, x){
> >     if(x[i, "ID"] != x[i + 1, "ID"]){
> >         s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
> >         if(grepl(s, x[i + 1, "NAME"])) return(TRUE)
> >     }
> >     FALSE
> > }
> >
> > inx <- sapply(seq_len(nrow(d) - 1), fun, d)
> > inx <- c(inx, FALSE) | c(FALSE, inx)
> > d[inx, ]
> >
> > Hope this helps,
> >
> > Rui Barradas
> > On 10-08-2012 18:41, Fred G wrote:
> > > Hi all,
> > >
> > > My code looks like the following:
> > > inname = read.csv("ID_error_checker.csv", as.is=TRUE)
> > > outname = read.csv("output.csv", as.is=TRUE)
> > >
> > > #My algorithm is the following:
> > > #for line in inname
> > > #if first string up to whitespace in row in inname$name = first string up to whitespace in row + 1 in inname$name
> > > #AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row below it
> > > #copy these two lines to a new file
> > >
> > > In other words, if the name (up to the first whitespace) in the first row
> > > equals the name in the second row (etc for whole file) and the ID in the
> > > first row does not equal the ID in the second row, copy both of these rows
> > > in full to a new file. Only caveat is that I want a regular expression not
> > > to take the full names, but just the first string up to the first
> > > whitespace in the inname$name column (ie if row1 has a name of: New York
> > > Mets and row2 has a name of New York Yankees, I would want both of these
> > > rows to be copied in full since "New" is the same in both...)
> > >
> > > Here is some example data:
> > > ID  NAME                  YEAR  SOURCE       NOTES
> > > 1   New York Mets         1900  ESPN
> > > 2   New York Yankees      1920  Cooperstown
> > > 3   Boston Redsox         1918  ESPN
> > > 4   Washington Nationals  2010  ESPN
> > > 5   Detroit Tigers        1990  ESPN
> > >
> > > The desired output would be:
> > > ID  NAME                  YEAR  SOURCE
> > > 1   New York Mets         1900  ESPN
> > > 2   New York Yankees      1920  Cooperstown
> > >
> > > Thanks so much!
> > >
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
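
For anyone reading this thread later: the rule in Fred's original message compares the ID of adjacent rows, not the YEAR. Below is a minimal vectorized sketch of that rule as stated (same first word, different ID, keep both rows). It is only one way to write it, not the code anyone in the thread posted. The sample data is retyped from the thread with the empty NOTES column left out, and the names first_word, pair_start, keep and result are made up for illustration.

```r
# Sample data retyped from the thread (empty NOTES column omitted).
d <- data.frame(
  ID     = 1:5,
  NAME   = c("New York Mets", "New York Yankees", "Boston Redsox",
             "Washington Nationals", "Detroit Tigers"),
  YEAR   = c(1900, 1920, 1918, 2010, 1990),
  SOURCE = c("ESPN", "Cooperstown", "ESPN", "ESPN", "ESPN"),
  stringsAsFactors = FALSE
)

# First word of each name: drop everything from the first space onward.
first_word <- sub(" .*", "", d$NAME)

n <- nrow(d)

# Element i is TRUE when rows i and i + 1 share a first word
# and have different IDs (length n - 1).
pair_start <- first_word[-n] == first_word[-1] & d$ID[-n] != d$ID[-1]

# Keep both members of every matching pair: the row that starts
# the pair and the row right after it.
keep <- c(pair_start, FALSE) | c(FALSE, pair_start)

result <- d[keep, ]
print(result)
```

Because the comparison uses exact equality on the first word rather than grepl(), a name such as "Newark Bears" would not match "New York Mets" here. Whether that stricter behaviour is wanted depends on the real data.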