On Thu, 11 Jan 2007, Farrel Buchinsky wrote: > Am I correct in believing that one cannot match on multiple columns? > One can indeed subset on multiple criteria from different variables > (or columns) but not from unique combinations thereof. > I need to exclude about 10000 rows from 108000 rows of data based on > several unique combinations of identifiers in two columns. Only > merge() seems to be able to do that. Merge would allow me to > positively select but it would not allow me to deselect (or exclude). > Look at how I got around the problem. > It is inelegant. Have a missed a more direct function? >
I guess that depends on what you regard as 'more direct'. newdata[ !is.element( interaction(newdata), interaction(exclude) ) , ] > > x<-seq(from=1,to=10) > y <-rep(1:2,5) > data<-data.frame(cbind(x,y)) > x<-seq(from=1,to=10) > y <-rep(1:5,2) > data1<-data.frame(cbind(x,y)) > newdata<-rbind(data,data1) > newdata[10,2] <- 3 > exclude<-newdata[18:20,] > #This is a simulation of real life problem > #We now have two dataframes. Newdata is the data set from an > experiment. In real #life it has an additional 9 columns of data > #exclude is a dataframe that was manually created after discovering > some quality #control issues > #Any row in newdata that matches any row in exclude must be discarded > match(newdata$x,exclude$x)# useless because it is only on one column > match(newdata$y,exclude$y)# useless because it is only on one column > newdata$x %in% exclude$x # useless because it is only on one column > newdata$y %in% exclude$y # useless because it is only on one column > newdata$x %in% exclude$x & newdata$y %in% exclude$y # useless, > #eventhough it is using both columns, because it is not > #using them in a synchronous manner. Row 10 in new data should not > have been #marked "TRUE" > #It was only labeled such because the 10 in the x column is indeed in > the exclude x #column and the 3 in the y column is indeed in the > exclude y column but not #together > which(newdata$x %in% exclude$x & newdata$y %in% exclude$y)#also > gets it #wrong > match(newdata,exclude)# intuitively this could have worked but alas > match can #only handle vectors and not dataframes. It cannot match on > multiple columns > #I have to stoop to the inelegant maneuver of creating a combined > variable of the #two columns, albeit only temporarily > paste(newdata$x,newdata$y,sep=":") %in% paste(exclude$x,exclude$y,sep=":") > #or one could do this > match(paste(newdata$x,newdata$y,sep=":"),paste(exclude$x,exclude$y,sep=":")) > #or > {which(paste(newdata$x,newdata$y,sep=":") %in% paste > (exclude$x,exclude$y,sep=":"))} > #Or if you want to use the result in an index term or a selection > argument in a subset command > {paste(newdata$x,newdata$y,sep=":") %in% > paste(exclude$x,exclude$y,sep=":")==FALSE} > #As in > {newdata[paste(newdata$x,newdata$y,sep=":") %in% > paste(exclude$x,exclude$y,sep=":")==FALSE,]} > #or > {subset(newdata,paste(newdata$x,newdata$y,sep=":") %in% > paste(exclude$x,exclude$y,sep=":")==FALSE)} > > > -- > Farrel Buchinsky > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:[EMAIL PROTECTED] UC San Diego http://biostat.ucsd.edu/~cberry/ La Jolla, San Diego 92093-0901 ______________________________________________ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.