Oops! Spoke too soon. Your fix fixed the problem I was having before, but it turns out the test is now accepting every line. So there is still some problem with the logic or with my implementation of it.
I thought I should produce a reproducible example without 3 million lines of data. I made a version with only the geography information and test. Here is the code I am now using, applied to a file with only the first 8 lines of my geo data in it:

First I read the data in and print it out:

GEOshort.DF <- read.table("C:\\Users\\andrewH\\Documents\\Oakland Tech Project\\GEO_short.csv",
    header = FALSE, sep = ",", quote = "\"", dec = ".", skip=1,
    col.names= c("originalRow", "GEO_ID", "GEOGRAPHY"), fill = TRUE,
    colClasses="character")

Which yields:

> GEOshort.DF
  originalRow    GEO_ID     GEOGRAPHY
1           1   01000US United States
2        3115 04000US01       Alabama
3        5501 04000US02        Alaska
4        7924 04000US04       Arizona
5       10571 04000US05      Arkansas
6       14342 04000US06    California
7       17913 04000US08      Colorado
8       20442 04000US09   Connecticut

Then I try to select the rows that match my geo-codes:

GEOextract.DF <- GEOshort.DF[ any(GEOshort.DF$GEO_ID %in% c("01000US", "04000US06",
    "33000US488", "31000US41860", "31400US4186036084", "05000US06001",
    "E6000US0600153000")), ]

This produces:

> GEOextract.DF
  originalRow    GEO_ID     GEOGRAPHY
1           1   01000US United States
2        3115 04000US01       Alabama
3        5501 04000US02        Alaska
4        7924 04000US04       Arizona
5       10571 04000US05      Arkansas
6       14342 04000US06    California
7       17913 04000US08      Colorado
8       20442 04000US09   Connecticut

This is what I want it to produce:

> GEOextract.DF
  originalRow    GEO_ID     GEOGRAPHY
1           1   01000US United States
2       14342 04000US06    California

Sorry about the confusion, and thanks again for your kind attention.

Peace & joy, Andrew

"... But pattern-matching doesn't equal comprehension." --Peter Watts

On Sat, Apr 12, 2014 at 6:04 AM, Sarah Goslee <sarah.gos...@gmail.com> wrote:

> You need %in% instead.
>
> This is untested, but something like this should work:
>
> ECwork <- EC07_A1[ EC07_A1$GEO_ID %in% c("01000US", "04000US06", "33000US488",
>     "31000US41860", "31400US4186036084", "05000US06001", "E6000US0600153000") &
>     EC07_A1$SECTOR %in% c("32", "33", "42", "44", "45", "51", "54", "61", "71",
>     "81"), ]
>
> (Note that your original code snippet had a shortage of ) and didn't
> specify the data frame from which to take the columns.)
>
> Sarah
>
> On Sat, Apr 12, 2014 at 8:36 AM, Andrew Hoerner <ahoer...@rprogress.org> wrote:
> > Dear Folks--
> > I have a file with 3 million-odd rows of data from the 2007 U.S. Economic
> > Census. I am trying to pare it down to a subset of rows that both (1) has
> > any one of a vector of NAICS economic sector codes, and (2) also has any
> > one of a vector of geographic ID codes.
> >
> > Here is the code I am trying to use.
> >
> > ECwork <- EC07_A1[ any(GEO_ID == c("01000US", "04000US06", "33000US488",
> >     "31000US41860", "31400US4186036084" "05000US06001", "E6000US0600153000") &
> >     any(SECTOR == c("32", "33", "42", 44", 45", 51", 54", 61", "71", "81"), ]
> >
> > I get back the following error:
> >
> > Warning message:
> > In EC07_A1$SECTOR == c("32", "33", "42", "44", "45", "51", "54", :
> >   longer object length is not a multiple of shorter object length
> >
> > I see what R is doing. Instead of comparing each element of the column
> > SECTOR to the row vector of codes, and returning a logical vector of the
> > length of SECTOR with rows marked as TRUE that match any of the codes, it
> > is lining my code list up with SECTOR as a column vector and doing
> > element-by-element testing, and then recycling the code list over three
> > million rows. But I am not sure how to make it do what I want -- test the
> > sector code in each row against the vector of codes I am looking for. I
> > would be grateful if anyone could suggest an alternative that would
> > achieve my ends.
> >
> > Oh, and I would add, if there is a way of correctly doing this with
> > the extract function [], I would like to know what it is. If not, I guess
> > I'd like to know that too.
> >
> > Sincerely, Andrew Hoerner
> >
> > --
> > J. Andrew Hoerner
> > Director, Sustainable Economics Program
> > Redefining Progress
> > (510) 507-4820
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org
>

--
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
(510) 507-4820

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
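
A note on the 8-row example in this message: one likely reason every row comes back is the any() wrapper around the %in% test. %in% already returns one TRUE/FALSE per row, while any() collapses that vector to a single TRUE, which [ ] then recycles across all rows. The sketch below rebuilds the data by hand from the printed output rather than reading GEO_short.csv, and the names geo_codes, row_matches and all_rows are hypothetical helpers introduced only for illustration.

```r
# Rebuild the 8-row example by hand (values copied from the printed output above;
# the CSV file itself is not available here, so this is an assumption).
GEOshort.DF <- data.frame(
  originalRow = c("1", "3115", "5501", "7924", "10571", "14342", "17913", "20442"),
  GEO_ID      = c("01000US", "04000US01", "04000US02", "04000US04",
                  "04000US05", "04000US06", "04000US08", "04000US09"),
  GEOGRAPHY   = c("United States", "Alabama", "Alaska", "Arizona",
                  "Arkansas", "California", "Colorado", "Connecticut"),
  stringsAsFactors = FALSE
)

# Hypothetical helper name: the geo-codes to keep, as listed in the message.
geo_codes <- c("01000US", "04000US06", "33000US488", "31000US41860",
               "31400US4186036084", "05000US06001", "E6000US0600153000")

# %in% returns a logical vector with one element per row of the data frame.
row_matches <- GEOshort.DF$GEO_ID %in% geo_codes

# any() reduces that vector to a single TRUE; as a row index it is recycled,
# so every row is selected.
all_rows <- GEOshort.DF[any(row_matches), ]

# Dropping any() keeps only the rows whose GEO_ID is in geo_codes.
GEOextract.DF <- GEOshort.DF[row_matches, ]
print(GEOextract.DF)
```

The selected rows keep their original row names (1 and 6 here); if consecutive numbering is wanted, rownames(GEOextract.DF) <- NULL resets them. The same per-row logical vector can be combined with a second %in% test using &, as in the earlier suggestion for the full data set.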