On Sep 8, 2009, at 12:39 AM, Steven Kang wrote:
Dear R users,
Suppose I have a data set with inconsistent names for a field, and I
want to replace them with one consistent name.
e.g.
"University of New Jersey", "New Jersey Uni", "New Jersey University"
(3 different inconsistent names) to "The University of New Jersey"
(consistent name)
Below is an arbitrary data set produced from "state.name" (a built-in
data set in R), along with the associated script.
d <- as.data.frame(c(state.name[30:40],
                     paste(state.name[30:40], "University", sep=" "),
                     paste("Th University of", state.name[30:40], sep=" "),
                     paste("University o", state.name[30:40], sep=" ")))
da <- sapply(d, as.character)   # factor to character transformation
spl <- strsplit(da, " ")        # splitting into components
dd <- character(dim(da)[1])     # initializing empty vector
for (i in 1:dim(da)[1]) {
  if (sum(c("New", "Jersey", "University") %in% spl[[i]]) >= 3)
    dd[i] <- "The University of New Jersey"
  else if (sum(c("New", "Mexico", "University") %in% spl[[i]]) >= 3)
    dd[i] <- "The University of New Mexico"
  else if (sum(c("New", "York") %in% spl[[i]]) >= 2)
    dd[i] <- "The University of New York"
  else if (sum(c("North", "Carolina") %in% spl[[i]]) >= 2)
    dd[i] <- "The University of North Carolina"
}
Note: above shows only partial (if/else if) conditions.
The if (cond) { } else { } construct is for program control rather
than revision of vectors. You should consider using the vectorised
<- ifelse(cond, val1, val2) construct.
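A minimal sketch of the ifelse() approach, assuming a small made-up
sample rather than the full data set. The helper name has_all is
hypothetical; it keeps the same whole-word %in% test as the loop, and
entries that match nothing are left as "".

```r
# Hypothetical sample values, not the full data set from the post
x <- c("New Jersey University", "University o New Mexico",
       "Th University of New York", "Ohio")
spl <- strsplit(x, " ")

# Hypothetical helper: one TRUE/FALSE per element of spl,
# TRUE when every word in `words` is present in that element
has_all <- function(words) {
  vapply(spl, function(w) all(words %in% w), logical(1))
}

# Nested ifelse() applies each condition to the whole vector at once;
# anything unmatched falls through to ""
dd <- ifelse(has_all(c("New", "Jersey", "University")),
             "The University of New Jersey",
      ifelse(has_all(c("New", "Mexico", "University")),
             "The University of New Mexico",
      ifelse(has_all(c("New", "York")),
             "The University of New York",
             "")))
```

The conditions are checked in the same order as the if / else if chain,
so the first match wins.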
Q1: The above "for" loop works fine (but is very slow on a large data
set), so I would like to explore whether there is an alternative
VECTORIZATION method that may speed up the process.
Q2: Also, is there another way to extract a string from a phrase
without using "%in%"?
Many grep-ish functions are available that are vectorised regular
expression "machines".
?grep will show quite a few.
e.g.
"ac" %in% unlist(strsplit("ac dc", " "))
[1] TRUE
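A rough sketch of the grepl() route, again on a made-up sample. The
patterns and the variable names is_nj and is_ny are assumptions chosen
for illustration; grepl() matches substrings, so it is somewhat looser
than the whole-word %in% test.

```r
# Hypothetical sample values, not the full data set from the post
x <- c("New Jersey University", "University o New Mexico",
       "Th University of New York", "Ohio")

# grepl() returns one TRUE/FALSE per element, so neither strsplit()
# nor an explicit loop is needed
is_nj <- grepl("New Jersey", x, fixed = TRUE) &
         grepl("University", x, fixed = TRUE)
is_ny <- grepl("New York", x, fixed = TRUE)

dd <- character(length(x))        # "" for anything unmatched
dd[is_nj] <- "The University of New Jersey"
dd[is_ny] <- "The University of New York"

# Whole-word check on a single string; \\b marks a word boundary
found <- grepl("\\bac\\b", "ac dc", perl = TRUE)
```

Logical indexing like dd[is_nj] <- ... avoids nesting and is easy to
extend with one line per target name.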
--
David Winsemius, MD
Heritage Laboratories
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.