Using charmatch partial matches of 10,000 5 leters words to the same list can be done in 10 seconds on my machine and 10,000 5 letter words to 100,000 10 letter words in 1 minute. Is that good enough? Try this simulation:
# generate N random words each k long rwords <- function(N, k) { L <- sample(letters, N*k, replace = TRUE) apply(matrix(L, k), 2, paste, collapse = "") } w1 <- rwords(1e5, 10) w2 <- rwords(1e4, 5) system.time(charmatch(w2, w2)) system.time(charmatch(w2, w1)) On Thu, Apr 29, 2010 at 4:05 AM, Richard R. Liu <richard....@pueo-owl.ch> wrote: > I have an application that a long list of character strings to determine which > occur at the beginning of a given word. A straight forward R script using > grep > takes a long time to run. Rewriting it to use substr and match might be an > option, but I have the impression that preparing the list as a trie and > performing trie searches might lead to dramatic improvements in performance. > > > I have searched the CRAN packages and find no packages that support Compact > Patricia Trees. Does anybody know of such? > > > Thanks, > Richard > > Richard R. Liu > richard....@pueo-owl.ch > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.