On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdev...@gmail.com> wrote:
> Hi,
>
> I am using R's grep function to find patterns in vectors of strings. The
> number of patterns I would like to match is 7,700 (of different sizes). I
> noticed that I get an error message when I do the following:
>
> data <- array()
> for (j in 1:length(x))
> {
>   data[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
>                          value = TRUE))
> }
>
> When I break this up into 4 chunks of patterns it works:
>
> data <- list()
> for (j in 1:length(x))
> {
>   data$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
>                                 x[j], value = TRUE))
>   data$chunk2[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
>                                 x[j], value = TRUE))
>   data$chunk3[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
>                                 x[j], value = TRUE))
>   data$chunk4[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
>                                 x[j], value = TRUE))
> }
>
> My questions: what's the maximum size of the patterns argument in grep? Is
> there a way to do this faster? It is very slow.

Try strapplyc in gsubfn and see http://gsubfn.googlecode.com for more info.

# test data
x <- c("abcd", "z", "dbef")

# re is regexp with 7700 alternatives
# to test with
g <- expand.grid(letters, letters, letters)
gp <- do.call("paste0", g)
gp7700 <- head(gp, 7700)
re <- paste(gp7700, collapse = "|")

# grep gives error message
grep.out <- grep(re, x)

# strapplyc works
library(gsubfn)
which(sapply(strapplyc(x, re), length) > 0)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
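
For a base-R alternative, the chunking idea from the original question can be sketched as below. This is only a rough sketch under some assumptions: the helper name any_match_chunked and the chunk_size argument are made up for illustration, the patterns are treated as regular expressions, and whether a given chunk size avoids the error depends on the patterns and the R build. It relies on grepl being vectorised over x, so no loop over the elements of x is needed.

```r
# Hypothetical helper (not from any package): for each element of x, report
# whether at least one of the patterns matches it.
any_match_chunked <- function(patterns, x, chunk_size = 1000) {
  # Split the patterns into groups of at most chunk_size elements.
  chunks <- split(patterns, ceiling(seq_along(patterns) / chunk_size))
  hit <- logical(length(x))
  for (p in chunks) {
    # One alternation per chunk; grepl handles the whole vector x at once.
    hit <- hit | grepl(paste(p, collapse = "|"), x)
  }
  hit
}

# Small illustration with made-up patterns and a deliberately tiny chunk size.
x <- c("abcd", "z", "dbef")
patterns <- c("abc", "xyz", "bef", "qqq")
hits <- any_match_chunked(patterns, x, chunk_size = 2)
which(hits)
```

If the patterns are plain strings rather than regular expressions, looping over them one at a time with grepl(p, x, fixed = TRUE) may be worth timing as well.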