I am using

> R.version.string # Vista
[1] "R version 2.8.1 Patched (2008-12-26 r47350)"

and it also caused R to actually crash for me.

On Mon, Jan 26, 2009 at 1:38 PM, Stavros Macrakis <macra...@alum.mit.edu> wrote:
> Given a vector of reference strings Ref and a vector of test strings
> Test, I would like to find elements of Test which do not contain
> elements of Ref as \b-delimited substrings.
>
> This can be done straightforwardly for length(Ref) < 6000 or so (R
> 2.8.1 Windows) by constructing a pattern like \b(a|b|c)\b, but not for
> larger Refs (see below). The easy workaround for this is to split Ref
> into smaller subsets and test each subset separately. Is there a
> better solution e.g. along the lines of fgrep? My real data have
> length(Ref) == 60000 or more.
>
> -s
>
> -----------------------------
>
> Example
>
> Test <- as.character(floor(runif(2000,1,20000))) # Real data is short phrases
>
> testing <- function(n) {
>   Ref <- as.character(1:n) # Real data is sentences
>   Pat <- paste('\\b(',paste(Ref,collapse="|"),')\\b',sep='')
>   grep(Pat,Test)
> }
>
> testing(2000) => no problem
>
> However, testing(10000) gives an error message (invalid regular
> expression) and a warning (memory exhausted), and testing(100000)
> crashes R (Process R exited abnormally with code 5).
>
> Using grep(...,perl=TRUE) as suggested in the man page also fails with
> testing(10000), though it gives a more helpful error message (regular
> expression is too large) without crashing the process.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
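
For what it is worth, the workaround mentioned in the quoted message (splitting Ref into smaller subsets and testing each subset separately) could be sketched roughly as below. This is only an illustration under some assumptions: the function name grep_chunked and the default chunk_size of 1000 are hypothetical choices of mine, not anything from the original post, and the sketch assumes the Ref strings contain no regular-expression metacharacters (real sentences would need escaping first).

```r
# Hypothetical helper (not from the original post): applies the
# \b(a|b|c)\b pattern to Ref in pieces so that no single regular
# expression grows too large.
# Assumption: elements of Ref contain no regex metacharacters.
grep_chunked <- function(Ref, Test, chunk_size = 1000) {
  # Assign each element of Ref to a group of at most chunk_size elements
  groups <- split(Ref, ceiling(seq_along(Ref) / chunk_size))
  hit <- logical(length(Test))
  for (g in groups) {
    pat <- paste('\\b(', paste(g, collapse = '|'), ')\\b', sep = '')
    # Mark every element of Test matched by this group's pattern
    hit[grep(pat, Test)] <- TRUE
  }
  which(hit)  # indices of Test that contain some element of Ref
}

# Usage, mirroring the quoted example
Test <- as.character(floor(runif(2000, 1, 20000)))
matched <- grep_chunked(as.character(1:10000), Test)
# Elements of Test that do NOT contain any element of Ref
unmatched <- setdiff(seq_along(Test), matched)
```

The chunk size is a tuning knob: it only needs to stay below whatever limit the regular-expression engine tolerates, which the quoted message puts somewhere under 6000 alternatives for R 2.8.1 on Windows.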